Service VM Orchestration and Management

Astara Orchestrator

astara-orchestrator is a multi-process, multithreaded Python service composed of three primary subsystems, each of which is spawned as a subprocess of the main astara-orchestrator process:

L3 and DHCP Event Consumption

astara.notifications uses kombu and a Python multiprocessing.Queue to listen for specific Neutron service events (e.g., router.interface.create, subnet.create.end, port.create.end, port.delete.end) and normalize them into one of several event types:

  • CREATE - a router creation was requested
  • UPDATE - services on a router need to be reconfigured
  • DELETE - a router was deleted
  • POLL - used by the health monitor for checking aliveness of a Service VM
  • REBUILD - a Service VM should be destroyed and recreated
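
The normalization step can be pictured as a lookup from the raw Neutron event name to one of the types above. The following is a minimal sketch only: the names `normalize`, `EXACT_MATCHES`, and `UPDATE_PREFIXES` are hypothetical, and the specific mapping (router creation and deletion map directly, any interface, subnet, or port change maps to UPDATE) is an assumption made for illustration, not the actual table used by astara.notifications.

```python
# Normalized event types named in the list above.
CREATE, UPDATE, DELETE, POLL, REBUILD = "create", "update", "delete", "poll", "rebuild"

# Hypothetical lookup for illustration; the real mapping lives in
# astara.notifications and may differ.
EXACT_MATCHES = {
    "router.create.end": CREATE,
    "router.delete.end": DELETE,
}

# Assumed: any interface, subnet, or port change means
# "reconfigure the services on the router".
UPDATE_PREFIXES = ("router.interface.", "subnet.", "port.")


def normalize(event_type):
    """Return a normalized event type, or None if the event is ignored."""
    if event_type in EXACT_MATCHES:
        return EXACT_MATCHES[event_type]
    if event_type.startswith(UPDATE_PREFIXES):
        return UPDATE
    return None
```

Under these assumptions, `normalize("subnet.create.end")` yields UPDATE, while an unrelated event name yields None and would be dropped.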

As events are normalized and shuttled onto the multiprocessing.Queue, astara.scheduler shards (by Tenant ID, by default) and distributes them amongst a pool of worker processes it manages.
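
Sharding by tenant ID means that every event for a given tenant lands on the same worker, so events for one tenant's routers are never processed by two workers at once. A minimal sketch of that idea follows; the helper names `pick_worker` and `distribute` are hypothetical, and hashing the tenant ID with SHA-256 is an assumption chosen for a stable result, not necessarily what astara.scheduler does.

```python
import hashlib


def pick_worker(tenant_id, num_workers):
    """Map a tenant ID to a stable worker index in [0, num_workers)."""
    # A cryptographic digest is used here only because it is stable
    # across processes, unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


def distribute(events, num_workers):
    """Group (tenant_id, event) pairs into one list per worker."""
    buckets = [[] for _ in range(num_workers)]
    for tenant_id, event in events:
        buckets[pick_worker(tenant_id, num_workers)].append(event)
    return buckets
```

Because the index depends only on the tenant ID and the pool size, events for one tenant also keep their relative order within that worker's list.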

This system also consumes and distributes special astara.command events, which are published by the astara-ctl operator tools.

State Machine Workers and Router Lifecycle

Each multithreaded worker process manages a pool of state machines (one per virtual router), each of which represents the lifecycle of an individual router. As the scheduler distributes events for a specific router, logic in the worker (dependent on the router’s current state) determines which action to take next:

digraph sample_boot {
  node [shape = square];
  AMQP;
  "Event Processing + Scheduler";
  Nova;
  Neutron;

  node [shape = circle];

  AMQP -> "Event Processing + Scheduler";
  subgraph clusterrug {
      "Event Processing + Scheduler" -> "Worker 1";
      "Event Processing + Scheduler" -> "Worker ...";
      "Event Processing + Scheduler" -> "Worker N";

      "Worker 1" -> "Thread 1"
      "Worker 1" -> "Thread ..."
      "Worker 1" -> "Thread N"
  }

  "Thread 1" -> "Service VM 1";
  "Thread 1" -> "Service VM ..." [ label = "Appliance REST API" ];
  "Thread 1" -> "Service VM N";

  "Thread 1" -> "Nova" [ label = "Nova API" ];
  "Thread 1" -> "Neutron" [ label = "Neutron API" ];
}

For example, let’s say a user created a new Neutron network, subnet, and router. In this scenario, a router-interface-create event would be handled by the appropriate worker (based on tenant ID), and a transition through the state machine might look something like this:

digraph sample_boot {
  rankdir=LR;

  node [shape = doublecircle];
  CalcAction;

  node [shape = circle];

  CalcAction -> Alive;
  Alive -> CreateVM;
  CreateVM -> CheckBoot;
  CheckBoot -> CheckBoot;
  CheckBoot -> ConfigureVM;
}
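
The sample transition above can also be read as a small loop that records each state it visits. This is an illustrative sketch only: `walk_sample_boot` and its `polls_until_up` parameter are hypothetical, and the sketch hard-codes the single path shown in the diagram (a router with no Service VM yet) rather than the full set of transitions.

```python
def walk_sample_boot(polls_until_up):
    """Return the states visited while booting a router's first Service VM.

    polls_until_up is the number of CheckBoot passes needed before the
    VM responds (a stand-in for real aliveness checks).
    """
    visited = []
    state = "CalcAction"
    remaining = polls_until_up
    while state is not None:
        visited.append(state)
        if state == "CalcAction":
            state = "Alive"
        elif state == "Alive":
            state = "CreateVM"  # nothing answers yet, so boot a VM
        elif state == "CreateVM":
            state = "CheckBoot"
        elif state == "CheckBoot":
            remaining -= 1
            # CheckBoot loops back to itself until the VM is responsive.
            state = "CheckBoot" if remaining > 0 else "ConfigureVM"
        else:  # ConfigureVM is the last step on this path
            state = None
    return visited
```

For instance, `walk_sample_boot(2)` visits CheckBoot twice before reaching ConfigureVM, which corresponds to the self-loop on CheckBoot in the diagram.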

State Machine Flow

The supported states in the state machine are:

CalcAction: The entry point of the state machine. Depending on the current status of the Service VM (e.g., ACTIVE, BUILD, SHUTDOWN) and the current event, it determines the first step in the state machine to transition to.
Alive: Check aliveness of the Service VM by attempting to communicate with it via its REST HTTP API.
CreateVM: Call nova boot to boot a new Service VM. This will attempt to boot a Service VM up to a (configurable) number of times before placing the router into ERROR state.
CheckBoot: Repeatedly check aliveness of the router (for up to a configurable number of seconds) until the VM is responsive and ready for initial configuration.
ConfigureVM: Configure the Service VM and its services. This is generally the final step in the process of booting and configuring a router. This step communicates with the Neutron API to generate a comprehensive network configuration for the router (which is pushed to the router via its REST API). On success, the state machine yields control back to the worker thread and that thread handles the next event in its queue (likely for a different Service VM and its state machine).
ReplugVM: Attempt to hot-plug/unplug a network from the router via nova interface-attach or nova interface-detach.
StopVM: Terminate a running Service VM. This is generally performed when a Neutron router is deleted or via explicit operator tools.
ClearError: After a (configurable) number of nova boot failures, Neutron routers are automatically transitioned into a cool-down ERROR state (so that astara will not continue to boot them forever; this is to prevent further exacerbation of failing hypervisors). This state transition is used to add routers back into management after issues are resolved and to signal to astara-orchestrator that it should attempt to manage them again.
STATS: Reads traffic data from the router.
CONFIG: Configures the VM and its services.
EXIT: Processing stops.
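
The bounded retry described for CreateVM can be sketched as follows. This is a simplified illustration under stated assumptions: `create_vm` is a hypothetical helper, `boot` is a hypothetical callable standing in for the nova boot request, and the returned status strings are placeholders rather than the values astara-orchestrator uses internally.

```python
def create_vm(boot, max_attempts):
    """Call boot() until it succeeds or max_attempts is reached.

    Returns a (status, attempts_used) tuple. `boot` is a hypothetical
    callable standing in for the nova boot request.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            boot()
        except Exception:
            continue  # boot failed; try again if attempts remain
        return "BOOTING", attempt
    # Every attempt failed: the router is parked in ERROR so that it is
    # not booted forever against a failing hypervisor.
    return "ERROR", max_attempts
```

A router that ends up in ERROR here would stay there until something like the ClearError transition returns it to management.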

ACT(ion) Variables are:

Create: Create router was requested.
Read: Read router traffic stats.
Update: Update router configuration.
Delete: Delete router.
Poll: Poll router alive status.
rEbuild: Recreate a router from scratch.

VM Variables are:

Down: VM is known to be down.
Booting: VM is booting.
Up: VM is known to be up (pingable).
Configured: VM is known to be configured.
Restart Needed: VM needs to be rebooted.
Hotplug Needed: VM needs to be replugged.
Gone: The router definition has been removed from Neutron.
Error: The router has been rebooted too many times, or has had some other error.
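
The edge labels in the diagram below appear to abbreviate each variable by its capitalized letter (for example, E for rEbuild and H for Hotplug Needed). The sketch below encodes that reading as two enumerations; the class names `Action` and `VMState` and the helper `expand` are hypothetical, and the letter assignments are inferred from the lists above rather than taken from the astara source.

```python
from enum import Enum


class Action(Enum):
    """ACT variables, keyed by the letter used in the edge labels."""
    CREATE = "C"
    READ = "R"
    UPDATE = "U"
    DELETE = "D"
    POLL = "P"
    REBUILD = "E"


class VMState(Enum):
    """VM variables, keyed by the letter used in the edge labels."""
    DOWN = "D"
    BOOTING = "B"
    UP = "U"
    CONFIGURED = "C"
    RESTART_NEEDED = "R"
    HOTPLUG_NEEDED = "H"
    GONE = "G"
    ERROR = "E"


def expand(letters, enum_cls):
    """Expand a bracketed set such as "[CRUP]" into enum members."""
    return [enum_cls(ch) for ch in letters.strip("[]")]
```

With this reading, a label fragment such as `vm:[UC]` refers to a VM that is either Up or Configured.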

digraph rug {
  // rankdir=LR;

  node [shape = rectangle];
  START;

  // These nodes enter and exit the state machine.

  node [shape = doublecircle];
  EXIT;
  CALC_ACTION;

  node [shape = circle];

  START -> CALC_ACTION;

  CALC_ACTION -> ALIVE [ label = "ACT>[CRUP],vm:[UC]" ];
  CALC_ACTION -> CREATE_VM [ label = "ACT>[CRUP],vm:D" ];
  CALC_ACTION -> CHECK_BOOT [ label = "ACT>[CRUP],vm:B" ];
  CALC_ACTION -> REBUILD_VM [ label = "ACT:E" ];
  CALC_ACTION -> STOP_VM [ label = "ACT>D or vm:G" ];
  CALC_ACTION -> CLEAR_ERROR [ label = "vm:E" ];

  ALIVE -> CREATE_VM [ label = "vm>D" ];
  ALIVE -> CONFIG [ label = "ACT:[CU],vm:[UC]" ];
  ALIVE -> STATS [ label = "ACT:R,vm:C" ];
  ALIVE -> CALC_ACTION [ label = "ACT:P,vm>[UC]" ];
  ALIVE -> STOP_VM [ label = "vm:G" ];

  CREATE_VM -> CHECK_BOOT [ label = "ACT:[CRUDP],vm:[DBUCR]" ];
  CREATE_VM -> STOP_VM [ label = "vm:G" ];
  CREATE_VM -> CALC_ACTION [ label = "vm:E" ];
  CREATE_VM -> CREATE_VM [ label = "vm:D" ];

  CHECK_BOOT -> CONFIG [ label = "vm>U" ];
  CHECK_BOOT -> CALC_ACTION [ label = "vm:[BCR]" ];
  CHECK_BOOT -> STOP_VM [ label = "vm:[DG]" ];

  CONFIG -> STATS [ label = "ACT:R,vm>C" ];
  CONFIG -> CALC_ACTION [ label = "ACT>P,vm>C" ];
  CONFIG -> REPLUG_VM [ label = "vm>[H]" ];
  CONFIG -> STOP_VM [ label = "vm>[RDG]" ];

  REPLUG_VM -> CONFIG [ label = "vm>[H]" ];
  REPLUG_VM -> STOP_VM [ label = "vm>[R]" ];

  STATS -> CALC_ACTION [ label = "ACT>P" ];

  CLEAR_ERROR -> CALC_ACTION [ label = "no pause before next action" ];

  REBUILD_VM -> REBUILD_VM [ label = "vm!=[DG]" ];
  REBUILD_VM -> CREATE_VM [ label = "ACT:E,vm:D" ];

  STOP_VM -> CREATE_VM [ label = "ACT:E or vm>D" ];
  STOP_VM -> EXIT [ label = "ACT:D,vm>D or vm:G" ];

}

Health Monitoring

astara.health is a subprocess that periodically (at a configurable interval) delivers POLL events to every known virtual router. This event transitions the state machine into the Alive state, which, depending on the availability of the router, may simply exit the state machine (because the router’s status API replies with an HTTP 200) or transition to the CreateVM state (because the router is unresponsive and must be recreated).
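
The polling loop and the two outcomes of the Alive check can be sketched as below. This is an illustration under several assumptions: all function names and the event dictionary layout are hypothetical, and a thread is used in place of the subprocess described above purely to keep the example short.

```python
import queue
import threading


def deliver_polls(router_ids, event_queue):
    """Put one POLL event per known router on the queue."""
    for router_id in router_ids:
        event_queue.put({"router_id": router_id, "type": "POLL"})


def after_alive_check(http_status):
    """Pick the next step once the Alive state has queried the status API."""
    return "EXIT" if http_status == 200 else "CreateVM"


def run_health_monitor(router_ids, event_queue, interval, stop_event):
    """Deliver POLL events every `interval` seconds until stop_event is set."""
    while not stop_event.is_set():
        deliver_polls(router_ids, event_queue)
        # wait() returns early when stop_event is set, so shutdown is prompt.
        stop_event.wait(interval)


def start_health_monitor(router_ids, interval):
    """Start the monitor in a background thread (a subprocess in astara)."""
    event_queue = queue.Queue()
    stop_event = threading.Event()
    thread = threading.Thread(
        target=run_health_monitor,
        args=(router_ids, event_queue, interval, stop_event),
        daemon=True,
    )
    thread.start()
    return event_queue, stop_event, thread
```

A caller would read events from the returned queue and set the stop event to shut the monitor down.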

High Availability

Astara supports high-availability (HA) on both the control plane and data plane.

The astara-orchestrator service may be deployed in a configuration in which multiple service processes span nodes, providing load distribution and HA. For more information on clustering, see the install docs.

It also supports orchestrating pairs of virtual appliances to provide HA of the data path, allowing pairs of virtual routers to be clustered among themselves using VRRP and connection tracking. To enable this, create Neutron routers with the ha=True parameter, or set this property on existing routers and issue a rebuild command via astara-ctl for each such router.