Agent State Machine

The authoritative state machine that governs agent behavior and prevents command conflicts.

Version 1.0

Overview

The agent state machine is authoritative: it lives in the agent and determines which actions are allowed at any given time. The collector tracks a shadow copy of this state, updated from heartbeats, for monitoring and multi-collector coordination.

Key Principles:

  • All state transitions must be explicit and validated
  • Invalid transitions are rejected (not silently ignored)
  • Each state defines what operations are permitted
  • State persists across restarts (stored in BoltDB)
  • State is reported to the collector in every heartbeat

State Diagram

+-----------+
|  STOPPED  |  Initial state / final state
+-----+-----+
      | Start()
      v
+-----------+
| STARTING  |  Loading config, initializing storage
+-----+-----+
      | initialized
      v
+-----------+     not enrolled      +-----------+
|   READY   |<-------------------->| ENROLLING |
| (offline) |     enrolled          +-----------+
+-----+-----+
      | Connect()
      v
+-----------+     connection lost   +-------------+
|CONNECTING |<-------------------->| DISCONNECTED|
+-----+-----+     reconnecting      +------+------+
      | connected                          | reconnect
      v                                    | timeout
+-----------+                              v
|   READY   |<---------------------- (back to CONNECTING)
| (online)  |
+-----+-----+
      | command received
      v
+-----------+
| EXECUTING |  Running a command
+-----+-----+
      | command complete
      v
+-----------+
|   READY   |
| (online)  |
+-----+-----+
      | drain notice / shutdown signal
      v
+-----------+
| DRAINING  |  Finishing current work, refusing new commands
+-----+-----+
      | drained
      v
+-----------+
|  STOPPED  |
+-----------+

State Definitions

State         Description                    Allowed Operations
STOPPED       Agent not running              Start
STARTING      Initializing                   None (internal)
ENROLLING     Requesting enrollment          Cancel
READY         Idle, waiting for work         Connect, ExecuteCommand, Drain, Stop
CONNECTING    Establishing connection        Cancel, Stop
DISCONNECTED  Connection lost, will retry    Reconnect, Stop
EXECUTING     Running a command              (command-specific), Cancel
DRAINING      Graceful shutdown in progress  None (wait for completion)
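
The per-state permissions above can be expressed as a lookup table. The following is a minimal sketch, not the agent's actual code: the `Operation` type, the `Op...` constants, and `CheckOperation` are hypothetical names, and the command-specific operations of EXECUTING are left out because the table does not enumerate them.

```go
package main

import "fmt"

// State and Operation are hypothetical string types used for illustration.
type State string
type Operation string

const (
	STOPPED      State = "STOPPED"
	STARTING     State = "STARTING"
	ENROLLING    State = "ENROLLING"
	READY        State = "READY"
	CONNECTING   State = "CONNECTING"
	DISCONNECTED State = "DISCONNECTED"
	EXECUTING    State = "EXECUTING"
	DRAINING     State = "DRAINING"
)

const (
	OpStart          Operation = "Start"
	OpCancel         Operation = "Cancel"
	OpConnect        Operation = "Connect"
	OpExecuteCommand Operation = "ExecuteCommand"
	OpDrain          Operation = "Drain"
	OpStop           Operation = "Stop"
	OpReconnect      Operation = "Reconnect"
)

// allowedOps mirrors the State Definitions table.
var allowedOps = map[State][]Operation{
	STOPPED:      {OpStart},
	STARTING:     {}, // internal only
	ENROLLING:    {OpCancel},
	READY:        {OpConnect, OpExecuteCommand, OpDrain, OpStop},
	CONNECTING:   {OpCancel, OpStop},
	DISCONNECTED: {OpReconnect, OpStop},
	EXECUTING:    {OpCancel}, // command-specific operations omitted
	DRAINING:     {},         // wait for completion
}

// CheckOperation returns nil if op is permitted in state s, or an
// explicit error otherwise (rejections are never silent).
func CheckOperation(s State, op Operation) error {
	for _, allowed := range allowedOps[s] {
		if allowed == op {
			return nil
		}
	}
	return fmt.Errorf("operation %s not allowed in state %s", op, s)
}

func main() {
	fmt.Println(CheckOperation(READY, OpDrain))
	fmt.Println(CheckOperation(DRAINING, OpExecuteCommand))
}
```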

State Timeouts

States have timeouts to prevent the agent from being stuck:

State       Default Timeout  On Timeout
STARTING    30s              STOPPED (fatal)
ENROLLING   5min             READY (retry later)
CONNECTING  30s              DISCONNECTED
EXECUTING   per-command      READY (report timeout)
DRAINING    60s              STOPPED (force kill)
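
One way to picture the timeout table is as a map from state to a (duration, fallback state) pair. This is a sketch under assumptions: `timeoutRule` and `CheckTimeout` are hypothetical names, times are Unix seconds, and EXECUTING is omitted because its timeout depends on the command.

```go
package main

import (
	"fmt"
	"time"
)

// State is a hypothetical string type used for illustration.
type State string

const (
	STOPPED      State = "STOPPED"
	STARTING     State = "STARTING"
	ENROLLING    State = "ENROLLING"
	READY        State = "READY"
	CONNECTING   State = "CONNECTING"
	DISCONNECTED State = "DISCONNECTED"
	DRAINING     State = "DRAINING"
)

// timeoutRule pairs a default timeout with the state entered when it expires.
type timeoutRule struct {
	After time.Duration
	Next  State
}

// stateTimeouts mirrors the State Timeouts table (EXECUTING is per-command).
var stateTimeouts = map[State]timeoutRule{
	STARTING:   {After: 30 * time.Second, Next: STOPPED},
	ENROLLING:  {After: 5 * time.Minute, Next: READY},
	CONNECTING: {After: 30 * time.Second, Next: DISCONNECTED},
	DRAINING:   {After: 60 * time.Second, Next: STOPPED},
}

// CheckTimeout reports the state the agent should be in at time now
// (Unix seconds), given that it entered s at stateSince. The boolean is
// true when the timeout has expired.
func CheckTimeout(s State, stateSince, now int64) (State, bool) {
	rule, ok := stateTimeouts[s]
	if !ok {
		return s, false // state has no fixed timeout
	}
	elapsed := time.Duration(now-stateSince) * time.Second
	if elapsed < rule.After {
		return s, false
	}
	return rule.Next, true
}

func main() {
	next, expired := CheckTimeout(CONNECTING, 1000, 1031)
	fmt.Println(next, expired) // DISCONNECTED true
}
```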

Command Timeout Classes

Commands specify a timeout class, not a fixed duration:

Class    Duration  Examples
instant  5s        ping, status
quick    15s       set_log_level, config
medium   30s       exec, manifest apply
human    none      manifest approval (cancellable)
system   none      reboot, shutdown (fire-and-forget)
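
Resolving a class to a duration could look like the sketch below. The names `TimeoutClass` and `TimeoutFor` are hypothetical, and rejecting unknown classes with an error is an assumption; the source only lists the five classes.

```go
package main

import (
	"fmt"
	"time"
)

// TimeoutClass is a hypothetical type for the class names in the table.
type TimeoutClass string

const (
	Instant TimeoutClass = "instant"
	Quick   TimeoutClass = "quick"
	Medium  TimeoutClass = "medium"
	Human   TimeoutClass = "human"
	System  TimeoutClass = "system"
)

// classDurations mirrors the table; zero means "no timeout".
var classDurations = map[TimeoutClass]time.Duration{
	Instant: 5 * time.Second,
	Quick:   15 * time.Second,
	Medium:  30 * time.Second,
	Human:   0,
	System:  0,
}

// TimeoutFor returns the duration for a class and whether the class is
// bounded at all. Unknown classes are rejected (an assumption).
func TimeoutFor(class TimeoutClass) (time.Duration, bool, error) {
	d, ok := classDurations[class]
	if !ok {
		return 0, false, fmt.Errorf("unknown timeout class %q", class)
	}
	return d, d > 0, nil
}

func main() {
	d, bounded, err := TimeoutFor(Quick)
	fmt.Println(d, bounded, err) // 15s true <nil>
}
```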

Valid Transitions

var validTransitions = map[State][]State{
    STOPPED:      {STARTING},
    STARTING:     {READY, STOPPED},           // success or fatal error
    ENROLLING:    {READY, STOPPED},           // enrolled or rejected
    READY:        {CONNECTING, ENROLLING, EXECUTING, DRAINING, STOPPED},
    CONNECTING:   {READY, DISCONNECTED, STOPPED},
    DISCONNECTED: {CONNECTING, STOPPED},
    EXECUTING:    {READY, DRAINING, STOPPED}, // complete, drain, or error
    DRAINING:     {STOPPED},
}
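
A transition check built on this map might look like the following sketch. The `State` type definition, the `Machine` wrapper, and the `CanTransition` and `Transition` names are assumptions for illustration; the source shows only the map.

```go
package main

import "fmt"

// State is assumed to be a string type; the source does not show its definition.
type State string

const (
	STOPPED      State = "STOPPED"
	STARTING     State = "STARTING"
	ENROLLING    State = "ENROLLING"
	READY        State = "READY"
	CONNECTING   State = "CONNECTING"
	DISCONNECTED State = "DISCONNECTED"
	EXECUTING    State = "EXECUTING"
	DRAINING     State = "DRAINING"
)

var validTransitions = map[State][]State{
	STOPPED:      {STARTING},
	STARTING:     {READY, STOPPED},
	ENROLLING:    {READY, STOPPED},
	READY:        {CONNECTING, ENROLLING, EXECUTING, DRAINING, STOPPED},
	CONNECTING:   {READY, DISCONNECTED, STOPPED},
	DISCONNECTED: {CONNECTING, STOPPED},
	EXECUTING:    {READY, DRAINING, STOPPED},
	DRAINING:     {STOPPED},
}

// CanTransition reports whether from -> to is listed in validTransitions.
func CanTransition(from, to State) bool {
	for _, s := range validTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

// Machine is a hypothetical holder for the current state.
type Machine struct {
	current State
}

// Transition moves to the target state, or returns an error and leaves
// the current state unchanged when the transition is invalid.
func (m *Machine) Transition(to State) error {
	if !CanTransition(m.current, to) {
		return fmt.Errorf("invalid transition %s -> %s", m.current, to)
	}
	m.current = to
	return nil
}

func main() {
	m := &Machine{current: STOPPED}
	fmt.Println(m.Transition(STARTING))  // <nil>
	fmt.Println(m.Transition(EXECUTING)) // invalid transition STARTING -> EXECUTING
}
```

Returning an error instead of ignoring the request matches the principle that invalid transitions are rejected rather than silently dropped.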

Connection Status vs State

These are separate concepts:

Concept            Tracked By             Values                            Purpose
State              Agent (authoritative)  READY, EXECUTING, DRAINING, etc.  What agent is doing
Connection Status  Collector              ONLINE, OFFLINE, UNRESPONSIVE     Can collector reach agent?

The collector derives connection status as follows:

  • ONLINE = received heartbeat within timeout
  • OFFLINE = no connection
  • UNRESPONSIVE = connected but no recent heartbeat
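
The three rules can be sketched as a single function on the collector side. This is illustrative only: `DeriveStatus` and its parameters are hypothetical, times are Unix seconds, the heartbeat timeout value is not given in this document, and checking the connection before the heartbeat age is an assumption about precedence.

```go
package main

import "fmt"

// ConnectionStatus is a hypothetical type for the collector-side values.
type ConnectionStatus string

const (
	ONLINE       ConnectionStatus = "ONLINE"
	OFFLINE      ConnectionStatus = "OFFLINE"
	UNRESPONSIVE ConnectionStatus = "UNRESPONSIVE"
)

// DeriveStatus applies the three rules in order: no connection means
// OFFLINE; a connection with a recent heartbeat means ONLINE; a
// connection without one means UNRESPONSIVE.
func DeriveStatus(connected bool, lastHeartbeat, now, timeoutSeconds int64) ConnectionStatus {
	if !connected {
		return OFFLINE
	}
	if now-lastHeartbeat <= timeoutSeconds {
		return ONLINE
	}
	return UNRESPONSIVE
}

func main() {
	fmt.Println(DeriveStatus(true, 1000, 1020, 30))  // ONLINE
	fmt.Println(DeriveStatus(true, 1000, 1100, 30))  // UNRESPONSIVE
	fmt.Println(DeriveStatus(false, 1000, 1020, 30)) // OFFLINE
}
```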

State Reporting via Heartbeat

Every heartbeat includes the current state and its timeout information:

message Heartbeat {
    string state = 1;              // READY, EXECUTING, etc.
    string state_detail = 2;       // e.g., "executing: info.inventory.hardware"
    int64 state_since = 3;         // Unix timestamp of last state change
    int64 state_timeout_at = 4;    // When current state will timeout (0 = no timeout)
    string manifest_id = 5;
    repeated string capabilities = 6;
    int64 timestamp = 7;
}
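
To show how the state fields relate, here is a sketch that fills them in. The `Heartbeat` struct is a hand-written stand-in for the generated protobuf type, `BuildHeartbeat` is a hypothetical helper, and the 30-second timeout in the usage is an illustrative value, not a documented one.

```go
package main

import "fmt"

// Heartbeat is a hand-written stand-in for the generated protobuf type.
type Heartbeat struct {
	State          string
	StateDetail    string
	StateSince     int64 // Unix timestamp of last state change
	StateTimeoutAt int64 // 0 = no timeout
	ManifestID     string
	Capabilities   []string
	Timestamp      int64
}

// BuildHeartbeat fills in the state-related fields. A timeoutSeconds of
// zero or less means the state has no timeout, so StateTimeoutAt stays 0.
func BuildHeartbeat(state, detail string, stateSince, timeoutSeconds, now int64) Heartbeat {
	hb := Heartbeat{
		State:       state,
		StateDetail: detail,
		StateSince:  stateSince,
		Timestamp:   now,
	}
	if timeoutSeconds > 0 {
		hb.StateTimeoutAt = stateSince + timeoutSeconds
	}
	return hb
}

func main() {
	hb := BuildHeartbeat("EXECUTING", "executing: info.inventory.hardware", 1700000000, 30, 1700000010)
	fmt.Println(hb.State, hb.StateSince, hb.StateTimeoutAt) // EXECUTING 1700000000 1700000030
}
```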

Command Execution Rules

Commands are accepted only when the agent is in the READY state; in any other state they are rejected.

Execution Flow:

1. Receive command from collector
2. Validate state == READY
3. Transition to EXECUTING
4. Execute command (may take time)
5. On completion: transition to READY, send response
6. On error: transition to READY, send error response
7. On drain during execution: finish command, then drain
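
The flow above can be sketched with a mutex-guarded state field. The `Agent` type, `Execute`, `RequestDrain`, and `ErrNotReady` are hypothetical names; sending the response to the collector is left out, and the caller is assumed to report the returned error.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type State string

const (
	READY     State = "READY"
	EXECUTING State = "EXECUTING"
	DRAINING  State = "DRAINING"
)

// ErrNotReady is returned when a command arrives outside READY.
var ErrNotReady = errors.New("agent is not READY")

// Agent is a hypothetical holder for the state machine.
type Agent struct {
	mu             sync.Mutex
	state          State
	drainRequested bool
}

// Execute runs one command following steps 2 to 7 of the flow.
func (a *Agent) Execute(run func() error) error {
	a.mu.Lock()
	if a.state != READY { // step 2: validate state
		cur := a.state
		a.mu.Unlock()
		return fmt.Errorf("%w (state %s)", ErrNotReady, cur)
	}
	a.state = EXECUTING // step 3
	a.mu.Unlock()

	err := run() // step 4: may take time, so the lock is not held

	a.mu.Lock()
	defer a.mu.Unlock()
	if a.drainRequested {
		a.state = DRAINING // step 7: finish the command, then drain
	} else {
		a.state = READY // steps 5 and 6: back to READY on success or error
	}
	return err
}

// RequestDrain drains immediately when idle, or defers until the
// running command finishes.
func (a *Agent) RequestDrain() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.drainRequested = true
	if a.state == READY {
		a.state = DRAINING
	}
}

func main() {
	a := &Agent{state: READY}
	err := a.Execute(func() error {
		// A second command arriving mid-execution is rejected.
		fmt.Println(a.Execute(func() error { return nil }))
		return nil
	})
	fmt.Println(err, a.state)
}
```

Releasing the lock while the command runs lets a concurrent command or drain request be evaluated against the EXECUTING state instead of blocking behind it.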

State Persistence

State is persisted to BoltDB to survive restarts:

type PersistedState struct {
    State           State     // Current state
    LastTransition  time.Time // When state changed
    PendingCommand  *Command  // If EXECUTING, what command
    DrainReason     string    // If DRAINING, why
}

On restart:

  1. Load persisted state
  2. If the state was EXECUTING → check whether the command was atomic, then resume it or mark it failed
  3. If the state was DRAINING → continue draining
  4. Otherwise → transition to READY
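
The restart steps can be sketched as a pure decision function. Everything beyond `PersistedState` is hypothetical: the `Command` fields, the `Recovery` result type, and the `Recover` name. Because this document does not say how atomicity decides between resuming and failing, that check is passed in as a caller-supplied `resumable` predicate rather than hard-coded.

```go
package main

import (
	"fmt"
	"time"
)

type State string

const (
	READY     State = "READY"
	EXECUTING State = "EXECUTING"
	DRAINING  State = "DRAINING"
)

// Command is a hypothetical stand-in; the real type is not shown in the source.
type Command struct {
	Name string
}

type PersistedState struct {
	State          State
	LastTransition time.Time
	PendingCommand *Command
	DrainReason    string
}

// Recovery describes what the agent should do after loading persisted state.
type Recovery struct {
	Next   State
	Resume *Command // non-nil: continue this command
	Failed *Command // non-nil: report this command as failed
}

// Recover applies restart steps 2 to 4. The resumable predicate stands in
// for the atomicity check, whose exact rule is an assumption left to the caller.
func Recover(p PersistedState, resumable func(*Command) bool) Recovery {
	switch p.State {
	case EXECUTING:
		if p.PendingCommand != nil && resumable(p.PendingCommand) {
			return Recovery{Next: EXECUTING, Resume: p.PendingCommand}
		}
		return Recovery{Next: READY, Failed: p.PendingCommand}
	case DRAINING:
		return Recovery{Next: DRAINING}
	default:
		return Recovery{Next: READY}
	}
}

func main() {
	p := PersistedState{State: EXECUTING, PendingCommand: &Command{Name: "exec"}}
	r := Recover(p, func(*Command) bool { return false })
	fmt.Println(r.Next, r.Failed.Name) // READY exec
}
```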

Example Scenario

Timeline:
---------------------------------------------------------------------------

T0: Agent in READY state (online)
    +-- Admin A: "Run inventory scan" -> ACCEPTED
    +-- Agent -> EXECUTING state

T1: Executing... (30% complete)
    +-- Admin B: "Restart agent" -> REJECTED
    |   +-- Reason: "Agent is executing a command"
    +-- Collector shows: "Cannot restart during execution"

T2: Executing... (100% complete)
    +-- Agent -> READY state

T3: Agent in READY state
    +-- Admin B: "Restart agent" -> ACCEPTED

---------------------------------------------------------------------------
Result: No interrupted commands. Clear feedback to Admin B.