Automation Patterns

Scaling, rolling upgrades, and troubleshooting workflows with MachineOperations.

This page shows common automation patterns using MachineOperations with kubectl and shell scripts.

Targeting a Single Machine

Most operations target a single machine with spec.machineRef:

apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: reboot-worker-01
spec:
  machineRef: worker-01
  operationKind: HostReboot
  ttlSecondsAfterFinished: 3600

Targeting Multiple Machines with machineSelector

Agent-handled operations (NodeReboot, AgentUpgrade, AgentReset) support spec.machineSelector to target machines by label. Each agent independently checks whether its machine matches the selector and executes the operation if it does. Metalman-managed bare-metal host operations also support selectors, provided the selector includes the label unbounded-cloud.io/site=<site> so that it is scoped to a single metalman site.

apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: reboot-gpu-nodes
spec:
  machineSelector:
    matchLabels:
      role: gpu-worker
  operationKind: NodeReboot

kubectl apply -f reboot-gpu-nodes.yaml
kubectl get mop reboot-gpu-nodes -w

Note: Cloud VM host operations still require one operation per machine. Bare-metal host selectors are handled only by the metalman instance for the selected site.

Batch Host Operations

For cloud VM host operations, create one MachineOperation per machine:

# Reboot all machines in a list
for machine in worker-01 worker-02 worker-03; do
  cat <<EOF | kubectl apply -f -
apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: reboot-${machine}
spec:
  machineRef: ${machine}
  operationKind: HostReboot
  ttlSecondsAfterFinished: 3600
EOF
done

# Watch all operations
kubectl get mop -w
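
The single-machine manifests on this page differ only in name, machine, operation kind, and TTL, so the heredoc can be factored into a small function. This is a sketch only: render_mop is a hypothetical helper name, and the 3600-second default simply mirrors the examples above.

```shell
# render_mop NAME MACHINE KIND [TTL_SECONDS]
# Prints a single-machine MachineOperation manifest to stdout.
# (Hypothetical helper for illustration; not part of any shipped tooling.)
render_mop() {
  name=$1 machine=$2 kind=$3 ttl=${4:-3600}
  cat <<EOF
apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: ${name}
spec:
  machineRef: ${machine}
  operationKind: ${kind}
  ttlSecondsAfterFinished: ${ttl}
EOF
}

# Usage: reboot every machine in a list
#   for machine in worker-01 worker-02 worker-03; do
#     render_mop "reboot-${machine}" "${machine}" HostReboot | kubectl apply -f -
#   done
```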

Scaling In and Out

Scale In (Power Off)

Power off machines during off-hours or when demand is low:

for machine in worker-01 worker-02 worker-03; do
  cat <<EOF | kubectl apply -f -
apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: poweroff-${machine}
spec:
  machineRef: ${machine}
  operationKind: HostPowerOff
  ttlSecondsAfterFinished: 3600
EOF
done

# Once the operations have completed, verify the machines are powered off
kubectl get machines worker-01 worker-02 worker-03

Scale Out (Power On)

Bring machines back online when capacity is needed:

for machine in worker-01 worker-02 worker-03; do
  cat <<EOF | kubectl apply -f -
apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: poweron-${machine}
spec:
  machineRef: ${machine}
  operationKind: HostPowerOn
  ttlSecondsAfterFinished: 3600
EOF
done

# Watch nodes rejoin
kubectl get nodes -w

Rolling Upgrades

Agent Upgrade

Upgrade agents across a fleet sequentially, waiting for each to complete before proceeding:

for machine in worker-01 worker-02 worker-03; do
  cat <<EOF | kubectl apply -f -
apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: upgrade-agent-${machine}
spec:
  machineRef: ${machine}
  operationKind: AgentUpgrade
  parameters:
    downloadURL: https://example.com/releases/unbounded-agent-v1.2.0-linux-amd64.tar.gz
  ttlSecondsAfterFinished: 7200
EOF

  echo "Waiting for ${machine}..."
  # Stop the rollout if this upgrade does not reach Complete in time
  kubectl wait mop "upgrade-agent-${machine}" \
    --for=jsonpath='{.status.phase}'=Complete \
    --timeout=120s || break
done

If any upgrade fails, the agent automatically rolls back to the last-known-good binary. Check failed operations before continuing:

kubectl get mop --field-selector status.phase=Failed
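
A kubectl wait on phase Complete only returns early on success; if the operation moves to Failed instead, the command keeps waiting until the timeout expires. One way to fail fast is to poll the phase directly. The sketch below assumes Complete and Failed are the terminal phase values, as used elsewhere on this page; wait_for_mop and its return codes are hypothetical.

```shell
# wait_for_mop NAME [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 when the operation reaches Complete, 1 on Failed, 2 on timeout.
# (Hypothetical helper for illustration.)
wait_for_mop() {
  mop_name=$1 timeout=${2:-120} interval=${3:-5}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Treat lookup errors as "no phase yet" and keep polling
    phase=$(kubectl get mop "$mop_name" -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
    case "$phase" in
      Complete) return 0 ;;
      Failed)   echo "${mop_name} failed" >&2; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "${mop_name} did not finish within ${timeout}s" >&2
  return 2
}

# Usage inside a rollout loop:
#   wait_for_mop "upgrade-agent-${machine}" 120 || break
```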

Alternatively, use machineSelector to upgrade all matching agents at once:

apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: upgrade-all-gpu-agents
spec:
  machineSelector:
    matchLabels:
      role: gpu-worker
  operationKind: AgentUpgrade
  parameters:
    downloadURL: https://example.com/releases/unbounded-agent-v1.2.0-linux-amd64.tar.gz

Node Upgrade via Recreation

To roll out a new Kubernetes version or rootfs configuration, cordon, drain, and delete Node objects one at a time and let the agent reconcile:

for node in worker-01 worker-02 worker-03; do
  echo "Recreating ${node}..."
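  kubectl cordon "${node}"
  # Do not delete a node that failed to drain
  kubectl drain "${node}" --ignore-daemonsets --delete-emptydir-data || break
  kubectl delete node "${node}"

  # kubectl wait errors out on a resource that does not exist, so first
  # poll until the agent re-registers the Node, then wait for Ready
  until kubectl get node "${node}" >/dev/null 2>&1; do sleep 5; done
  kubectl wait node "${node}" --for=condition=Ready --timeout=300s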
done

Drain must complete before deletion so kubelet, containerd, and the configured CNI can tear down workload pod state. If the cluster uses an eBPF CNI such as Cilium and requires destructive dataplane cleanup, run the CNI-specific cleanup procedure before deleting the Node.

Host Replacement

For full host reimaging, use HostReplace operations sequentially:

for machine in worker-01 worker-02 worker-03; do
  cat <<EOF | kubectl apply -f -
apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: replace-${machine}
spec:
  machineRef: ${machine}
  operationKind: HostReplace
  ttlSecondsAfterFinished: 7200
EOF

  echo "Waiting for ${machine} replacement..."
  # Stop the rollout if this replacement does not reach Complete in time
  kubectl wait mop "replace-${machine}" \
    --for=jsonpath='{.status.phase}'=Complete \
    --timeout=600s || break

  # Wait for the node to rejoin after replacement
  kubectl wait node "${machine}" --for=condition=Ready --timeout=300s
done

Troubleshooting Escalation

When a node is misbehaving, escalate through progressively more disruptive operations:

| Step | Operation | When to Use |
|------|-----------|-------------|
| 1 | `NodeReboot` | Kubelet or containerd is stuck. Restarts the nspawn container. |
| 2 | `HostReboot` | Node reboot did not help. Reboots the entire host. |
| 3 | Delete the Node | Host is running but the node needs a fresh rootfs. |
| 4 | `HostReplace` | Host OS is corrupted or needs reimaging. |
| 5 | `AgentReset` | Decommission the node entirely. |

Example escalation:

# Step 1: Try a node reboot
cat <<EOF | kubectl apply -f -
apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: troubleshoot-worker-01-step1
spec:
  machineRef: worker-01
  operationKind: NodeReboot
  ttlSecondsAfterFinished: 3600
EOF

kubectl wait mop troubleshoot-worker-01-step1 \
  --for=jsonpath='{.status.phase}'=Complete \
  --timeout=120s

# Check if the node is healthy now
kubectl get node worker-01

# Step 2: If still unhealthy, escalate to host reboot
cat <<EOF | kubectl apply -f -
apiVersion: unbounded-cloud.io/v1alpha3
kind: MachineOperation
metadata:
  name: troubleshoot-worker-01-step2
spec:
  machineRef: worker-01
  operationKind: HostReboot
  ttlSecondsAfterFinished: 3600
EOF
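
The health check between steps can be scripted so that escalation only happens when it is needed. This is a minimal sketch, assuming the Node's Ready condition is the health signal you care about; node_ready is a hypothetical helper.

```shell
# node_ready NODE
# Succeeds only if the Node's Ready condition has status "True".
# A missing Node or a failed lookup counts as not ready.
# (Hypothetical helper for illustration.)
node_ready() {
  status=$(kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null) || status=""
  [ "$status" = "True" ]
}

# Usage: gate the next escalation step on the check
#   if node_ready worker-01; then
#     echo "worker-01 recovered; no further escalation needed"
#   else
#     echo "worker-01 still unhealthy; apply the step 2 manifest"
#   fi
```

A Node can report Ready while workloads on it are still failing, so treat this as a first gate rather than a complete health check.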

Monitoring Operations

Watch All Operations

kubectl get mop -w

Filter by Phase

# Failed operations
kubectl get mop --field-selector status.phase=Failed

# In-progress operations
kubectl get mop --field-selector status.phase=InProgress
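
Field selectors on custom resources work only for metadata.name, metadata.namespace, and any fields the CRD explicitly declares as selectable. If the commands above are rejected with an unsupported-field error in your cluster, filtering on the client side is one fallback. The sketch below assumes only that the phase is stored at .status.phase; mops_in_phase is a hypothetical helper.

```shell
# mops_in_phase PHASE
# Prints the names of MachineOperations whose .status.phase equals PHASE.
# (Hypothetical helper for illustration; filters client-side with awk.)
mops_in_phase() {
  kubectl get mop --no-headers \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase' |
    awk -v want="$1" '$2 == want { print $1 }'
}

# Usage:
#   mops_in_phase Failed
#   mops_in_phase InProgress
```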

View Machine and Operation Status Together

kubectl get machines,mop

Example output:

NAME                                    PROVIDER    PHASE     AGE
machine.unbounded-cloud.io/worker-01    AzureVM     Ready     7d
machine.unbounded-cloud.io/worker-02    AzureVM     Ready     7d
machine.unbounded-cloud.io/worker-03    AzureVM     Ready     7d

NAME                                              KIND          MACHINE      PHASE      AGE
machineoperation.unbounded-cloud.io/reboot-w01     HostReboot    worker-01    Complete   5m
machineoperation.unbounded-cloud.io/upgrade-w02    AgentUpgrade  worker-02    Complete   3m

Automatic Cleanup

Use ttlSecondsAfterFinished to prevent completed operations from accumulating:

spec:
  ttlSecondsAfterFinished: 3600   # Clean up after 1 hour

For long-running workflows, set a longer TTL to preserve audit history. For high-frequency operations, use a shorter TTL to reduce resource count.
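
Because the field takes raw seconds, scripts that generate manifests may find it convenient to convert a readable duration first. A small sketch follows; ttl_seconds and its s/m/h/d suffix convention are hypothetical and not something the API itself understands.

```shell
# ttl_seconds DURATION
# Converts a duration such as 90m, 2h, or 7d into seconds.
# A bare number is treated as seconds. Returns 1 on invalid input.
# (Hypothetical helper for illustration.)
ttl_seconds() {
  value=${1%[smhd]}     # numeric part, with any unit suffix removed
  unit=${1#"$value"}    # the suffix, or empty if none was given
  case "$value" in
    ''|*[!0-9]*) echo "invalid duration: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    ''|s) echo "$value" ;;
    m)    echo $((value * 60)) ;;
    h)    echo $((value * 3600)) ;;
    d)    echo $((value * 86400)) ;;
  esac
}

# Usage:
#   ttl_seconds 1h    # prints 3600
#   ttl_seconds 2h    # prints 7200
```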