
KubeSolo OS — Atomic Update Flow

This document describes the A/B partition update mechanism used by KubeSolo OS for safe, atomic OS updates with automatic rollback.

Partition Layout

KubeSolo OS uses a 4-partition GPT layout:

Disk (minimum 4 GB):
  Part 1: EFI/Boot    (256 MB, FAT32, label: KSOLOEFI)   — GRUB + boot config
  Part 2: System A    (512 MB, ext4,  label: KSOLOA)     — vmlinuz + kubesolo-os.gz
  Part 3: System B    (512 MB, ext4,  label: KSOLOB)     — vmlinuz + kubesolo-os.gz
  Part 4: Data        (remaining, ext4, label: KSOLODATA) — persistent K8s state

Only one system partition is active at a time. The other is the "passive" slot used for staging updates.

GRUB Environment Variables

The A/B boot logic is controlled by three GRUB environment variables stored in /boot/grub/grubenv:

Variable       Values   Description
active_slot    A or B   Which system partition to boot
boot_counter   0–3      Boot attempts remaining before rollback
boot_success   0 or 1   Whether the current boot has been verified healthy
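GRUB stores these variables in a fixed 1024-byte environment block: a `# GRUB Environment Block` header line, `key=value` lines, and `#` padding to fill the block. A sketch of the manual-fallback parsing that pkg/grubenv can use when grub-editenv is unavailable (parseGrubenv is a hypothetical name, not the actual package API):

```go
package main

import (
	"fmt"
	"strings"
)

// parseGrubenv extracts key=value pairs from a GRUB environment
// block, skipping the "# GRUB Environment Block" header and any
// '#' padding lines.
func parseGrubenv(data []byte) map[string]string {
	vars := make(map[string]string)
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "#") || !strings.Contains(line, "=") {
			continue
		}
		k, v, _ := strings.Cut(line, "=")
		vars[k] = v
	}
	return vars
}

func main() {
	sample := "# GRUB Environment Block\nactive_slot=A\nboot_counter=3\nboot_success=1\n"
	env := parseGrubenv([]byte(sample))
	fmt.Printf("active=%s counter=%s success=%s\n",
		env["active_slot"], env["boot_counter"], env["boot_success"])
}
```

Writing the block back requires re-padding to exactly 1024 bytes, which is why the agent prefers grub-editenv when it is present.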

Boot Flow

                    ┌──────────────┐
                    │ GRUB starts  │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │ Load grubenv │
                    └──────┬───────┘
                           │
                 ┌─────────▼─────────┐
                 │ boot_success == 1? │
                 └────┬──────────┬───┘
                   yes│          │no
                      │    ┌─────▼──────────┐
                      │    │ boot_counter=0? │
                      │    └──┬──────────┬──┘
                      │    no │          │ yes
                      │       │    ┌─────▼──────────┐
                      │       │    │ SWAP active_slot│
                      │       │    │ Reset counter=3 │
                      │       │    └─────┬───────────┘
                      │       │          │
                 ┌────▼───────▼──────────▼────┐
                 │ Set boot_success=0         │
                 │ Decrement boot_counter     │
                 │ Boot active_slot partition │
                 └────────────┬───────────────┘
                              │
                    ┌─────────▼─────────┐
                    │  System boots...  │
                    └─────────┬─────────┘
                              │
                    ┌─────────▼─────────────┐
                    │ Health check runs     │
                    │ (containerd, API,     │
                    │  node Ready)          │
                    └─────┬──────────┬──────┘
                       pass│          │fail
                    ┌─────▼─────┐     │
                     │ Mark boot │     │ boot_success stays 0;
                     │ success=1 │     │ counter is decremented
                     │ counter=3 │     │ on the next reboot
                     └───────────┘     └──▶ reboot returns to GRUB

Rollback Behavior

The boot counter starts at 3 and decrements on each boot where boot_success remains 0:

  1. Boot 1: counter 3 → 2 (health check fails → reboot)
  2. Boot 2: counter 2 → 1 (health check fails → reboot)
  3. Boot 3: counter 1 → 0 (health check fails → reboot)
  4. Boot 4: counter = 0, GRUB swaps active_slot and resets counter to 3

This provides 3 chances for the new version to pass health checks before automatic rollback to the previous version.
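The decision itself lives in grub.cfg, but the logic is easier to follow as code. The sketch below simulates the swap-and-reset behavior; grubState and nextBoot are illustrative names, not part of the update agent:

```go
package main

import "fmt"

// grubState mirrors the three GRUB environment variables.
type grubState struct {
	activeSlot  string // "A" or "B"
	bootCounter int
	bootSuccess bool
}

// nextBoot applies the A/B logic GRUB runs on every boot: if the
// previous boot was never confirmed healthy and the counter is
// exhausted, swap slots and reset the counter to 3; then consume
// one attempt and boot the (possibly new) active slot.
func nextBoot(s *grubState) string {
	if !s.bootSuccess && s.bootCounter == 0 {
		if s.activeSlot == "A" {
			s.activeSlot = "B"
		} else {
			s.activeSlot = "A"
		}
		s.bootCounter = 3
	}
	s.bootSuccess = false // cleared until the health check passes
	s.bootCounter--
	return s.activeSlot
}

func main() {
	// Simulate an update to slot B that never passes health checks:
	// three attempts on B, then automatic rollback to A on boot 4.
	s := &grubState{activeSlot: "B", bootCounter: 3}
	for i := 1; i <= 4; i++ {
		fmt.Printf("boot %d: slot %s, counter now %d\n", i, nextBoot(s), s.bootCounter)
	}
}
```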

Update Agent Commands

The kubesolo-update binary provides 6 subcommands:

check — Check for Updates

Queries the update server and compares against the current running version.

kubesolo-update check --server https://updates.example.com

Output:

Current version: 1.0.0 (slot A)
Latest version:  1.1.0
Status: update available

apply — Download and Write Update

Downloads the new OS image (vmlinuz + initramfs) from the update server, verifies SHA256 checksums, and writes to the passive partition.

kubesolo-update apply --server https://updates.example.com

This does NOT activate the new partition or trigger a reboot.

activate — Set Next Boot Target

Switches the GRUB boot target to the passive partition (the one with the new image) and sets boot_counter=3.

kubesolo-update activate

After activation, reboot to boot into the new version:

reboot

rollback — Force Rollback

Manually switches to the other partition, regardless of health check status.

kubesolo-update rollback
reboot

healthcheck — Post-Boot Health Verification

Runs after every boot to verify the system is healthy. If all checks pass, marks boot_success=1 in GRUB to prevent rollback.

Checks performed:

  1. containerd: Socket exists and ctr version responds
  2. API server: TCP connection to 127.0.0.1:6443 and /healthz endpoint
  3. Node Ready: kubectl get nodes shows Ready status

kubesolo-update healthcheck --timeout 120

status — Show A/B Slot Status

Displays the current partition state:

kubesolo-update status

Output:

KubeSolo OS — A/B Partition Status
───────────────────────────────────
  Active slot:   A
  Passive slot:  B
  Boot counter:  3
  Boot success:  1

  ✓ System is healthy (boot confirmed)

Update Server Protocol

The update server is a simple HTTP(S) file server that serves:

/latest.json            — Update metadata
/vmlinuz-<version>      — Linux kernel
/kubesolo-os-<version>.gz — Initramfs

latest.json Format

{
  "version": "1.1.0",
  "vmlinuz_url": "https://updates.example.com/vmlinuz-1.1.0",
  "vmlinuz_sha256": "abc123...",
  "initramfs_url": "https://updates.example.com/kubesolo-os-1.1.0.gz",
  "initramfs_sha256": "def456...",
  "release_notes": "Bug fixes and performance improvements",
  "release_date": "2025-01-15"
}

Any static file server (nginx, S3, GitHub Releases) can serve as an update server.

Automated Updates via CronJob

KubeSolo OS includes a Kubernetes CronJob for automatic update checking:

# Deploy the update CronJob
kubectl apply -f /usr/lib/kubesolo-os/update-cronjob.yaml

# Configure the update server URL
kubectl -n kube-system create configmap kubesolo-update-config \
  --from-literal=server-url=https://updates.example.com

# Manually trigger an update check
kubectl create job --from=cronjob/kubesolo-update kubesolo-update-manual -n kube-system

The CronJob runs every 6 hours and performs apply (download + write). It does NOT reboot — the administrator controls when to reboot.

Complete Update Cycle

A full update cycle looks like:

# 1. Check if update is available
kubesolo-update check --server https://updates.example.com

# 2. Download and write to passive partition
kubesolo-update apply --server https://updates.example.com

# 3. Activate the new partition
kubesolo-update activate

# 4. Reboot into the new version
reboot

# 5. (Automatic) Health check runs, marks boot successful
# kubesolo-update healthcheck is run by init system

# 6. Verify status
kubesolo-update status

If the health check fails 3 times, GRUB automatically rolls back to the previous version on the next reboot.

Command-Line Options

All subcommands accept these options:

Option           Default              Description
--server URL     (none)               Update server URL
--grubenv PATH   /boot/grub/grubenv   Path to the GRUB environment file
--timeout SECS   120                  Health check timeout in seconds

File Locations

File                                   Description
/usr/lib/kubesolo-os/kubesolo-update   Update agent binary
/boot/grub/grubenv                     GRUB environment (on the EFI partition)
/boot/grub/grub.cfg                    GRUB boot config with the A/B logic
<system-partition>/vmlinuz             Linux kernel
<system-partition>/kubesolo-os.gz      Initramfs
<system-partition>/version             Version string