KubeSolo OS — Atomic Update Flow
This document describes the A/B partition update mechanism used by KubeSolo OS for safe, atomic OS updates with automatic rollback.
Partition Layout
KubeSolo OS uses a 4-partition GPT layout:
Disk (minimum 4 GB):
Part 1: EFI/Boot (256 MB, FAT32, label: KSOLOEFI) — GRUB + boot config
Part 2: System A (512 MB, ext4, label: KSOLOA) — vmlinuz + kubesolo-os.gz
Part 3: System B (512 MB, ext4, label: KSOLOB) — vmlinuz + kubesolo-os.gz
Part 4: Data (remaining, ext4, label: KSOLODATA) — persistent K8s state
Only one system partition is active at a time. The other is the "passive" slot used for staging updates.
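For reference, the same layout can be written as an sfdisk(8) script. This is a sketch for illustration only; the actual image builder may partition differently, and the FAT32/ext4 labels are applied at mkfs time (the GPT partition names below merely mirror them):

```
# Illustrative sfdisk script for the 4-partition GPT layout above.
# Type GUIDs: EFI System Partition and Linux filesystem.
label: gpt
size=256MiB, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, name="KSOLOEFI"
size=512MiB, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, name="KSOLOA"
size=512MiB, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, name="KSOLOB"
type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, name="KSOLODATA"
```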
GRUB Environment Variables
The A/B boot logic is controlled by three GRUB environment variables stored in /boot/grub/grubenv:
| Variable | Values | Description |
|---|---|---|
| `active_slot` | `A` or `B` | Which system partition to boot |
| `boot_counter` | `3` → `0` | Attempts remaining before rollback |
| `boot_success` | `0` or `1` | Whether the current boot has been verified healthy |
Boot Flow
┌──────────────┐
│ GRUB starts │
└──────┬───────┘
│
┌──────▼───────┐
│ Load grubenv │
└──────┬───────┘
│
┌─────────▼─────────┐
│ boot_success == 1? │
└────┬──────────┬───┘
yes│ │no
│ ┌─────▼──────────┐
│ │ boot_counter=0? │
│ └──┬──────────┬──┘
│ no │ │ yes
│ │ ┌─────▼──────────┐
│ │ │ SWAP active_slot│
│ │ │ Reset counter=3 │
│ │ └─────┬───────────┘
│ │ │
┌────▼───────▼──────────▼────┐
│ Set boot_success=0 │
│ Decrement boot_counter │
│ Boot active_slot partition │
└────────────┬───────────────┘
│
┌─────────▼─────────┐
│ System boots... │
└─────────┬─────────┘
│
┌─────────▼─────────────┐
│ Health check runs │
│ (containerd, API, │
│ node Ready) │
└─────┬──────────┬──────┘
pass│ │fail
┌─────▼─────┐ │
│ Mark boot │ │ boot_success stays 0
│ success=1 │ │ counter decremented
│ counter=3 │ │ on next reboot
└───────────┘      └──► (reboot, back to GRUB)
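The decision tree above maps onto grub.cfg roughly as follows. This is an illustrative sketch, not the shipped config; the debug, emergency-shell, and manual slot-switch menu entries are omitted:

```
# Illustrative A/B selection logic for grub.cfg.
load_env

if [ "${boot_success}" != "1" ]; then
  if [ "${boot_counter}" = "0" ]; then
    # Three boots failed: swap slots and reset the counter
    if [ "${active_slot}" = "A" ]; then
      set active_slot=B
    else
      set active_slot=A
    fi
    set boot_counter=3
  fi
fi

# Arm the watchdog for this boot attempt
set boot_success=0
# GRUB scripting has no arithmetic, so decrement via a lookup chain
if   [ "${boot_counter}" = "3" ]; then set boot_counter=2
elif [ "${boot_counter}" = "2" ]; then set boot_counter=1
elif [ "${boot_counter}" = "1" ]; then set boot_counter=0
fi
save_env active_slot boot_counter boot_success

# Boot the active slot by filesystem label
if [ "${active_slot}" = "A" ]; then
  search --no-floppy --set=root --label KSOLOA
else
  search --no-floppy --set=root --label KSOLOB
fi
linux /vmlinuz
initrd /kubesolo-os.gz
```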
Rollback Behavior
The boot counter starts at 3 and decrements on each boot where boot_success remains 0:
- Boot 1: counter 3 → 2 (health check fails → reboot)
- Boot 2: counter 2 → 1 (health check fails → reboot)
- Boot 3: counter 1 → 0 (health check fails → reboot)
- Boot 4: counter = 0, GRUB swaps `active_slot` and resets the counter to 3
This provides 3 chances for the new version to pass health checks before automatic rollback to the previous version.
Update Agent Commands
The kubesolo-update binary provides 6 subcommands:
check — Check for Updates
Queries the update server and compares against the current running version.
kubesolo-update check --server https://updates.example.com
Output:
Current version: 1.0.0 (slot A)
Latest version: 1.1.0
Status: update available
apply — Download and Write Update
Downloads the new OS image (vmlinuz + initramfs) from the update server, verifies SHA256 checksums, and writes to the passive partition.
kubesolo-update apply --server https://updates.example.com
This does NOT activate the new partition or trigger a reboot.
activate — Set Next Boot Target
Switches the GRUB boot target to the passive partition (the one with the new image) and sets boot_counter=3.
kubesolo-update activate
After activation, reboot to boot into the new version:
reboot
rollback — Force Rollback
Manually switches to the other partition, regardless of health check status.
kubesolo-update rollback
reboot
healthcheck — Post-Boot Health Verification
Runs after every boot to verify the system is healthy. If all checks pass, marks boot_success=1 in GRUB to prevent rollback.
Checks performed:
- containerd: socket exists and `ctr version` responds
- API server: TCP connection to 127.0.0.1:6443 succeeds and the `/healthz` endpoint responds
- Node Ready: `kubectl get nodes` shows `Ready` status
kubesolo-update healthcheck --timeout 120
status — Show A/B Slot Status
Displays the current partition state:
kubesolo-update status
Output:
KubeSolo OS — A/B Partition Status
───────────────────────────────────
Active slot: A
Passive slot: B
Boot counter: 3
Boot success: 1
✓ System is healthy (boot confirmed)
Update Server Protocol
The update server is a simple HTTP(S) file server that serves:
/latest.json — Update metadata
/vmlinuz-<version> — Linux kernel
/kubesolo-os-<version>.gz — Initramfs
latest.json Format
{
"version": "1.1.0",
"vmlinuz_url": "https://updates.example.com/vmlinuz-1.1.0",
"vmlinuz_sha256": "abc123...",
"initramfs_url": "https://updates.example.com/kubesolo-os-1.1.0.gz",
"initramfs_sha256": "def456...",
"release_notes": "Bug fixes and performance improvements",
"release_date": "2025-01-15"
}
Any static file server (nginx, S3, GitHub Releases) can serve as an update server.
Automated Updates via CronJob
KubeSolo OS includes a Kubernetes CronJob for automatic update checking:
# Deploy the update CronJob
kubectl apply -f /usr/lib/kubesolo-os/update-cronjob.yaml
# Configure the update server URL
kubectl -n kube-system create configmap kubesolo-update-config \
--from-literal=server-url=https://updates.example.com
# Manually trigger an update check
kubectl create job --from=cronjob/kubesolo-update kubesolo-update-manual -n kube-system
The CronJob runs every 6 hours and performs apply (download + write). It does NOT reboot — the administrator controls when to reboot.
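A manifest along these lines would implement that schedule. This is an illustrative sketch, not the shipped update-cronjob.yaml; the container image name and mount layout are assumptions:

```yaml
# Sketch of update-cronjob.yaml (field values illustrative).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kubesolo-update
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"        # every 6 hours
  concurrencyPolicy: Forbid      # never run two updates at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: update
            image: kubesolo-os/host-exec   # assumed image name
            securityContext:
              privileged: true             # needs raw partition + grubenv access
            command: ["/usr/lib/kubesolo-os/kubesolo-update", "apply"]
            args: ["--server", "$(SERVER_URL)"]
            env:
            - name: SERVER_URL
              valueFrom:
                configMapKeyRef:
                  name: kubesolo-update-config
                  key: server-url
            volumeMounts:
            - name: host
              mountPath: /host
          volumes:
          - name: host
            hostPath:
              path: /
```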
Complete Update Cycle
A full update cycle looks like:
# 1. Check if update is available
kubesolo-update check --server https://updates.example.com
# 2. Download and write to passive partition
kubesolo-update apply --server https://updates.example.com
# 3. Activate the new partition
kubesolo-update activate
# 4. Reboot into the new version
reboot
# 5. (Automatic) Health check runs, marks boot successful
# kubesolo-update healthcheck is run by init system
# 6. Verify status
kubesolo-update status
If the health check fails 3 times, GRUB automatically rolls back to the previous version on the next reboot.
Command-Line Options
All subcommands accept these options:
| Option | Default | Description |
|---|---|---|
| `--server URL` | (none) | Update server URL |
| `--grubenv PATH` | `/boot/grub/grubenv` | Path to GRUB environment file |
| `--timeout SECS` | `120` | Health check timeout in seconds |
File Locations
| File | Description |
|---|---|
| `/usr/lib/kubesolo-os/kubesolo-update` | Update agent binary |
| `/boot/grub/grubenv` | GRUB environment (on EFI partition) |
| `/boot/grub/grub.cfg` | GRUB boot config with A/B logic |
| `<system-partition>/vmlinuz` | Linux kernel |
| `<system-partition>/kubesolo-os.gz` | Initramfs |
| `<system-partition>/version` | Version string |