feat: add A/B partition updates with GRUB and Go update agent (Phase 3)

Implement atomic OS updates via A/B partition scheme with automatic
rollback. GRUB bootloader manages slot selection with a 3-attempt
boot counter that auto-rolls back on repeated health check failures.

GRUB boot config:
- A/B slot selection with boot_counter/boot_success env vars
- Automatic rollback when counter reaches 0 (3 failed boots)
- Debug, emergency shell, and manual slot-switch menu entries

Disk image (refactored):
- 4-partition GPT layout: EFI + System A + System B + Data
- GRUB EFI/BIOS installation with graceful fallbacks
- Both system partitions populated during image creation

Update agent (Go, zero external deps):
- pkg/grubenv: read/write GRUB env vars (grub-editenv + manual fallback)
- pkg/partition: find/mount/write system partitions by label
- pkg/image: HTTP download with SHA256 verification
- pkg/health: post-boot checks (containerd, API server, node Ready)
- 6 CLI commands: check, apply, activate, rollback, healthcheck, status
- 37 unit tests across all 4 packages

Deployment:
- K8s CronJob for automatic update checks (every 6 hours)
- ConfigMap for update server URL
- Health check Job for post-boot verification

Build pipeline:
- build-update-agent.sh compiles static Linux binary (~5.9 MB)
- inject-kubesolo.sh includes update agent in initramfs
- Makefile: build-update-agent, test-update-agent, test-update targets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 11:12:46 -06:00
parent d900fa920e
commit 8d25e1890e
25 changed files with 2807 additions and 74 deletions

docs/update-flow.md Normal file

@@ -0,0 +1,261 @@
# KubeSolo OS — Atomic Update Flow
This document describes the A/B partition update mechanism used by KubeSolo OS for safe, atomic OS updates with automatic rollback.
## Partition Layout
KubeSolo OS uses a 4-partition GPT layout:
```
Disk (minimum 4 GB):
Part 1: EFI/Boot (256 MB, FAT32, label: KSOLOEFI) — GRUB + boot config
Part 2: System A (512 MB, ext4, label: KSOLOA) — vmlinuz + kubesolo-os.gz
Part 3: System B (512 MB, ext4, label: KSOLOB) — vmlinuz + kubesolo-os.gz
Part 4: Data (remaining, ext4, label: KSOLODATA) — persistent K8s state
```
Only one system partition is active at a time. The other is the "passive" slot used for staging updates.
## GRUB Environment Variables
The A/B boot logic is controlled by three GRUB environment variables stored in `/boot/grub/grubenv`:
| Variable | Values | Description |
|---|---|---|
| `active_slot` | `A` or `B` | Which system partition to boot |
| `boot_counter` | `3` down to `0` | Attempts remaining before rollback |
| `boot_success` | `0` or `1` | Whether the current boot has been verified healthy |
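
To make the table concrete, here is a minimal Go sketch of reading these variables from a raw grubenv file. It relies on the standard GRUB environment block layout (a `# GRUB Environment Block` header line, `key=value` lines, and `#` padding, conventionally up to 1024 bytes). The function name `parseGrubEnv` is hypothetical and is not taken from `pkg/grubenv`.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// grubEnvHeader is the signature line that opens a GRUB environment block.
const grubEnvHeader = "# GRUB Environment Block\n"

// parseGrubEnv (hypothetical helper) extracts key=value pairs from the raw
// contents of a grubenv file. Lines starting with '#', which covers the
// header and the trailing padding, are skipped.
func parseGrubEnv(data []byte) (map[string]string, error) {
	if !bytes.HasPrefix(data, []byte(grubEnvHeader)) {
		return nil, fmt.Errorf("grubenv: missing header")
	}
	vars := make(map[string]string)
	for _, line := range strings.Split(string(data), "\n") {
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			return nil, fmt.Errorf("grubenv: malformed line %q", line)
		}
		vars[key] = value
	}
	return vars, nil
}

func main() {
	// A shortened block for illustration; a real file is padded with '#'.
	raw := []byte(grubEnvHeader +
		"active_slot=A\nboot_counter=3\nboot_success=1\n" +
		strings.Repeat("#", 16))
	vars, err := parseGrubEnv(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(vars["active_slot"], vars["boot_counter"], vars["boot_success"])
}
```

Writing the file back is more delicate than reading it, because the block has to keep its fixed size; that is one reason to prefer `grub-editenv` when it is available.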
## Boot Flow
```
   ┌──────────────┐
   │ GRUB starts  │
   └──────┬───────┘
   ┌──────▼───────┐
   │ Load grubenv │
   └──────┬───────┘
┌─────────▼──────────┐
│ boot_success == 1? │
└────┬──────────┬────┘
  yes│          │no
     │    ┌─────▼───────────┐
     │    │ boot_counter=0? │
     │    └──┬──────────┬───┘
     │    no │          │ yes
     │       │    ┌─────▼───────────┐
     │       │    │ SWAP active_slot│
     │       │    │ Reset counter=3 │
     │       │    └─────┬───────────┘
     │       │          │
┌────▼───────▼──────────▼────┐
│ Set boot_success=0         │
│ Decrement boot_counter     │
│ Boot active_slot partition │
└────────────┬───────────────┘
   ┌─────────▼─────────┐
   │  System boots...  │
   └─────────┬─────────┘
   ┌─────────▼─────────────┐
   │ Health check runs     │
   │ (containerd, API,     │
   │  node Ready)          │
   └─────┬──────────┬──────┘
     pass│          │fail
   ┌─────▼─────┐    │
   │ Mark boot │    │ boot_success stays 0
   │ success=1 │    │ counter decremented
   │ counter=3 │    │ on next reboot
   └───────────┘    └──────────────────────
```
### Rollback Behavior
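The boot counter starts at 3. GRUB decrements it on every boot, and a passing health check resets it to 3. When health checks keep failing, `boot_success` stays 0 and the counter runs down: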
1. **Boot 1**: counter 3 → 2 (health check fails → reboot)
2. **Boot 2**: counter 2 → 1 (health check fails → reboot)
3. **Boot 3**: counter 1 → 0 (health check fails → reboot)
4. **Boot 4**: counter = 0, GRUB swaps `active_slot` and resets counter to 3
This provides **3 chances** for the new version to pass health checks before automatic rollback to the previous version.
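
The boot decision logic can be modelled as a small state machine. The Go sketch below simulates it outside of GRUB purely for illustration; the type `bootState` and the functions `grubBoot` and `markHealthy` are hypothetical names, and the starting state assumes `boot_success=0` right after an update is activated.

```go
package main

import "fmt"

// bootState mirrors the three grubenv variables.
type bootState struct {
	ActiveSlot  string // "A" or "B"
	BootCounter int    // attempts remaining before rollback
	BootSuccess int    // 1 once the current boot was verified healthy
}

// grubBoot models what GRUB does before handing off to the kernel and
// returns the slot that gets booted.
func grubBoot(s *bootState) string {
	if s.BootSuccess == 0 && s.BootCounter == 0 {
		// Out of attempts: swap slots and start a fresh countdown.
		if s.ActiveSlot == "A" {
			s.ActiveSlot = "B"
		} else {
			s.ActiveSlot = "A"
		}
		s.BootCounter = 3
	}
	s.BootSuccess = 0
	if s.BootCounter > 0 { // defensive guard, not part of the documented flow
		s.BootCounter--
	}
	return s.ActiveSlot
}

// markHealthy models a passing health check.
func markHealthy(s *bootState) {
	s.BootSuccess = 1
	s.BootCounter = 3
}

func main() {
	// Assumed state right after activating slot B.
	s := &bootState{ActiveSlot: "B", BootCounter: 3, BootSuccess: 0}
	for boot := 1; boot <= 4; boot++ {
		slot := grubBoot(s) // the health check is assumed to fail every time
		fmt.Printf("boot %d: slot %s, counter now %d\n", boot, slot, s.BootCounter)
	}
}
```

In this simulation the first three boots stay on slot B and the fourth boot lands on slot A, which matches the numbered sequence above.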
## Update Agent Commands
The `kubesolo-update` binary provides 6 subcommands:
### `check` — Check for Updates
Queries the update server and compares against the current running version.
```bash
kubesolo-update check --server https://updates.example.com
```
Output:
```
Current version: 1.0.0 (slot A)
Latest version: 1.1.0
Status: update available
```
### `apply` — Download and Write Update
Downloads the new OS image (vmlinuz + initramfs) from the update server, verifies SHA256 checksums, and writes to the passive partition.
```bash
kubesolo-update apply --server https://updates.example.com
```
This does NOT activate the new partition or trigger a reboot.
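
As an illustration of the checksum step, the following Go sketch streams data through SHA-256 and compares the result with the expected hex digest. It is not the code from `pkg/image`; `verifySHA256` and `verifySHA256Bytes` are hypothetical helpers, and in practice the reader would be the downloaded file rather than an in-memory string.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// verifySHA256 (hypothetical helper) hashes everything read from r and
// returns an error unless the digest matches wantHex. Streaming through
// io.Copy avoids holding a whole kernel or initramfs image in memory.
func verifySHA256(r io.Reader, wantHex string) error {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return fmt.Errorf("hashing: %w", err)
	}
	got := hex.EncodeToString(h.Sum(nil))
	if !strings.EqualFold(got, strings.TrimSpace(wantHex)) {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

// verifySHA256Bytes is a convenience wrapper for small in-memory payloads.
func verifySHA256Bytes(b []byte, wantHex string) error {
	return verifySHA256(bytes.NewReader(b), wantHex)
}

func main() {
	// SHA-256 of the ASCII string "hello".
	const want = "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
	fmt.Println(verifySHA256Bytes([]byte("hello"), want) == nil)    // true
	fmt.Println(verifySHA256Bytes([]byte("tampered"), want) == nil) // false
}
```

Verifying before anything is written to the passive partition means a corrupted download never replaces a bootable image.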
### `activate` — Set Next Boot Target
Switches the GRUB boot target to the passive partition (the one with the new image) and sets `boot_counter=3`.
```bash
kubesolo-update activate
```
After activation, reboot to boot into the new version:
```bash
reboot
```
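
Under the hood, activation amounts to two grubenv writes. The Go sketch below builds the equivalent `grub-editenv FILE set NAME=VALUE...` invocation without running it. The helpers `otherSlot` and `activateCommand` are hypothetical, and the real agent may set additional variables or use its manual fallback instead.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// otherSlot returns the passive slot for a given active slot.
func otherSlot(active string) (string, error) {
	switch active {
	case "A":
		return "B", nil
	case "B":
		return "A", nil
	}
	return "", fmt.Errorf("unknown slot %q", active)
}

// activateCommand (hypothetical helper) builds, but does not run, the
// grub-editenv call that points the next boot at the passive slot with a
// fresh boot counter.
func activateCommand(grubenvPath, active string) (*exec.Cmd, error) {
	next, err := otherSlot(active)
	if err != nil {
		return nil, err
	}
	return exec.Command("grub-editenv", grubenvPath, "set",
		"active_slot="+next, "boot_counter=3"), nil
}

func main() {
	cmd, err := activateCommand("/boot/grub/grubenv", "A")
	if err != nil {
		panic(err)
	}
	// Print the command instead of executing it.
	fmt.Println(strings.Join(cmd.Args, " "))
}
```

Passing both assignments in a single `grub-editenv` call keeps the slot switch and the counter reset in one write.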
### `rollback` — Force Rollback
Manually switches to the other partition, regardless of health check status.
```bash
kubesolo-update rollback
reboot
```
### `healthcheck` — Post-Boot Health Verification
Runs after every boot to verify the system is healthy. If all checks pass, it sets `boot_success=1` and resets `boot_counter` to 3 in the GRUB environment, which prevents rollback.
Checks performed:
1. **containerd**: Socket exists and `ctr version` responds
2. **API server**: TCP connection to 127.0.0.1:6443 and `/healthz` endpoint
3. **Node Ready**: `kubectl get nodes` shows Ready status
```bash
kubesolo-update healthcheck --timeout 120
```
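
Services need time to come up after boot, so each check has to be retried until the timeout expires. The Go sketch below shows one way to structure that, using a TCP probe like the API server check as the example. The names `check`, `tcpCheck`, `waitHealthy`, and `errNotReady` are hypothetical, and the retry interval is an assumption, not a documented setting.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// errNotReady is a sentinel that custom checks can return while a service
// is still starting.
var errNotReady = errors.New("not ready")

// check is a named probe that returns nil once its target is healthy.
type check struct {
	Name string
	Run  func() error
}

// tcpCheck returns a check that passes once addr accepts TCP connections.
func tcpCheck(name, addr string) check {
	return check{Name: name, Run: func() error {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			return err
		}
		return conn.Close()
	}}
}

// waitHealthy runs the checks in order, retrying each one until it passes
// or the shared deadline expires.
func waitHealthy(checks []check, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for _, c := range checks {
		for {
			err := c.Run()
			if err == nil {
				break
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("%s: %w", c.Name, err)
			}
			time.Sleep(interval)
		}
	}
	return nil
}

func main() {
	checks := []check{tcpCheck("API server", "127.0.0.1:6443")}
	// Short timeout for the demo; the documented default is 120 seconds.
	if err := waitHealthy(checks, 5*time.Second, time.Second); err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	fmt.Println("healthy: safe to set boot_success=1")
}
```

Sharing one deadline across all checks keeps the total wait bounded by `--timeout`, however many checks are configured.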
### `status` — Show A/B Slot Status
Displays the current partition state:
```bash
kubesolo-update status
```
Output:
```
KubeSolo OS — A/B Partition Status
───────────────────────────────────
Active slot: A
Passive slot: B
Boot counter: 3
Boot success: 1
✓ System is healthy (boot confirmed)
```
## Update Server Protocol
The update server is a simple HTTP(S) file server that serves:
```
/latest.json — Update metadata
/vmlinuz-<version> — Linux kernel
/kubesolo-os-<version>.gz — Initramfs
```
### `latest.json` Format
```json
{
"version": "1.1.0",
"vmlinuz_url": "https://updates.example.com/vmlinuz-1.1.0",
"vmlinuz_sha256": "abc123...",
"initramfs_url": "https://updates.example.com/kubesolo-os-1.1.0.gz",
"initramfs_sha256": "def456...",
"release_notes": "Bug fixes and performance improvements",
"release_date": "2025-01-15"
}
```
Any static file server (nginx, S3, GitHub Releases) can serve as an update server.
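
For reference, a client can decode `latest.json` with a plain struct. The Go sketch below maps the fields shown above; the type name `UpdateMetadata`, the function `parseLatest`, and the choice of which fields to treat as required are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UpdateMetadata mirrors the fields of latest.json.
type UpdateMetadata struct {
	Version         string `json:"version"`
	VmlinuzURL      string `json:"vmlinuz_url"`
	VmlinuzSHA256   string `json:"vmlinuz_sha256"`
	InitramfsURL    string `json:"initramfs_url"`
	InitramfsSHA256 string `json:"initramfs_sha256"`
	ReleaseNotes    string `json:"release_notes"`
	ReleaseDate     string `json:"release_date"`
}

// parseLatest (hypothetical helper) decodes the metadata and rejects
// documents that lack the fields needed to download and verify an update.
func parseLatest(data []byte) (*UpdateMetadata, error) {
	var m UpdateMetadata
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, fmt.Errorf("latest.json: %w", err)
	}
	if m.Version == "" || m.VmlinuzURL == "" || m.InitramfsURL == "" ||
		m.VmlinuzSHA256 == "" || m.InitramfsSHA256 == "" {
		return nil, fmt.Errorf("latest.json: missing required field")
	}
	return &m, nil
}

func main() {
	doc := []byte(`{
		"version": "1.1.0",
		"vmlinuz_url": "https://updates.example.com/vmlinuz-1.1.0",
		"vmlinuz_sha256": "abc123...",
		"initramfs_url": "https://updates.example.com/kubesolo-os-1.1.0.gz",
		"initramfs_sha256": "def456..."
	}`)
	m, err := parseLatest(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Version, m.VmlinuzURL)
}
```

Rejecting metadata without checksums up front avoids downloading an image that could never be verified.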
## Automated Updates via CronJob
KubeSolo OS includes a Kubernetes CronJob for automatic update checking:
```bash
# Deploy the update CronJob
kubectl apply -f /usr/lib/kubesolo-os/update-cronjob.yaml
# Configure the update server URL
kubectl -n kube-system create configmap kubesolo-update-config \
--from-literal=server-url=https://updates.example.com
# Manually trigger an update check
kubectl create job --from=cronjob/kubesolo-update kubesolo-update-manual -n kube-system
```
The CronJob runs every 6 hours and performs `apply` (download + write). It does NOT reboot — the administrator controls when to reboot.
## Complete Update Cycle
A full update cycle looks like:
```bash
# 1. Check if update is available
kubesolo-update check --server https://updates.example.com
# 2. Download and write to passive partition
kubesolo-update apply --server https://updates.example.com
# 3. Activate the new partition
kubesolo-update activate
# 4. Reboot into the new version
reboot
# 5. (Automatic) Health check runs, marks boot successful
# kubesolo-update healthcheck is run by init system
# 6. Verify status
kubesolo-update status
```
If the health check fails 3 times, GRUB automatically rolls back to the previous version on the next reboot.
## Command-Line Options
All subcommands accept these options:
| Option | Default | Description |
|---|---|---|
| `--server URL` | (none) | Update server URL |
| `--grubenv PATH` | `/boot/grub/grubenv` | Path to GRUB environment file |
| `--timeout SECS` | `120` | Health check timeout in seconds |
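
The options and defaults in the table map directly onto Go's standard `flag` package, which fits the agent's zero-dependency design. This is a sketch of how they could be declared, not the agent's actual parsing code; the `options` type and `parseOptions` function are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the values from the table above.
type options struct {
	Server  string
	GrubEnv string
	Timeout int // seconds
}

// parseOptions (hypothetical helper) parses the arguments that follow the
// subcommand name. The flag package accepts both -name and --name.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("kubesolo-update", flag.ContinueOnError)
	fs.StringVar(&o.Server, "server", "", "update server URL")
	fs.StringVar(&o.GrubEnv, "grubenv", "/boot/grub/grubenv", "path to GRUB environment file")
	fs.IntVar(&o.Timeout, "timeout", 120, "health check timeout in seconds")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseOptions([]string{"--server", "https://updates.example.com"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("server=%s grubenv=%s timeout=%d\n", o.Server, o.GrubEnv, o.Timeout)
}
```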
## File Locations
| File | Description |
|---|---|
| `/usr/lib/kubesolo-os/kubesolo-update` | Update agent binary |
| `/boot/grub/grubenv` | GRUB environment (on EFI partition) |
| `/boot/grub/grub.cfg` | GRUB boot config with A/B logic |
| `<system-partition>/vmlinuz` | Linux kernel |
| `<system-partition>/kubesolo-os.gz` | Initramfs |
| `<system-partition>/version` | Version string |