feat: add A/B partition updates with GRUB and Go update agent (Phase 3)

Implement atomic OS updates via A/B partition scheme with automatic rollback. GRUB bootloader manages slot selection with a 3-attempt boot counter that auto-rolls back on repeated health check failures.

GRUB boot config:
- A/B slot selection with boot_counter/boot_success env vars
- Automatic rollback when counter reaches 0 (3 failed boots)
- Debug, emergency shell, and manual slot-switch menu entries

Disk image (refactored):
- 4-partition GPT layout: EFI + System A + System B + Data
- GRUB EFI/BIOS installation with graceful fallbacks
- Both system partitions populated during image creation

Update agent (Go, zero external deps):
- pkg/grubenv: read/write GRUB env vars (grub-editenv + manual fallback)
- pkg/partition: find/mount/write system partitions by label
- pkg/image: HTTP download with SHA256 verification
- pkg/health: post-boot checks (containerd, API server, node Ready)
- 6 CLI commands: check, apply, activate, rollback, healthcheck, status
- 37 unit tests across all 4 packages

Deployment:
- K8s CronJob for automatic update checks (every 6 hours)
- ConfigMap for update server URL
- Health check Job for post-boot verification

Build pipeline:
- build-update-agent.sh compiles static Linux binary (~5.9 MB)
- inject-kubesolo.sh includes update agent in initramfs
- Makefile: build-update-agent, test-update-agent, test-update targets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docs/update-flow.md (new file, 261 lines)
# KubeSolo OS — Atomic Update Flow

This document describes the A/B partition update mechanism used by KubeSolo OS for safe, atomic OS updates with automatic rollback.

## Partition Layout

KubeSolo OS uses a 4-partition GPT layout:

```
Disk (minimum 4 GB):
  Part 1: EFI/Boot (256 MB, FAT32, label: KSOLOEFI)    — GRUB + boot config
  Part 2: System A (512 MB, ext4,  label: KSOLOA)      — vmlinuz + kubesolo-os.gz
  Part 3: System B (512 MB, ext4,  label: KSOLOB)      — vmlinuz + kubesolo-os.gz
  Part 4: Data     (remaining, ext4, label: KSOLODATA) — persistent K8s state
```

Only one system partition is active at a time. The other is the "passive" slot used for staging updates.
## GRUB Environment Variables

The A/B boot logic is controlled by three GRUB environment variables stored in `/boot/grub/grubenv`:

| Variable | Values | Description |
|---|---|---|
| `active_slot` | `A` or `B` | Which system partition to boot |
| `boot_counter` | `3` → `0` | Attempts remaining before rollback |
| `boot_success` | `0` or `1` | Whether the current boot has been verified healthy |
## Boot Flow

```
        ┌──────────────┐
        │ GRUB starts  │
        └──────┬───────┘
               │
        ┌──────▼───────┐
        │ Load grubenv │
        └──────┬───────┘
               │
     ┌─────────▼──────────┐
     │ boot_success == 1? │
     └────┬───────────┬───┘
      yes │           │ no
          │    ┌──────▼──────────┐
          │    │ boot_counter=0? │
          │    └──┬───────────┬──┘
          │    no │           │ yes
          │       │    ┌──────▼──────────┐
          │       │    │ SWAP active_slot│
          │       │    │ Reset counter=3 │
          │       │    └──────┬──────────┘
          │       │           │
     ┌────▼───────▼───────────▼────┐
     │ Set boot_success=0          │
     │ Decrement boot_counter      │
     │ Boot active_slot partition  │
     └──────────────┬──────────────┘
                    │
          ┌─────────▼─────────┐
          │ System boots...   │
          └─────────┬─────────┘
                    │
        ┌───────────▼───────────┐
        │ Health check runs     │
        │ (containerd, API,     │
        │  node Ready)          │
        └─────┬───────────┬─────┘
         pass │           │ fail
        ┌─────▼─────┐     │
        │ Mark boot │     │  boot_success stays 0;
        │ success=1 │     │  counter decremented
        │ counter=3 │     │  on next reboot
        └───────────┘     ▼
```
### Rollback Behavior

The boot counter starts at 3 and decrements on each boot where `boot_success` remains 0:

1. **Boot 1**: counter 3 → 2 (health check fails → reboot)
2. **Boot 2**: counter 2 → 1 (health check fails → reboot)
3. **Boot 3**: counter 1 → 0 (health check fails → reboot)
4. **Boot 4**: counter = 0, GRUB swaps `active_slot` and resets counter to 3

This provides **3 chances** for the new version to pass health checks before automatic rollback to the previous version.
## Update Agent Commands

The `kubesolo-update` binary provides 6 subcommands:
### `check` — Check for Updates

Queries the update server and compares against the current running version.

```bash
kubesolo-update check --server https://updates.example.com
```

Output:

```
Current version: 1.0.0 (slot A)
Latest version:  1.1.0
Status:          update available
```
### `apply` — Download and Write Update

Downloads the new OS image (vmlinuz + initramfs) from the update server, verifies SHA256 checksums, and writes to the passive partition.

```bash
kubesolo-update apply --server https://updates.example.com
```

This does NOT activate the new partition or trigger a reboot.
### `activate` — Set Next Boot Target

Switches the GRUB boot target to the passive partition (the one with the new image) and sets `boot_counter=3`.

```bash
kubesolo-update activate
```

After activation, reboot to boot into the new version:

```bash
reboot
```
### `rollback` — Force Rollback

Manually switches to the other partition, regardless of health check status.

```bash
kubesolo-update rollback
reboot
```
### `healthcheck` — Post-Boot Health Verification

Runs after every boot to verify the system is healthy. If all checks pass, marks `boot_success=1` in GRUB to prevent rollback.

Checks performed:

1. **containerd**: Socket exists and `ctr version` responds
2. **API server**: TCP connection to 127.0.0.1:6443 and `/healthz` endpoint
3. **Node Ready**: `kubectl get nodes` shows Ready status

```bash
kubesolo-update healthcheck --timeout 120
```
### `status` — Show A/B Slot Status

Displays the current partition state:

```bash
kubesolo-update status
```

Output:

```
KubeSolo OS — A/B Partition Status
───────────────────────────────────
Active slot:  A
Passive slot: B
Boot counter: 3
Boot success: 1

✓ System is healthy (boot confirmed)
```
## Update Server Protocol

The update server is a simple HTTP(S) file server that serves:

```
/latest.json               — Update metadata
/vmlinuz-<version>         — Linux kernel
/kubesolo-os-<version>.gz  — Initramfs
```
### `latest.json` Format

```json
{
  "version": "1.1.0",
  "vmlinuz_url": "https://updates.example.com/vmlinuz-1.1.0",
  "vmlinuz_sha256": "abc123...",
  "initramfs_url": "https://updates.example.com/kubesolo-os-1.1.0.gz",
  "initramfs_sha256": "def456...",
  "release_notes": "Bug fixes and performance improvements",
  "release_date": "2025-01-15"
}
```

Any static file server (nginx, S3, GitHub Releases) can serve as an update server.
## Automated Updates via CronJob

KubeSolo OS includes a Kubernetes CronJob for automatic update checking:

```bash
# Deploy the update CronJob
kubectl apply -f /usr/lib/kubesolo-os/update-cronjob.yaml

# Configure the update server URL
kubectl -n kube-system create configmap kubesolo-update-config \
  --from-literal=server-url=https://updates.example.com

# Manually trigger an update check
kubectl create job --from=cronjob/kubesolo-update kubesolo-update-manual -n kube-system
```

The CronJob runs every 6 hours and performs `apply` (download + write). It does NOT reboot — the administrator controls when to reboot.
## Complete Update Cycle

A full update cycle looks like:

```bash
# 1. Check if update is available
kubesolo-update check --server https://updates.example.com

# 2. Download and write to passive partition
kubesolo-update apply --server https://updates.example.com

# 3. Activate the new partition
kubesolo-update activate

# 4. Reboot into the new version
reboot

# 5. (Automatic) Health check runs, marks boot successful
#    kubesolo-update healthcheck is run by init system

# 6. Verify status
kubesolo-update status
```

If the health check fails 3 times, GRUB automatically rolls back to the previous version on the next reboot.
## Command-Line Options

All subcommands accept these options:

| Option | Default | Description |
|---|---|---|
| `--server URL` | (none) | Update server URL |
| `--grubenv PATH` | `/boot/grub/grubenv` | Path to GRUB environment file |
| `--timeout SECS` | `120` | Health check timeout in seconds |
## File Locations

| File | Description |
|---|---|
| `/usr/lib/kubesolo-os/kubesolo-update` | Update agent binary |
| `/boot/grub/grubenv` | GRUB environment (on EFI partition) |
| `/boot/grub/grub.cfg` | GRUB boot config with A/B logic |
| `<system-partition>/vmlinuz` | Linux kernel |
| `<system-partition>/kubesolo-os.gz` | Initramfs |
| `<system-partition>/version` | Version string |