Implement atomic OS updates via A/B partition scheme with automatic rollback.

GRUB bootloader manages slot selection with a 3-attempt boot counter that auto-rolls back on repeated health check failures.

GRUB boot config:
- A/B slot selection with boot_counter/boot_success env vars
- Automatic rollback when counter reaches 0 (3 failed boots)
- Debug, emergency shell, and manual slot-switch menu entries

Disk image (refactored):
- 4-partition GPT layout: EFI + System A + System B + Data
- GRUB EFI/BIOS installation with graceful fallbacks
- Both system partitions populated during image creation

Update agent (Go, zero external deps):
- pkg/grubenv: read/write GRUB env vars (grub-editenv + manual fallback)
- pkg/partition: find/mount/write system partitions by label
- pkg/image: HTTP download with SHA256 verification
- pkg/health: post-boot checks (containerd, API server, node Ready)
- 6 CLI commands: check, apply, activate, rollback, healthcheck, status
- 37 unit tests across all 4 packages

Deployment:
- K8s CronJob for automatic update checks (every 6 hours)
- ConfigMap for update server URL
- Health check Job for post-boot verification

Build pipeline:
- build-update-agent.sh compiles static Linux binary (~5.9 MB)
- inject-kubesolo.sh includes update agent in initramfs
- Makefile: build-update-agent, test-update-agent, test-update targets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# KubeSolo OS — Atomic Update Flow

This document describes the A/B partition update mechanism used by KubeSolo OS for safe, atomic OS updates with automatic rollback.

## Partition Layout

KubeSolo OS uses a 4-partition GPT layout:

```
Disk (minimum 4 GB):
  Part 1: EFI/Boot (256 MB, FAT32, label: KSOLOEFI)    — GRUB + boot config
  Part 2: System A (512 MB, ext4, label: KSOLOA)       — vmlinuz + kubesolo-os.gz
  Part 3: System B (512 MB, ext4, label: KSOLOB)       — vmlinuz + kubesolo-os.gz
  Part 4: Data     (remaining, ext4, label: KSOLODATA) — persistent K8s state
```

Only one system partition is active at a time. The other is the "passive" slot used for staging updates.

## GRUB Environment Variables

The A/B boot logic is controlled by three GRUB environment variables stored in `/boot/grub/grubenv`:

| Variable | Values | Description |
|---|---|---|
| `active_slot` | `A` or `B` | Which system partition to boot |
| `boot_counter` | `3` → `0` | Attempts remaining before rollback |
| `boot_success` | `0` or `1` | Whether the current boot has been verified healthy |

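
To make the storage format concrete, here is a minimal sketch of reading and writing these variables without `grub-editenv`. It is written in Python for brevity (the actual agent is Go) and is not the agent's code. It relies on the standard GRUB environment block layout: a 1024-byte file that starts with the `# GRUB Environment Block` signature, holds `name=value` lines, and is padded with `#`. The function names are hypothetical, and GRUB's value escaping is ignored.

```python
ENVBLK_SIGNATURE = "# GRUB Environment Block\n"
ENVBLK_SIZE = 1024  # default size of a GRUB environment block


def parse_grubenv(data):
    """Return the name=value pairs stored in a grubenv blob (bytes)."""
    text = data.decode("utf-8")
    if not text.startswith(ENVBLK_SIGNATURE):
        raise ValueError("missing GRUB environment block signature")
    env = {}
    for line in text.splitlines():
        # Skip the signature, other comment lines, and the '#' padding.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        env[name] = value
    return env


def serialize_grubenv(env):
    """Build a fixed-size grubenv blob from a dict of variables."""
    body = ENVBLK_SIGNATURE + "".join(f"{k}={v}\n" for k, v in env.items())
    encoded = body.encode("utf-8")
    if len(encoded) > ENVBLK_SIZE:
        raise ValueError("variables do not fit in the environment block")
    # GRUB expects the file to keep its size, so pad the rest with '#'.
    return encoded + b"#" * (ENVBLK_SIZE - len(encoded))


blob = serialize_grubenv(
    {"active_slot": "A", "boot_counter": "3", "boot_success": "1"}
)
print(parse_grubenv(blob)["active_slot"])
```

Keeping the file at a fixed size matters because GRUB rewrites the block in place at boot time and cannot grow it.
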
## Boot Flow

```
       ┌──────────────┐
       │ GRUB starts  │
       └──────┬───────┘
              │
       ┌──────▼───────┐
       │ Load grubenv │
       └──────┬───────┘
              │
    ┌─────────▼──────────┐
    │ boot_success == 1? │
    └───┬───────────┬────┘
     yes│           │no
        │   ┌───────▼─────────┐
        │   │ boot_counter=0? │
        │   └───┬─────────┬───┘
        │     no│         │yes
        │       │ ┌───────▼──────────┐
        │       │ │ SWAP active_slot │
        │       │ │ Reset counter=3  │
        │       │ └───────┬──────────┘
        │       │         │
    ┌───▼───────▼─────────▼──────┐
    │ Set boot_success=0         │
    │ Decrement boot_counter     │
    │ Boot active_slot partition │
    └─────────┬──────────────────┘
              │
     ┌────────▼────────┐
     │ System boots... │
     └────────┬────────┘
              │
    ┌─────────▼──────────┐
    │ Health check runs  │
    │ (containerd, API,  │
    │  node Ready)       │
    └───┬───────────┬────┘
    pass│           │fail
  ┌─────▼─────┐     │
  │ Mark boot │     │ boot_success stays 0
  │ success=1 │     │ counter decremented
  │ counter=3 │     │ on next reboot
  └───────────┘     └──────────────────────
```

### Rollback Behavior

GRUB decrements the boot counter on every boot, and a passing health check resets it to 3. After an update is activated, the counter therefore counts down only while `boot_success` remains 0:

1. **Boot 1**: counter 3 → 2 (health check fails → reboot)
2. **Boot 2**: counter 2 → 1 (health check fails → reboot)
3. **Boot 3**: counter 1 → 0 (health check fails → reboot)
4. **Boot 4**: counter = 0, GRUB swaps `active_slot`, resets the counter to 3, and boots the previous version

This provides **3 chances** for the new version to pass health checks before automatic rollback to the previous version.

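
The sequence above can be traced with a small simulation. This is an illustrative Python model of the decision logic in the flow diagram, not the actual `grub.cfg` script. The function names are hypothetical, and the variables are held as integers in a plain dict rather than in `grubenv`.

```python
def grub_select_slot(env):
    """One pass of the boot-time decision; mutates env, returns the slot booted."""
    if env["boot_success"] != 1 and env["boot_counter"] == 0:
        # All attempts used up without a verified boot: roll back.
        env["active_slot"] = "B" if env["active_slot"] == "A" else "A"
        env["boot_counter"] = 3
    # Every boot starts unverified and consumes one attempt.
    env["boot_success"] = 0
    env["boot_counter"] -= 1
    return env["active_slot"]


def mark_healthy(env):
    """What a passing health check records."""
    env["boot_success"] = 1
    env["boot_counter"] = 3


# Slot B was just activated and never passes its health check.
env = {"active_slot": "B", "boot_counter": 3, "boot_success": 0}
booted = [grub_select_slot(env) for _ in range(4)]
print(booted)  # ['B', 'B', 'B', 'A']
```

In this model the fourth boot lands on slot A with two attempts remaining. A passing health check there calls `mark_healthy`, which restores the counter to 3.
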
## Update Agent Commands

The `kubesolo-update` binary provides 6 subcommands:

### `check` — Check for Updates

Queries the update server and compares against the current running version.

```bash
kubesolo-update check --server https://updates.example.com
```

Output:
```
Current version: 1.0.0 (slot A)
Latest version:  1.1.0
Status:          update available
```

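
This document does not specify how the agent compares versions. As one possible approach, the Python sketch below assumes purely numeric dotted versions such as `1.1.0` and compares them component by component. Both function names are hypothetical.

```python
def parse_version(version):
    """Turn '1.10.0' into (1, 10, 0) so components compare numerically."""
    return tuple(int(part) for part in version.strip().split("."))


def update_available(current, latest):
    """True when the server's version is strictly newer than the running one."""
    return parse_version(latest) > parse_version(current)


print(update_available("1.0.0", "1.1.0"))
```

Comparing tuples of integers avoids the string-comparison trap where `"1.10.0"` sorts before `"1.9.0"`.
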
### `apply` — Download and Write Update

Downloads the new OS image (vmlinuz + initramfs) from the update server, verifies SHA256 checksums, and writes to the passive partition.

```bash
kubesolo-update apply --server https://updates.example.com
```

This does NOT activate the new partition or trigger a reboot.

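
The checksum step can be illustrated with a short Python sketch that hashes a download in chunks, so a large initramfs never has to fit in memory, and rejects it on mismatch. This shows the general technique rather than the agent's implementation; the function names and the chunk size are assumptions.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=64 * 1024):
    """Hash a binary file-like object incrementally and return the hex digest."""
    digest = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(stream, expected_hex):
    """Raise ValueError unless the stream's SHA256 matches expected_hex."""
    actual = sha256_of_stream(stream)
    if actual != expected_hex.strip().lower():
        raise ValueError(f"checksum mismatch: expected {expected_hex}, got {actual}")


# In-memory stand-in for a downloaded image.
image = b"example kernel image bytes"
verify_sha256(io.BytesIO(image), hashlib.sha256(image).hexdigest())
```

Verifying before anything is written to the passive partition means a corrupted download cannot replace a working image.
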
### `activate` — Set Next Boot Target

Switches the GRUB boot target to the passive partition (the one with the new image) and sets `boot_counter=3`.

```bash
kubesolo-update activate
```

After activation, reboot to boot into the new version:
```bash
reboot
```

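
In terms of the GRUB variables, activation amounts to a small state change. The Python sketch below models it on a dict of `grubenv` values. The function names are hypothetical, and clearing `boot_success` is an assumption: the description above only mentions the slot switch and `boot_counter=3`.

```python
SLOT_LABELS = {"A": "KSOLOA", "B": "KSOLOB"}  # partition labels from the layout


def passive_slot(active):
    """Return the slot that is not currently active."""
    if active not in SLOT_LABELS:
        raise ValueError(f"unknown slot: {active!r}")
    return "B" if active == "A" else "A"


def activate(env):
    """Return a copy of env that boots the passive slot with a fresh counter."""
    updated = dict(env)
    updated["active_slot"] = passive_slot(env["active_slot"])
    updated["boot_counter"] = "3"
    updated["boot_success"] = "0"  # assumption: new slot starts unverified
    return updated


before = {"active_slot": "A", "boot_counter": "3", "boot_success": "1"}
after = activate(before)
print(after["active_slot"], SLOT_LABELS[after["active_slot"]])
```

Returning a new dict rather than editing in place makes it easy to write the result back in a single step once it is complete.
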
### `rollback` — Force Rollback

Manually switches to the other partition, regardless of health check status.

```bash
kubesolo-update rollback
reboot
```

### `healthcheck` — Post-Boot Health Verification

Runs after every boot to verify the system is healthy. If all checks pass, marks `boot_success=1` in GRUB to prevent rollback.

Checks performed:
1. **containerd**: Socket exists and `ctr version` responds
2. **API server**: TCP connection to 127.0.0.1:6443 and `/healthz` endpoint
3. **Node Ready**: `kubectl get nodes` shows Ready status

```bash
kubesolo-update healthcheck --timeout 120
```

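
Services such as the API server take time to come up after boot, so a health check of this kind typically retries until a deadline instead of failing on the first attempt. The Python sketch below shows that pattern with pluggable checks. It is an assumption about the general approach, not the agent's code; the function names and the 5-second retry interval are hypothetical.

```python
import socket
import time


def tcp_port_open(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_healthy(checks, timeout, interval=5.0, sleep=time.sleep):
    """Re-run all checks until they pass or the timeout expires.

    checks maps a name to a zero-argument callable returning bool.
    Returns the names still failing; an empty list means healthy.
    """
    deadline = time.monotonic() + timeout
    while True:
        failing = [name for name, check in checks.items() if not check()]
        if not failing or time.monotonic() >= deadline:
            return failing
        sleep(interval)


checks = {"api-server": lambda: tcp_port_open("127.0.0.1", 6443)}
failing = wait_until_healthy(checks, timeout=0)
print("healthy" if not failing else f"failing: {failing}")
```

Only an empty result should lead to `boot_success=1`; any remaining failure leaves the variable at 0 so GRUB's counter keeps counting down.
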
### `status` — Show A/B Slot Status

Displays the current partition state:

```bash
kubesolo-update status
```

Output:
```
KubeSolo OS — A/B Partition Status
───────────────────────────────────
Active slot:  A
Passive slot: B
Boot counter: 3
Boot success: 1

✓ System is healthy (boot confirmed)
```

## Update Server Protocol

The update server is a simple HTTP(S) file server that serves:

```
/latest.json              — Update metadata
/vmlinuz-<version>        — Linux kernel
/kubesolo-os-<version>.gz — Initramfs
```

### `latest.json` Format

```json
{
  "version": "1.1.0",
  "vmlinuz_url": "https://updates.example.com/vmlinuz-1.1.0",
  "vmlinuz_sha256": "abc123...",
  "initramfs_url": "https://updates.example.com/kubesolo-os-1.1.0.gz",
  "initramfs_sha256": "def456...",
  "release_notes": "Bug fixes and performance improvements",
  "release_date": "2025-01-15"
}
```

Any static file server (nginx, S3, GitHub Releases) can serve as an update server.

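
Because the metadata comes from a plain file server, a client may want to sanity-check it before downloading anything. The Python sketch below parses `latest.json` and validates it. Which fields are mandatory is an assumption here (the version, both URLs, and both checksums), as are the function name and the rule that checksums are 64 hexadecimal characters.

```python
import json
import re

REQUIRED_FIELDS = (
    "version",
    "vmlinuz_url",
    "vmlinuz_sha256",
    "initramfs_url",
    "initramfs_sha256",
)
SHA256_HEX = re.compile(r"[0-9a-fA-F]{64}")


def parse_update_metadata(raw):
    """Parse latest.json text and reject incomplete or malformed metadata."""
    meta = json.loads(raw)
    if not isinstance(meta, dict):
        raise ValueError("latest.json must contain a JSON object")
    missing = [field for field in REQUIRED_FIELDS if not meta.get(field)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    for field in ("vmlinuz_sha256", "initramfs_sha256"):
        # A SHA256 digest is 32 bytes, i.e. exactly 64 hex characters.
        if not SHA256_HEX.fullmatch(str(meta[field])):
            raise ValueError(f"{field} is not a 64-character hex digest")
    return meta


sample = json.dumps({
    "version": "1.1.0",
    "vmlinuz_url": "https://updates.example.com/vmlinuz-1.1.0",
    "vmlinuz_sha256": "a" * 64,
    "initramfs_url": "https://updates.example.com/kubesolo-os-1.1.0.gz",
    "initramfs_sha256": "b" * 64,
})
print(parse_update_metadata(sample)["version"])
```

The abbreviated digests in the example above (`abc123...`) are placeholders and would be rejected by this check; real metadata carries full digests.
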
## Automated Updates via CronJob

KubeSolo OS includes a Kubernetes CronJob for automatic update checking:

```bash
# Deploy the update CronJob
kubectl apply -f /usr/lib/kubesolo-os/update-cronjob.yaml

# Configure the update server URL
kubectl -n kube-system create configmap kubesolo-update-config \
  --from-literal=server-url=https://updates.example.com

# Manually trigger an update check
kubectl create job --from=cronjob/kubesolo-update kubesolo-update-manual -n kube-system
```

The CronJob runs every 6 hours and performs `apply` (download + write). It does NOT reboot; the administrator controls when to reboot.

## Complete Update Cycle

A full update cycle looks like:

```bash
# 1. Check if update is available
kubesolo-update check --server https://updates.example.com

# 2. Download and write to passive partition
kubesolo-update apply --server https://updates.example.com

# 3. Activate the new partition
kubesolo-update activate

# 4. Reboot into the new version
reboot

# 5. (Automatic) Health check runs, marks boot successful
#    kubesolo-update healthcheck is run by init system

# 6. Verify status
kubesolo-update status
```

If the health check fails 3 times, GRUB automatically rolls back to the previous version on the next reboot.

## Command-Line Options

All subcommands accept these options:

| Option | Default | Description |
|---|---|---|
| `--server URL` | (none) | Update server URL |
| `--grubenv PATH` | `/boot/grub/grubenv` | Path to GRUB environment file |
| `--timeout SECS` | `120` | Health check timeout in seconds |

## File Locations

| File | Description |
|---|---|
| `/usr/lib/kubesolo-os/kubesolo-update` | Update agent binary |
| `/boot/grub/grubenv` | GRUB environment (on EFI partition) |
| `/boot/grub/grub.cfg` | GRUB boot config with A/B logic |
| `<system-partition>/vmlinuz` | Linux kernel |
| `<system-partition>/kubesolo-os.gz` | Initramfs |
| `<system-partition>/version` | Version string |