Phase 8 of v0.3. Tightens the update lifecycle on both ends.
Pre-flight (apply.go, before any download):
- Free-space check on the passive partition: image size + 10% headroom must
be available. Uses statfs(2) via the new pkg/partition.FreeBytes /
HasFreeSpaceFor helpers (tests cover happy path, tiny request, huge
request, missing path). Catches corrupted-FS and shrunk-partition cases
before we destroy the existing slot data.
- Node-block-label check: refuses if the local K8s node carries the
updates.kubesolo.io/block=true label. New pkg/health.CheckNodeBlocked
shells out to kubectl per the project's zero-deps stance. Silently bypassed
when no kubeconfig is reachable (air-gap case). Skipped by --force.
Healthcheck (extended via new pkg/health/extended.go + preflight.go):
- CheckKubeSystemReady waits until every kube-system pod has held the Running
phase for >= N seconds (default 30). Catches "started ok, will crash-loop"
bugs that a single-shot phase check misses.
- CheckProbeURL fetches an operator-supplied URL; 200 = pass. Wired through
update.conf as healthcheck_url= and cloud-init updates.healthcheck_url.
- CheckDiskWritable writes/fsyncs/reads a 1-KiB probe under /var/lib/kubesolo.
Always runs in healthcheck so a wedged data partition fails fast.
- pkg/health.Status grows KubeSystemReady, ProbeURL, DiskWritable booleans.
Optional checks default to true in RunAll() so they don't block when
unconfigured. health_test.go updated to the new 6-field shape.
Auto-rollback (healthcheck.go):
- state.UpdateState gains HealthCheckFailures (consecutive post-Activated
failures). Reset on a clean pass.
- --auto-rollback-after N (also auto_rollback_after= in update.conf) triggers
env.ForceRollback() when the failure count reaches the threshold. State
transitions to RolledBack with a descriptive LastError. The command still
exits with the healthcheck error; the operator/init is expected to reboot.
- Only fires while Phase == Activated. Doesn't second-guess a long-stable
system that happens to fail one healthcheck.
config / opts / cloud-init plumbing:
- update.conf gains healthcheck_url= and auto_rollback_after= keys.
- New CLI flags: --healthcheck-url, --auto-rollback-after, --kube-system-settle.
- cloud-init full-config.yaml documents the new updates: subfields.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement atomic OS updates via A/B partition scheme with automatic
rollback. GRUB bootloader manages slot selection with a 3-attempt
boot counter that auto-rolls back on repeated health check failures.
GRUB boot config:
- A/B slot selection with boot_counter/boot_success env vars
- Automatic rollback when counter reaches 0 (3 failed boots)
- Debug, emergency shell, and manual slot-switch menu entries
Disk image (refactored):
- 4-partition GPT layout: EFI + System A + System B + Data
- GRUB EFI/BIOS installation with graceful fallbacks
- Both system partitions populated during image creation
Update agent (Go, zero external deps):
- pkg/grubenv: read/write GRUB env vars (grub-editenv + manual fallback)
- pkg/partition: find/mount/write system partitions by label
- pkg/image: HTTP download with SHA256 verification
- pkg/health: post-boot checks (containerd, API server, node Ready)
- 6 CLI commands: check, apply, activate, rollback, healthcheck, status
- 37 unit tests across all 4 packages
Deployment:
- K8s CronJob for automatic update checks (every 6 hours)
- ConfigMap for update server URL
- Health check Job for post-boot verification
Build pipeline:
- build-update-agent.sh compiles static Linux binary (~5.9 MB)
- inject-kubesolo.sh includes update agent in initramfs
- Makefile: build-update-agent, test-update-agent, test-update targets
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>