diff --git a/CHANGELOG.md b/CHANGELOG.md index eda63d0..e60e179 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,7 +5,12 @@ All notable changes to KubeSolo OS are documented in this file. Format based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html). -## [0.3.0-dev] - unreleased +## [0.3.0] - 2026-05-14 + +The main themes: generic ARM64 (not just Raspberry Pi), an honest update +lifecycle with state file + metrics, OCI multi-arch distribution via ghcr.io, +and policy gates (channels, maintenance windows, version stepping-stones, +pre-flight checks, auto-rollback). ### Added @@ -30,6 +35,68 @@ versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html). - `docs/arm64-status.md` — Phase 3 status snapshot, known limitations, what's needed to ship. - `docs/ci-runners.md` — Gitea Actions runner setup (Odroid arm64-linux). +- Update agent state machine and observability (`update/pkg/state`): + - Persistent on-disk `state.json` at `/var/lib/kubesolo/update/state.json` + (atomic write via tmp + rename). Records Phase (Idle / Checking / + Downloading / Staged / Activated / Verifying / Success / RolledBack / + Failed), FromVersion, ToVersion, StartedAt, UpdatedAt, LastError, + AttemptCount, HealthCheckFailures. + - `apply`, `activate`, `healthcheck`, `rollback` all transition state + explicitly on entry / exit / failure. Errors land in LastError so + `status` can show why. + - `kubesolo-update status --json` emits the full state for + orchestration tooling. Human-readable mode adds an "Update Lifecycle" + section when not idle. + - New Prometheus metrics: `kubesolo_update_phase{phase="..."}` (all 9 + phase labels always emitted), `kubesolo_update_attempts_total`, + `kubesolo_update_last_attempt_timestamp_seconds`. 
+- Channels, maintenance windows, version policy (`update/pkg/config`): + - `/etc/kubesolo/update.conf` (key=value, comments, missing-OK) configures + server, channel, maintenance_window, pubkey, healthcheck_url, + auto_rollback_after. + - `cloud-init` top-level `updates:` block writes `update.conf` on first + boot. Empty block leaves any existing file alone. + - `apply` enforces four gates before download: maintenance window, + channel match, runtime architecture match, min_compatible_version + stepping-stone. All gate failures land in the state machine as Failed + with a clear LastError. `--force` bypasses window + node-block-label. + - `UpdateMetadata` JSON gains `channel`, `min_compatible_version`, + `architecture` (all optional, omitempty). +- OCI registry distribution (`update/pkg/oci`, ~280 LOC, 9 tests): + - `kubesolo-update apply --registry ghcr.io//kubesolo-os --tag stable` + pulls update artifacts from any OCI-compliant registry. Multi-arch + indexes resolve to the runtime.GOARCH-matching manifest automatically. + - Custom media types: `application/vnd.kubesolo.os.kernel.v1+octet-stream` + and `application/vnd.kubesolo.os.initramfs.v1+gzip`. Annotations: + `io.kubesolo.os.{version,channel,architecture,min_compatible_version, + release_notes,release_date}`. + - End-to-end digest verification from manifest to blobs via oras-go/v2. + - `build/scripts/push-oci-artifact.sh` publishes per-arch artifacts via + `oras`. Multi-arch index composition documented inline. + - Dependencies added (update module only): oras.land/oras-go/v2 and + transitive opencontainers/{go-digest,image-spec} + golang.org/x/sync. +- Pre-flight gates and deeper healthcheck (`update/pkg/health` extended, + `update/pkg/partition` extended): + - Free-space pre-flight on the passive partition (image + 10% headroom) + via `partition.FreeBytes` / `HasFreeSpaceFor`. + - Node-block-label pre-flight: refuses if the local K8s node carries + `updates.kubesolo.io/block=true`. 
Silently allowed when no kubeconfig + (air-gap). Skipped by `--force`. + - `CheckKubeSystemReady` waits until every kube-system pod has held + Running for ≥ N seconds (configurable via + `--kube-system-settle`). + - `CheckProbeURL` GETs an operator-supplied URL; 200 = pass. Configurable + via `--healthcheck-url` or `healthcheck_url=` in update.conf. + - `CheckDiskWritable` writes / fsyncs / reads / deletes a probe file + under `/var/lib/kubesolo` to catch a wedged data partition. + - `--auto-rollback-after N` (also `auto_rollback_after=` in update.conf): + after N consecutive post-activation healthcheck failures, the agent + calls `ForceRollback()` and the operator/init reboots. Reset to 0 on + a clean pass. +- `.gitea/workflows/build-arm64.yaml` — full ARM64 build on the Odroid + self-hosted runner. Triggers on push to main, tags, and workflow_dispatch. + Boot smoke test marked continue-on-error pending KVM or real-hardware + validation. ### Changed @@ -78,13 +145,23 @@ versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html). ### Known limitations (deferred to follow-up) -- ARM64 `kubesolo.data=LABEL=KSOLODATA` resolution doesn't work yet — - piCore's `blkid`/`findfs` crash in QEMU and our static busybox lacks the - applets. Hardcoded `/dev/vda4` as a workaround. Production fix: ship - static `blkid`/`findfs` or replace LABEL resolution with a sysfs walk. -- AppArmor profile load fails on ARM64 (apparmor_parser ABI mismatch). -- KubeSolo's image-import deadline can fire under QEMU TCG (software - emulation). On real hardware (or with KVM) the import finishes in seconds. +- **ARM64 LABEL= resolution** doesn't work yet — piCore's `blkid`/`findfs` + crash in QEMU and our static busybox lacks the applets. Hardcoded + `/dev/vda4` as a workaround in `build/grub/grub-arm64.cfg`. Production + fix: ship static `blkid`/`findfs` or replace LABEL resolution with a + sysfs walk. +- **AppArmor profile load fails on ARM64** (apparmor_parser ABI mismatch). 
+ Init reports it; boot continues without enforcement. +- **OCI signature verification** is deferred. The HTTP transport still + honours `--pubkey` for `.sig` files; the OCI transport is digest-verified + end-to-end via oras-go but does not yet consume cosign-style referrer + attestations. Targeted for v0.3.1. +- **Real-hardware validation** of the generic ARM64 image is still + pending. Builds and boots end-to-end under QEMU virt; production + certification waits on a Graviton / Ampere run. +- **QEMU TCG performance** can trigger KubeSolo's first-boot image-import + deadline. Not a defect in the OS itself; real hardware and KVM-accelerated + QEMU complete the import in seconds. ## [0.2.0] - 2026-02-12 diff --git a/README.md b/README.md index aad944b..b3df3dc 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ An immutable, bootable Linux distribution purpose-built for [KubeSolo](https://github.com/portainer/kubesolo) — Portainer's ultra-lightweight single-node Kubernetes. -> **Status:** x86_64 is stable — boots and runs K8s workloads, Portainer Edge Agent tested and connected. ARM64 generic UEFI is the active focus for v0.3.0; ARM64 Raspberry Pi support is paused pending physical hardware testing. +> **Status (v0.3.0):** x86_64 and generic ARM64 (UEFI / virtio / mainline kernel) both build and boot end-to-end. Update agent has an explicit state machine, OCI registry distribution alongside HTTP, channel + maintenance-window + version-stepping-stone gates, and auto-rollback. ARM64 Raspberry Pi support remains paused pending physical hardware. See [docs/release-notes-0.3.0.md](docs/release-notes-0.3.0.md) for the full v0.3.0 changelog. ## What is this? 
@@ -24,23 +24,34 @@ KubeSolo OS combines **Tiny Core Linux** (~11 MB) with **KubeSolo** (single-bina ## Quick Start +### x86_64 ISO + ```bash -# Fetch Tiny Core ISO + KubeSolo binary -make fetch - -# Build custom kernel (first time only, ~25 min, cached) -make kernel - -# Build Go binaries +make fetch # Tiny Core ISO + KubeSolo binary +make kernel # Custom kernel (first time only, ~25 min, cached) make build-cloudinit build-update-agent - -# Build bootable ISO make rootfs initramfs iso - -# Test in QEMU make dev-vm ``` +### Generic ARM64 disk image (v0.3.0+) + +For Graviton / Ampere / generic UEFI ARM64 hosts: + +```bash +make kernel-arm64 # Mainline 6.12 LTS kernel (first time only, ~30-60 min) +make rootfs-arm64 # Mainline kernel modules + KubeSolo arm64 +make disk-image-arm64 # UEFI-bootable A/B GPT image +make test-boot-arm64-disk # boot smoke test under qemu-system-aarch64 +``` + +### Raspberry Pi (work in progress) + +Build path lives at `make kernel-rpi` / `make rpi-image`; needs physical +hardware to validate the firmware + autoboot.txt path. See +[docs/arm64-architecture.md](docs/arm64-architecture.md) for the two-track +build layout. 
+ Or build everything at once inside Docker: ```bash @@ -234,9 +245,12 @@ Metrics include: `kubesolo_os_info`, `boot_success`, `boot_counter`, `uptime_sec | 5 | CI/CD, OCI distribution, Prometheus metrics, ARM64 cross-compile | Complete | | 6 | Security hardening, AppArmor | Complete | | - | Custom kernel build for container runtime fixes | Complete (x86_64) | -| 7 | ARM64 generic (mainline kernel, UEFI, virtio) | In progress (v0.3.0) | -| 8 | Update engine v2 (state machine, OCI distribution, channels) | In progress (v0.3.0) | +| 7 | ARM64 generic (mainline kernel, UEFI, virtio) | Complete (v0.3.0, QEMU validated) | +| 8 | Update engine v2 (state machine, channels, OCI, pre-flight gates) | Complete (v0.3.0) | | - | ARM64 Raspberry Pi (custom kernel, firmware, SD card image) | Paused — needs hardware | +| - | OCI cosign signature verification | Planned for v0.3.1 | +| - | LABEL=KSOLODATA on ARM64 (replace blkid/findfs path) | Planned for v0.3.1 | +| - | Real-hardware ARM64 validation (Graviton / Ampere) | Planned for v0.3.1 | ## License diff --git a/VERSION b/VERSION index d510910..0d91a54 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.3.0-dev +0.3.0 diff --git a/docs/release-notes-0.3.0.md b/docs/release-notes-0.3.0.md new file mode 100644 index 0000000..972ebda --- /dev/null +++ b/docs/release-notes-0.3.0.md @@ -0,0 +1,181 @@ +# KubeSolo OS v0.3.0 — Release Notes + +**Released:** 2026-05-14 + +v0.3.0 is the second feature release after v0.2.0 and the first release that +ships a generic ARM64 build alongside x86_64. The update agent grew up: it +now has an explicit on-disk lifecycle, OCI registry distribution, and a +fleet-friendly set of policy gates (channels, maintenance windows, +version-stepping-stones, pre-flight checks, auto-rollback). + +This document is the operator-facing summary. The full per-phase changelog +lives in [CHANGELOG.md](../CHANGELOG.md). 
+ +## What's new + +### Generic ARM64 build + +The image you build with `make disk-image-arm64` now targets any UEFI-capable +ARM64 host: AWS Graviton, Oracle Ampere, generic ARM64 servers, future SBCs +with UEFI-compatible firmware. The kernel comes from kernel.org mainline LTS +(6.12.10 by default, configurable via `MAINLINE_KERNEL_VERSION` in +`build/config/versions.env`). + +This is **distinct** from the Raspberry Pi build path. RPi keeps its +specialised kernel from `raspberrypi/linux` with bcm-defconfig + custom DTBs; +the generic ARM64 path uses mainline + arm64-defconfig + UEFI/virtio. See +[docs/arm64-architecture.md](arm64-architecture.md) for the file-by-file +split. + +KubeSolo bumped to **v1.1.5** (was v1.1.0). New flags surfaced via cloud-init: +- `kubesolo.full` — disable edge-optimised k8s overrides +- `kubesolo.disable-ipv6` — disable IPv6 cluster-wide +- `kubesolo.db-wal-repair` — recover from unclean shutdowns + +### Update lifecycle is now observable + +The update agent writes a `state.json` at `/var/lib/kubesolo/update/state.json` +recording where the current attempt is in the lifecycle: + +``` +idle → checking → downloading → staged → activated → verifying → success + ↘ rolled_back + ↘ failed +``` + +`kubesolo-update status --json` emits the full state for orchestration tooling. +The Prometheus metrics endpoint gains three new series: + +- `kubesolo_update_phase{phase="..."}` — 1 for current phase, 0 for others (all 9 always emitted) +- `kubesolo_update_attempts_total` +- `kubesolo_update_last_attempt_timestamp_seconds` + +### OCI registry distribution + +Update artifacts can now be pulled from any OCI-compliant registry alongside +the existing HTTP `latest.json` protocol: + +```bash +# HTTP, unchanged from v0.2: +kubesolo-update apply --server https://updates.example.com + +# New: OCI from ghcr.io (or quay.io, harbor, zot, ...) 
+kubesolo-update apply --registry ghcr.io/yourorg/kubesolo-os --tag stable +``` + +Multi-arch is handled transparently — the same `stable` tag points at a +manifest index, the agent picks the manifest matching its `runtime.GOARCH`. + +Publish your own artifacts with `build/scripts/push-oci-artifact.sh`. See +the script's header comment for the full publishing flow. + +### Policy gates + +`apply` now enforces five gates before destroying the passive slot: + +1. **Maintenance window** (configurable, e.g. `03:00-05:00`; wrapping + midnight supported) +2. **Node-block-label** — refuses if the K8s node carries + `updates.kubesolo.io/block=true` (workload-author kill switch) +3. **Channel** — `stable` / `beta` / `edge` must match between the artifact + metadata and the local channel +4. **Architecture** — refuses cross-arch artifacts via `runtime.GOARCH` check +5. **Min compatible version** — stepping-stone enforcement; refuses an + upgrade that bypasses a required intermediate version + +`--force` bypasses the maintenance window and node-block label (channel / +arch / min-version are non-negotiable). Failures are recorded in `state.json` +with a clear `LastError` field. + +### Healthcheck deepening + auto-rollback + +`kubesolo-update healthcheck` grew three optional probes: + +- **Kube-system pods** must hold Running for ≥ N seconds before passing +- **Operator probe URL** — GET an operator-supplied endpoint; 200 = pass +- **Disk smoke test** — write/fsync/read/delete a probe file under + `/var/lib/kubesolo` to catch a wedged data partition + +Plus auto-rollback: with `--auto-rollback-after N` (or `auto_rollback_after=` +in `update.conf`), after N consecutive post-activation failures, the agent +calls `ForceRollback()` and the operator/init is expected to reboot. The +counter resets on a clean pass. + +### Persistent configuration via `/etc/kubesolo/update.conf` + +Cloud-init writes this file on first boot from a new `updates:` block; you +can also hand-edit it. 
Recognised keys: + +``` +server = https://updates.example.com # or omit if using registry +registry = # OCI registry ref (alt to server) +channel = stable +maintenance_window = 03:00-05:00 +pubkey = /etc/kubesolo/update-pubkey.hex +healthcheck_url = http://localhost:8000/ready +auto_rollback_after = 3 +``` + +Cloud-init full reference at +[cloud-init/examples/full-config.yaml](../cloud-init/examples/full-config.yaml). + +## Migration from v0.2.x + +This is a non-breaking release for live systems. v0.2.x → v0.3.0 changes: + +- **`state.json` will appear** at `/var/lib/kubesolo/update/state.json` the + first time a v0.3 agent runs `apply`. Pre-existing v0.2 deployments without + this file are fine — the agent treats a missing file as fresh Idle state. +- **`update.conf` is optional**. v0.2 deployments that pass everything via + CLI flags keep working unchanged. +- **HTTP `latest.json` protocol unchanged**. Existing update servers don't + need a rebuild. +- **GRUB env (boot counter, active slot)** unchanged. The bootloader's + rollback behaviour is the same. +- **No new mandatory kernel command-line parameters**. + +To opt into the new lifecycle, transports, and gates, drop in an +`update.conf` (or update cloud-init) and switch to `--registry` if you want +OCI distribution. + +## Known limitations + +These shipped intentionally with v0.3.0 and are explicitly tracked for +v0.3.1+: + +- **OCI signature verification** — the OCI transport is digest-verified + end-to-end via oras-go, but does not yet consume cosign-style referrer + attestations. The HTTP transport still honours `--pubkey` for `.sig` + files. +- **ARM64 LABEL=KSOLODATA** resolution doesn't work yet — piCore's + `blkid`/`findfs` crash on QEMU virt under our mainline kernel; the + static `busybox-static` we ship doesn't include those applets. + `build/grub/grub-arm64.cfg` hardcodes `kubesolo.data=/dev/vda4` as a + workaround. On real ARM64 hardware the device path may differ. 
+- **Real-hardware ARM64 validation** is pending. The image builds and + boots end-to-end under QEMU virt; production certification waits on a + Graviton / Ampere run. +- **AppArmor profile load fails on ARM64** (`apparmor_parser` ABI mismatch). + Init reports the failure; boot continues without AppArmor enforcement. +- **QEMU TCG performance** can trigger KubeSolo's first-boot image-import + deadline. Not an OS defect; real hardware and KVM-accelerated QEMU + complete the import in seconds. + +## How to upgrade your build host + +```bash +git pull +make distclean # optional — drops the build cache; full rebuild takes ~30 min +make iso # or disk-image, or disk-image-arm64 +``` + +The Docker-based builder (`make docker-build`) regenerates its own image +from `build/Dockerfile.builder` on next invocation; oras 1.2.3 and +busybox-static are now included. + +## Acknowledgements + +v0.3.0 work was driven by a single multi-week pair-programming session +working through Phases 0–9 of the v0.3 roadmap. The Odroid self-hosted +Gitea Actions runner (`odroid.local`, arm64-linux) carried every ARM64 +build during development.
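Reviewer note on the maintenance-window gate: the release notes in this patch say the window is configurable (e.g. `03:00-05:00`) and that windows wrapping midnight are supported. The following is a minimal Go sketch of how such a check could work. It is not the code in `update/pkg/config`. The names `parseClock` and `inWindow`, the half-open interval (start inclusive, end exclusive), and the choice to treat equal endpoints as an empty window are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseClock converts "HH:MM" into minutes since midnight.
// Hypothetical helper, not taken from the KubeSolo OS source.
func parseClock(s string) (int, error) {
	h, m, found := strings.Cut(strings.TrimSpace(s), ":")
	if !found {
		return 0, fmt.Errorf("invalid time %q: want HH:MM", s)
	}
	hh, err := strconv.Atoi(h)
	if err != nil || hh < 0 || hh > 23 {
		return 0, fmt.Errorf("invalid hour in %q", s)
	}
	mm, err := strconv.Atoi(m)
	if err != nil || mm < 0 || mm > 59 {
		return 0, fmt.Errorf("invalid minute in %q", s)
	}
	return hh*60 + mm, nil
}

// inWindow reports whether hour:minute falls inside window, written as
// "HH:MM-HH:MM". Assumed semantics: start is inclusive, end is exclusive,
// and a window whose end is earlier than its start wraps past midnight.
// Equal endpoints yield an empty window in this sketch.
func inWindow(window string, hour, minute int) (bool, error) {
	from, to, found := strings.Cut(window, "-")
	if !found {
		return false, fmt.Errorf("invalid window %q: want HH:MM-HH:MM", window)
	}
	start, err := parseClock(from)
	if err != nil {
		return false, err
	}
	end, err := parseClock(to)
	if err != nil {
		return false, err
	}
	now := hour*60 + minute
	if start <= end {
		// Same-day window, e.g. 03:00-05:00.
		return now >= start && now < end, nil
	}
	// Wrapping window, e.g. 23:00-01:00: open late evening OR early morning.
	return now >= start || now < end, nil
}

func main() {
	cases := []struct {
		window string
		h, m   int
	}{
		{"03:00-05:00", 4, 30},
		{"03:00-05:00", 5, 0},
		{"23:00-01:00", 0, 15},
		{"23:00-01:00", 12, 0},
	}
	for _, c := range cases {
		ok, err := inWindow(c.window, c.h, c.m)
		fmt.Printf("%s at %02d:%02d -> open=%v err=%v\n", c.window, c.h, c.m, ok, err)
	}
}
```

Working in minutes since midnight keeps the wrap-around case to a single comparison change (`&&` becomes `||`). A real agent would also have to decide which clock and time zone the window refers to, which this patch does not state.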