kubesolo-os/CHANGELOG.md
Adolfo Delorenzo 53268a1564
docs: roll README + CHANGELOG forward past v0.3.1
README:
- Status line bumped from v0.3.0 to v0.3.1 with the actually-validated
  framing (K8s Ready under QEMU virt+HVF, CoreDNS + local-path +
  nginx all Running) and a link to CHANGELOG.md for full notes.
- Roadmap: Phase 7 (generic ARM64) flipped to "Complete (v0.3.1, K8s
  Ready under QEMU virt+HVF)". OCI cosign verification, LABEL=KSOLODATA
  on ARM64, and real-hardware ARM64 validation move from "Planned for
  v0.3.1" to "Planned for v0.3.2" — they didn't make this release.

CHANGELOG:
- New "[Unreleased]" section covering the four post-v0.3.1 CI / repo
  housekeeping commits: drop tag trigger on build-arm64.yaml (04a5cd2),
  gitignore .env/credentials (48267e1), fix gated x86 job staying
  "queued" instead of "skipped" (fb24e64), and paths-ignore on
  build-arm64.yaml so workflow/docs-only commits skip the 60-minute
  kernel rebuild (e1b8a69).

No runtime changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 22:46:12 -06:00

Changelog

All notable changes to KubeSolo OS are documented in this file.

The format is based on Keep a Changelog; versioning follows Semantic Versioning.

[Unreleased]

Pure CI / repository housekeeping; no runtime changes since v0.3.1. All items below shake out workflow-loop bugs exposed by the v0.3.1 release flow on Gitea Actions.

Fixed (CI)

  • build-arm64.yaml no longer triggers on tag pushes. release.yaml already produces the ARM64 disk image as part of the release flow, so triggering both on the same tag wasted an hour of Odroid runner time on a duplicate kernel build. (04a5cd2)
  • The gated build-iso-amd64 job in release.yaml (if: false until an amd64-linux runner exists) used to advertise runs-on: amd64-linux. With no matching runner, Gitea left the job queued forever and the overall workflow run never transitioned to success — even though every load-bearing job had finished and the release was already published. Now uses runs-on: ubuntu-latest so any runner picks the job up just long enough to evaluate if: false and mark it skipped. (fb24e64)
  • build-arm64.yaml now ignores workflow-file, docs, and *.md changes via paths-ignore (.gitea/workflows/**, .github/workflows/**, docs/**, top-level *.md, .gitignore). Workflow- / docs-only commits no longer kick off a 60-minute kernel rebuild on the Odroid. Any change to a kernel fragment, init script, or build script still triggers the full build, as intended. (e1b8a69)
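
To check in advance whether a given commit would still trigger the kernel rebuild, the ignore list above can be approximated in shell. This is only an illustration: Gitea evaluates paths-ignore itself, the function names is_ignored and would_build are hypothetical, and the pattern semantics here follow the prose description above (in particular, only top-level *.md files are ignored) rather than Gitea's actual matcher.

```shell
#!/bin/sh
# is_ignored PATH: succeeds if PATH falls under the ignore list described above.
is_ignored() {
    case $1 in
        .gitea/workflows/*|.github/workflows/*|docs/*|.gitignore) return 0 ;;
        */*) return 1 ;;      # any other file inside a subdirectory counts
        *.md) return 0 ;;     # top-level Markdown only
    esac
    return 1
}

# would_build: reads changed paths on stdin, one per line, and succeeds
# if at least one of them is not ignored.
would_build() {
    while IFS= read -r wb_path; do
        [ -n "$wb_path" ] || continue
        is_ignored "$wb_path" || return 0
    done
    return 1
}
```

For example, `git diff --name-only HEAD~1 | would_build && echo "full build"` would report whether the last commit touches anything outside the ignore list.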

Changed

  • .gitignore now excludes .env, .env.*, *.token, *.pat to keep Gitea PATs and other credentials used during release ops from being accidentally committed. (48267e1)

[0.3.1] - 2026-05-15

First fully-functional generic ARM64 release. v0.3.0 shipped the build scaffold; v0.3.1 makes it actually boot a Kubernetes cluster end-to-end on QEMU virt under HVF acceleration. Validated by deploying CoreDNS, local-path-provisioner, and an nginx:alpine workload — all reach Running, kubectl get nodes reports Ready.

Fixed

  • Dual-glibc loading on ARM64 — piCore64's /lib/libc.so.6 and the build host's /lib/$LIB_ARCH/libc.so.6 could both be resolved into the same process by the dynamic linker, triggering *** stack smashing detected *** aborts when stack frames crossed between functions linked against different libcs. Fix: bundle the full glibc family (libc + libpthread + libdl + libm + libresolv + librt + libanl + libgcc_s + ld.so), delete piCore's duplicates in /lib/, and write /etc/ld.so.conf + ldconfig -r so the runtime linker has a deterministic search order. (76ed2ff)
  • nft binary not bundled — KubeSolo v1.1.4+ runs nft add table ip kubesolo-masq for pod-masquerade setup, but inject-kubesolo.sh only bundled xtables-nft-multi. Without standalone nft in $PATH, KubeSolo FATAL'd at startup. Fix: copy /usr/sbin/nft + its non-shared libs (libnftables, libedit, libjansson, libgmp, libtinfo, libbsd, libmd) into the rootfs. (51c1f78)
  • nftables address-family handlers — nf_tables core was loaded but no address families were registered, so nft add table ip ... returned EOPNOTSUPP. The bool Kconfigs CONFIG_NF_TABLES_IPV4, CONFIG_NF_TABLES_IPV6, CONFIG_NF_TABLES_INET, CONFIG_NF_TABLES_NETDEV are required and weren't in the fragment. Fix: add to kernel-container.fragment as =y. (7e46f8f)
  • kube-proxy nftables-backend expression modules — Kubernetes 1.34's kube-proxy nft backend uses numgen, hash, limit, log expressions. The corresponding kernel modules (CONFIG_NFT_NUMGEN, etc.) were missing from both the fragment and the runtime module list, so even after a kernel rebuild, stage 30 didn't load them and stage 85's kernel.modules_disabled=1 lockdown prevented on-demand loads. Fix: add to both kernel-container.fragment (as =m) and modules.list / modules-arm64.list. (31eee77, 3bcf2e1)
  • modules.list inline-comment parser bug — the inject script's comment-strip only matched lines starting with #, not lines with inline # comment tails. So nft_numgen # foo was passed verbatim to modprobe, resolved to nothing, and the .ko never made it into the initramfs. Fix: parse with mod="${mod%%#*}" to strip inline tails. (bc3300e)
  • Banner only printed on kubeconfig success — 90-kubesolo.sh gated the host-access banner behind if [ -f $KUBECONFIG_PATH ]. When KubeSolo crashed early (bug #2 above) or the wait loop timed out, the user never saw the connection instructions. Fix: write the banner to /etc/motd AND print it unconditionally after the wait loop. (51c1f78)
  • dev-vm-arm64.sh missing port-8080 hostfwd — the in-VM HTTP server that serves the kubeconfig listens on port 8080, but the QEMU -net user line only forwarded 6443 and 2222, so curl http://localhost:8080 from the host machine connected to nothing. Fix: add the third hostfwd. (fbe2d0b)
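
The inline-comment fix above comes down to one parameter expansion. A minimal sketch of the parsing loop follows, assuming one module name per line; parse_modules is a hypothetical name, and the real inject script may structure its loop differently.

```shell
#!/bin/sh
# parse_modules: reads a modules.list-style file on stdin and prints one
# clean module name per line.
parse_modules() {
    while IFS= read -r mod || [ -n "$mod" ]; do
        # Strip everything from the first '#': handles full-line and inline comments.
        mod="${mod%%#*}"
        # Drop the whitespace left around the name.
        mod=$(printf '%s' "$mod" | tr -d ' \t\r\n')
        [ -n "$mod" ] && printf '%s\n' "$mod"
    done
    return 0
}
```

With this loop, a line such as `nft_numgen # foo` yields `nft_numgen`, and comment-only or blank lines yield nothing, so only real module names reach modprobe.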

Fixed (CI)

  • release.yaml workflow rewritten so v0.3.1+ tag pushes auto-publish a complete release page on Gitea. Three changes: actions/upload-artifact is pinned to @v3 for act_runner compatibility; the softprops/action-gh-release@v2 step is replaced with a direct curl against /api/v1/repos/.../releases (softprops hard-codes api.github.com, so it silently no-ops on Gitea); and a new build-disk-arm64 job builds on the arm64-linux runner. v0.3.0's manual-upload-only release was the canary that exposed all three bugs. (f8c308d)

Known issues carried forward to v0.3.2

These don't block normal operation but are tracked:

  • xt_comment userspace extension load fails on the iptables-nft path, so kubelet skips installing its KUBE-FIREWALL rule. Reported as Couldn't load match 'comment' in the boot log. kubelet continues without the localhost-drop rule.
  • containerd-shim-runc-v2 -info probe reports runc: executable file not found in $PATH. Cosmetic — containerd uses the absolute path from its config when actually launching containers.
  • kube-proxy conntrack cleanup logs Failed to list conntrack entries: invalid argument every cleanup cycle. Probably needs CONFIG_NF_CONNTRACK_PROCFS or netlink-glue tweaks.
  • Several pods restart 12 times on first boot due to a PLEG / runtime-probe race in the kubelet startup path. Pods stabilise.

[0.3.0] - 2026-05-14

The main themes: generic ARM64 (not just Raspberry Pi), an honest update lifecycle with state file + metrics, OCI multi-arch distribution via ghcr.io, and policy gates (channels, maintenance windows, version stepping-stones, pre-flight checks, auto-rollback).

Added

  • Generic ARM64 build track distinct from Raspberry Pi:
    • make kernel-arm64 builds a mainline kernel.org LTS kernel (6.12.10 by default) from arm64 defconfig + shared kernel-container.fragment + arm64 virt-host enables (VIRTIO_*, EFI_STUB, NVMe).
    • make disk-image-arm64 produces a UEFI-bootable raw GPT image with A/B system partitions and GRUB-EFI ARM64. Targets QEMU virt, Graviton, Ampere, or any UEFI ARM64 host.
    • hack/dev-vm-arm64.sh --disk boots the built image through QEMU UEFI for end-to-end testing.
    • test/qemu/test-boot-arm64-disk.sh automated boot smoke test.
  • Bumped KubeSolo to v1.1.5 (was v1.1.0). New cloud-init flags surfaced:
    • kubesolo.full (v1.1.4+) — disable edge-optimised overrides
    • kubesolo.disable-ipv6 (v1.1.5+)
    • kubesolo.db-wal-repair (v1.1.5+) — recover from unclean shutdowns
  • Per-arch supply-chain verification: KUBESOLO_SHA256_AMD64 and KUBESOLO_SHA256_ARM64 in versions.env, applied to the tarball before extract.
  • docs/arm64-architecture.md — defines the generic-vs-RPi two-track layout.
  • docs/arm64-status.md — Phase 3 status snapshot, known limitations, what's needed to ship.
  • docs/ci-runners.md — Gitea Actions runner setup (Odroid arm64-linux).
  • Update agent state machine and observability (update/pkg/state):
    • Persistent on-disk state.json at /var/lib/kubesolo/update/state.json (atomic write via tmp + rename). Records Phase (Idle / Checking / Downloading / Staged / Activated / Verifying / Success / RolledBack / Failed), FromVersion, ToVersion, StartedAt, UpdatedAt, LastError, AttemptCount, HealthCheckFailures.
    • apply, activate, healthcheck, rollback all transition state explicitly on entry / exit / failure. Errors land in LastError so status can show why.
    • kubesolo-update status --json emits the full state for orchestration tooling. Human-readable mode adds an "Update Lifecycle" section when not idle.
    • New Prometheus metrics: kubesolo_update_phase{phase="..."} (all 9 phase labels always emitted), kubesolo_update_attempts_total, kubesolo_update_last_attempt_timestamp_seconds.
  • Channels, maintenance windows, version policy (update/pkg/config):
    • /etc/kubesolo/update.conf (key=value, comments, missing-OK) configures server, channel, maintenance_window, pubkey, healthcheck_url, auto_rollback_after.
    • cloud-init top-level updates: block writes update.conf on first boot. Empty block leaves any existing file alone.
    • apply enforces four gates before download: maintenance window, channel match, runtime architecture match, min_compatible_version stepping-stone. All gate failures land in the state machine as Failed with a clear LastError. --force bypasses window + node-block-label.
    • UpdateMetadata JSON gains channel, min_compatible_version, architecture (all optional, omitempty).
  • OCI registry distribution (update/pkg/oci, ~280 LOC, 9 tests):
    • kubesolo-update apply --registry ghcr.io/<org>/kubesolo-os --tag stable pulls update artifacts from any OCI-compliant registry. Multi-arch indexes resolve to the runtime.GOARCH-matching manifest automatically.
    • Custom media types: application/vnd.kubesolo.os.kernel.v1+octet-stream and application/vnd.kubesolo.os.initramfs.v1+gzip. Annotations: io.kubesolo.os.{version,channel,architecture,min_compatible_version,release_notes,release_date}.
    • End-to-end digest verification from manifest to blobs via oras-go/v2.
    • build/scripts/push-oci-artifact.sh publishes per-arch artifacts via oras. Multi-arch index composition documented inline.
    • Dependencies added (update module only): oras.land/oras-go/v2 and transitive opencontainers/{go-digest,image-spec} + golang.org/x/sync.
  • Pre-flight gates and deeper healthcheck (update/pkg/health extended, update/pkg/partition extended):
    • Free-space pre-flight on the passive partition (image + 10% headroom) via partition.FreeBytes / HasFreeSpaceFor.
    • Node-block-label pre-flight: refuses if the local K8s node carries updates.kubesolo.io/block=true. Silently allowed when no kubeconfig (air-gap). Skipped by --force.
    • CheckKubeSystemReady waits until every kube-system pod has held Running for ≥ N seconds (configurable via --kube-system-settle).
    • CheckProbeURL GETs an operator-supplied URL; 200 = pass. Configurable via --healthcheck-url or healthcheck_url= in update.conf.
    • CheckDiskWritable writes / fsyncs / reads / deletes a probe file under /var/lib/kubesolo to catch a wedged data partition.
    • --auto-rollback-after N (also auto_rollback_after= in update.conf): after N consecutive post-activation healthcheck failures, the agent calls ForceRollback() and the operator/init reboots. Reset to 0 on a clean pass.
  • .gitea/workflows/build-arm64.yaml — full ARM64 build on the Odroid self-hosted runner. Triggers on push to main, tags, and workflow_dispatch. Boot smoke test marked continue-on-error pending KVM or real-hardware validation.
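
The update.conf format described above (key=value, comments, missing file tolerated) is simple enough to read from shell as well. The sketch below is not the agent's Go parser from update/pkg/config. conf_get is a hypothetical helper, and it assumes that comment lines start with # and that keys and values carry no surrounding whitespace.

```shell
#!/bin/sh
# conf_get FILE KEY [DEFAULT]
# Prints the last value assigned to KEY in FILE, or DEFAULT if the key
# (or the whole file) is missing. A missing file is not an error.
conf_get() {
    cg_file=$1; cg_key=$2; cg_val=${3-}
    if [ -f "$cg_file" ]; then
        while IFS= read -r cg_line || [ -n "$cg_line" ]; do
            case $cg_line in
                '#'*|'') continue ;;                    # comment or blank line
                "$cg_key="*) cg_val=${cg_line#*=} ;;    # keep text after the first '='
            esac
        done < "$cg_file"
    fi
    printf '%s\n' "$cg_val"
}
```

A caller could then write `channel=$(conf_get /etc/kubesolo/update.conf channel stable)`; the default value `stable` here is only an example, not a documented default.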

Changed

  • build/scripts/build-kernel-arm64.sh is now the generic ARM64 kernel build (mainline kernel.org LTS, generic UEFI/virtio).
  • Renamed build/scripts/build-kernel-rpi.sh (was build-kernel-arm64.sh). RPi kernel build (raspberrypi/linux fork, bcm2711_defconfig) lives here now.
  • Renamed build/config/kernel-container.fragment (was rpi-kernel-config.fragment). The old name was a misnomer: the contents are arch-agnostic and are now shared across x86, ARM64-generic, and RPi kernels.
  • build/scripts/build-kernel.sh (x86) refactored to consume the shared fragment via a generic apply_fragment function. ~50 lines of duplication killed.
  • KUBESOLO_VERSION moved out of fetch-components.sh defaults into versions.env. Bumping is now a one-line PR.
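
The changelog does not show the body of apply_fragment, so the following is only a guess at the general technique: overlay each CONFIG_ line from a shared fragment onto an existing .config. The argument order and the temporary-file handling are assumptions. The kernel tree also ships scripts/kconfig/merge_config.sh for the same purpose.

```shell
#!/bin/sh
# apply_fragment FRAGMENT CONFIG
# For every CONFIG_FOO=value line in FRAGMENT, drop any existing setting of
# CONFIG_FOO in CONFIG (including "# CONFIG_FOO is not set") and append the
# fragment's line instead.
apply_fragment() {
    af_fragment=$1; af_config=$2
    while IFS= read -r af_line || [ -n "$af_line" ]; do
        case $af_line in
            CONFIG_*=*) ;;
            *) continue ;;               # skip comments and blank lines
        esac
        af_sym=${af_line%%=*}
        # grep -v exits non-zero when nothing is left; that is fine here.
        grep -v -e "^${af_sym}=" -e "^# ${af_sym} is not set" "$af_config" \
            > "$af_config.tmp" || true
        printf '%s\n' "$af_line" >> "$af_config.tmp"
        mv "$af_config.tmp" "$af_config"
    done < "$af_fragment"
}
```

In a real kernel build, a step such as `make olddefconfig` would normally follow so that Kconfig can resolve dependencies between the merged symbols.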

Fixed

  • Native ARM64 build hosts (e.g. an Odroid runner) no longer require the x86 cross-compiler. Both build-kernel-arm64.sh and build-kernel-rpi.sh detect uname -m and use the host's gcc directly when arch matches.
  • ARM64 grub.cfg console ordering: ttyAMA0 is now the primary console because it is listed last (console=ttyS0,... console=ttyAMA0,...). Init output is visible on QEMU virt and most ARM64 SBCs without further configuration.
  • ARM64 boot: replaced piCore64's /init with our staged init at /init and /sbin/init. Previously the kernel ran piCore's TCE handler which segfaulted in our environment.
  • ARM64 boot: replaced piCore64's broken dynamic BusyBox with the build host's busybox-static. piCore's binary triggered EL0 instruction-abort panics on QEMU virt under both -cpu cortex-a72 and -cpu max.
  • POSIX-character-class portability: tr -d '[:space:]' in 30-kernel-modules.sh and 40-sysctl.sh replaced with explicit ' \t\r\n'. Ubuntu's busybox-static 1.30.1 doesn't parse [:space:] and instead deletes the literal characters [ : s p a c e ], which truncated module names (virtio_net → virtio_nt, etc.) and sysctl keys.
  • inject-kubesolo.sh no longer copies init/lib/functions.sh into init.d/. Previously the main init loop tried to run it as a stage after stage 90 and panicked with "Init completed without exec'ing KubeSolo".
  • ARM64 disk image: TARGET_ARCH=arm64 create-disk-image.sh produces BOOTAA64.EFI via grub-mkimage -O arm64-efi (not bootx64.efi). Skips the BIOS-only grub-install --target=i386-pc step.
  • build/Dockerfile.builder: added grub-efi-amd64-bin, grub-efi-arm64-bin, grub-pc-bin, grub-common, grub2-common, and busybox-static so the Docker-based build flow can produce ARM64 disk images and gets the same BusyBox swap behaviour as native builds.
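
The portable whitespace strip from the tr fix above looks like this in isolation. strip_ws is a hypothetical wrapper name; the scripts themselves may call tr inline.

```shell
#!/bin/sh
# strip_ws: removes spaces, tabs, carriage returns, and newlines from stdin.
# The explicit set avoids POSIX character classes, which some BusyBox
# builds of tr do not parse.
strip_ws() {
    tr -d ' \t\r\n'
}

# Example: a module name read from a file with CRLF line endings.
name=$(printf '  virtio_net\r\n' | strip_ws)
printf '%s\n' "$name"
```

Because the set contains no letters, the name survives intact even on a tr that treats a bracketed class as literal characters.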

Known limitations (deferred to follow-up)

  • ARM64 LABEL= resolution doesn't work yet — piCore's blkid/findfs crash in QEMU and our static busybox lacks the applets. Hardcoded /dev/vda4 as a workaround in build/grub/grub-arm64.cfg. Production fix: ship static blkid/findfs or replace LABEL resolution with a sysfs walk.
  • AppArmor profile load fails on ARM64 (apparmor_parser ABI mismatch). Init reports it; boot continues without enforcement.
  • OCI signature verification is deferred. The HTTP transport still honours --pubkey for .sig files; the OCI transport is digest-verified end-to-end via oras-go but does not yet consume cosign-style referrer attestations. Targeted for v0.3.1.
  • Real-hardware validation of the generic ARM64 image is still pending. Builds and boots end-to-end under QEMU virt; production certification waits on a Graviton / Ampere run.
  • QEMU TCG performance can trigger KubeSolo's first-boot image-import deadline. Not a defect in the OS itself; real hardware and KVM-accelerated QEMU complete the import in seconds.

[0.2.0] - 2026-02-12

Added

  • Cloud-init: support all documented KubeSolo CLI flags (--local-storage-shared-path, --debug, --pprof-server, --portainer-edge-id, --portainer-edge-key, --portainer-edge-async)
  • Cloud-init: full-config.yaml example showing all supported parameters
  • Cloud-init: KubeSolo configuration reference table in docs/cloud-init.md
  • Security hardening: mount hardening, sysctl, kernel module lock, AppArmor profiles
  • ARM64 Raspberry Pi support with A/B boot via tryboot
  • BootEnv abstraction for GRUB and RPi boot environments
  • Go 1.25.5 installed on host for native builds

[0.1.0] - 2026-02-12

First release with all 5 design-doc phases complete. ISO boots and runs K8s pods.

Added

Custom Kernel

  • Custom kernel build (6.18.2-tinycore64) with container-critical configs
  • Added CONFIG_CGROUP_BPF, CONFIG_DEVTMPFS, CONFIG_DEVTMPFS_MOUNT, CONFIG_MEMCG, CONFIG_CFS_BANDWIDTH
  • Stripped unnecessary subsystems (sound, GPU, wireless, Bluetooth, etc.)
  • Selective kernel module install — only modules.list + transitive deps in initramfs

Init System (Phase 1)

  • POSIX sh init system with staged boot (00-early-mount through 90-kubesolo)
  • switch_root from initramfs to SquashFS root
  • Persistent data partition mount with bind-mounts for K8s state
  • Kernel module loading, sysctl tuning, network, hostname, NTP
  • Emergency shell fallback on boot failure
  • Device node creation via mknod fallback from sysfs
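
A staged boot of this kind can be driven by a very small loop. The sketch below is an assumption about the general shape, not the project's init: run_stages is a hypothetical name, and it executes each stage in a child shell, whereas a real init may source the stages so that the last one can exec the final process.

```shell
#!/bin/sh
# run_stages DIR: run every NN-name.sh in DIR in lexical (numeric-prefix)
# order. Stops at the first failing stage and returns its exit status.
run_stages() {
    for rs_stage in "$1"/[0-9][0-9]-*.sh; do
        [ -f "$rs_stage" ] || continue    # the glob matched nothing
        echo "stage: ${rs_stage##*/}"
        sh "$rs_stage" || return $?
    done
}
```

A caller might pair this with the emergency shell fallback listed above, for example `run_stages /path/to/init.d || exec sh`, where the directory path is a placeholder.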

Cloud-Init (Phase 2)

  • Go-based cloud-init parser (~2.7 MB static binary)
  • Network configuration: DHCP and static IP modes
  • Hostname and machine-id generation
  • KubeSolo configuration (node-name, extra flags)
  • Portainer Edge Agent integration via K8s manifest injection
  • Persistent config saved to /mnt/data/ for next-boot fast path
  • 22 Go tests

A/B Atomic Updates (Phase 3)

  • 4-partition GPT disk image: EFI + System A + System B + Data
  • GRUB 2 bootloader with A/B slot selection and boot counter rollback
  • Go update agent (~6.0 MB static binary) with check, apply, activate, rollback commands
  • Health check: containerd + K8s API + node Ready verification
  • Update server protocol: HTTP serving latest.json + image files
  • K8s CronJob for automated update checks (every 6 hours)
  • Zero external Go dependencies — uses kubectl/ctr exec commands

Production Hardening (Phase 4)

  • Ed25519 image signing with pure Go stdlib (zero external deps)
  • Key generation, signing, and verification CLI commands
  • Portainer Edge Agent deployment via cloud-init
  • SSH extension injection for debugging (hack/inject-ssh.sh)
  • Boot time and resource usage benchmarks
  • Deployment guide documentation

Distribution & Fleet Management (Phase 5)

  • Gitea Actions CI/CD (test + build + shellcheck on push, release on tags)
  • OCI container image packaging (scratch-based)
  • Prometheus metrics endpoint (zero-dependency text exposition format)
  • USB provisioning script with cloud-init injection
  • ARM64 cross-compilation support

Build System

  • Makefile with full build orchestration
  • Dockerized reproducible builds (build/Dockerfile.builder)
  • Component fetching with version pinning
  • ISO and raw disk image creation
  • Fast rebuild path (make quick)

Documentation

  • Architecture design document
  • Boot flow reference
  • A/B update flow reference
  • Cloud-init configuration reference
  • Deployment and operations guide

Fixed

  • Replaced grep -oP with POSIX-safe sed in functions.sh (BusyBox compatibility)
  • Replaced grep -qiE with grep -qi -e pattern (POSIX compliance)
  • Fixed KVM flag handling in dev-vm.sh (bash array context)
  • Added iptables table pre-initialization before kube-proxy start (nf_tables issue)
  • Added /dev/kmsg and /etc/machine-id creation for kubelet
  • Added CA certificates bundle to initramfs (containerd TLS verification for Docker Hub)
  • Added DNS fallback (10.0.2.3 + 8.8.8.8) when DHCP client doesn't populate resolv.conf
  • Added headless Service to Portainer Edge Agent manifest (agent peer discovery DNS)
  • Added kubesolo.edge_id/edge_key kernel boot parameters for Portainer Edge
  • Added auto-format of unformatted data disks on first boot
  • Rewrote dev-vm.sh for macOS: bsdtar ISO extraction, Homebrew mkfs.ext4 detection, direct kernel boot, TCG acceleration, port 8080 forwarding
  • Kubeconfig now served via HTTP on port 8080 (serial console truncates base64 lines)
  • Added 127.0.0.1 and 10.0.2.15 to API server SANs for QEMU port forwarding
  • dev-vm.sh now works on Linux: fallback ISO extraction via isoinfo or loop mount, KVM auto-detection, platform-aware error messages
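
The DNS fallback from the list above can be sketched as follows. ensure_dns is a hypothetical name and the file path is passed in as an argument; only the two fallback addresses come from the changelog (10.0.2.3 is the resolver that QEMU user-mode networking provides).

```shell
#!/bin/sh
# ensure_dns FILE: if FILE has no nameserver entry, append the fallbacks.
ensure_dns() {
    ed_file=$1
    if [ -f "$ed_file" ] && grep -q '^nameserver' "$ed_file"; then
        return 0                      # DHCP already populated the file
    fi
    # printf reuses the format for each argument, giving one line per address.
    printf 'nameserver %s\n' 10.0.2.3 8.8.8.8 >> "$ed_file"
}
```

Running `ensure_dns /etc/resolv.conf` after the DHCP client has finished leaves an already populated file untouched.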