Files
kubesolo-os/CHANGELOG.md
Adolfo Delorenzo 81b29fd237
Some checks failed
ARM64 Build / Build generic ARM64 disk image (push) Failing after 3s
CI / Go Tests (push) Successful in 1m53s
CI / Shellcheck (push) Successful in 1m2s
Release / Test (push) Successful in 1m37s
CI / Build Go Binaries (amd64, linux, linux-amd64) (push) Successful in 1m33s
CI / Build Go Binaries (arm64, linux, linux-arm64) (push) Successful in 1m34s
Release / Build Binaries (linux-amd64) (push) Successful in 1m26s
Release / Build Binaries (linux-arm64) (push) Successful in 1m37s
Release / Build ARM64 disk image (push) Failing after 3s
Release / Build x86_64 ISO + disk image (push) Failing after 44s
Release / Publish Gitea Release (push) Has been skipped
release: v0.3.1
VERSION 0.3.0 -> 0.3.1. Append CHANGELOG entry covering the eight fix
commits since v0.3.0 (dual-glibc, nft binary, NF_TABLES_IPV4 family,
NFT_NUMGEN expressions, modules.list parser, banner+motd, port 8080
hostfwd, and the release.yaml workflow rewrite).

End-to-end validated on Apple Silicon Mac under QEMU virt + HVF:
  - kubectl get nodes -> kubesolo-XXXXXX  Ready
  - kube-system/coredns                   1/1 Running
  - local-path-storage/local-path-prov    1/1 Running
  - default/nginx-test (user workload)    1/1 Running (pulled+started 11s)

Tagging this release is also the first real exercise of the rewritten
release.yaml workflow. If it works as designed, the v0.3.1 release page
should populate automatically with: x86 ISO + .img.xz, ARM64 .arm64.img.xz,
Go binaries (cloudinit + update, amd64 + arm64), and SHA256SUMS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 16:29:06 -06:00

18 KiB
Raw Blame History

Changelog

All notable changes to KubeSolo OS are documented in this file.

Format based on Keep a Changelog, versioning follows Semantic Versioning.

[0.3.1] - 2026-05-15

First fully-functional generic ARM64 release. v0.3.0 shipped the build scaffold; v0.3.1 makes it actually boot a Kubernetes cluster end-to-end on QEMU virt under HVF acceleration. Validated by deploying CoreDNS, local-path-provisioner, and an nginx:alpine workload — all reach Running, kubectl get nodes reports Ready.

Fixed

  • Dual-glibc loading on ARM64 — piCore64's /lib/libc.so.6 and the build host's /lib/$LIB_ARCH/libc.so.6 could both be resolved into the same process by the dynamic linker, triggering *** stack smashing detected *** aborts when stack frames crossed between functions linked against different libcs. Fix: bundle the full glibc family (libc + libpthread + libdl + libm + libresolv + librt + libanl + libgcc_s + ld.so), delete piCore's duplicates in /lib/, and write /etc/ld.so.conf + ldconfig -r so the runtime linker has a deterministic search order. (76ed2ff)
  • nft binary not bundled — KubeSolo v1.1.4+ runs nft add table ip kubesolo-masq for pod-masquerade setup, but inject-kubesolo.sh only bundled xtables-nft-multi. Without standalone nft in $PATH, KubeSolo FATAL'd at startup. Fix: copy /usr/sbin/nft + its non-shared libs (libnftables, libedit, libjansson, libgmp, libtinfo, libbsd, libmd) into the rootfs. (51c1f78)
  • nftables address-family handlersnf_tables core was loaded but no address families were registered, so nft add table ip ... returned EOPNOTSUPP. The bool Kconfigs CONFIG_NF_TABLES_IPV4, CONFIG_NF_TABLES_IPV6, CONFIG_NF_TABLES_INET, CONFIG_NF_TABLES_NETDEV are required and weren't in the fragment. Fix: add to kernel-container.fragment as =y. (7e46f8f)
  • kube-proxy nftables-backend expression modules — Kubernetes 1.34's kube-proxy nft backend uses numgen, hash, limit, log expressions. The corresponding kernel modules (CONFIG_NFT_NUMGEN, etc.) were missing from the fragment AND the runtime module list, so even after a kernel rebuild stage 30 didn't load them and stage 85's kernel.modules_disabled=1 lockdown prevented on-demand loads. Fix: add to both kernel-container.fragment (as =m) and modules.list / modules-arm64.list. (31eee77, 3bcf2e1)
  • modules.list inline-comment parser bug — the inject script's comment-strip only matched lines starting with #, not lines with inline # comment tails. So nft_numgen # foo was passed verbatim to modprobe, resolved to nothing, and the .ko never made it into the initramfs. Fix: parse with mod="${mod%%#*}" to strip inline tails. (bc3300e)
  • Banner only printed on kubeconfig success90-kubesolo.sh gated the host-access banner behind if [ -f $KUBECONFIG_PATH ]. When KubeSolo crashed early (bug #2 above) or the wait loop timed out, the user never saw the connection instructions. Fix: write the banner to /etc/motd AND print it unconditionally after the wait loop. (51c1f78)
  • dev-vm-arm64.sh missing port-8080 hostfwd — the in-VM HTTP server that serves the kubeconfig listens on port 8080, but the QEMU -net user line only forwarded 6443 and 2222, so curl http://localhost:8080 from the host machine connected to nothing. Fix: add the third hostfwd. (fbe2d0b)

Fixed (CI)

  • release.yaml workflow rewritten so v0.3.1+ tag pushes auto-publish a complete release page on Gitea: actions/upload-artifact pinned to @v3 for act_runner compatibility, the softprops/action-gh-release@v2 step replaced with a direct curl against /api/v1/repos/.../releases (softprops hard-codes api.github.com so it silently no-ops on Gitea), added a build-disk-arm64 job that builds on the arm64-linux runner. v0.3.0's manual-upload-only release was the canary that exposed all three bugs. (f8c308d)

Known issues carried forward to v0.3.2

These don't block normal operation but are tracked:

  • xt_comment userspace extension load fails on the iptables-nft path, causing kubelet's KUBE-FIREWALL rule install to skip. Reported as Couldn't load match 'comment' in the boot log. kubelet continues without the localhost-drop rule.
  • containerd-shim-runc-v2 -info probe reports runc: executable file not found in $PATH. Cosmetic — containerd uses the absolute path from its config when actually launching containers.
  • kube-proxy conntrack cleanup logs Failed to list conntrack entries: invalid argument every cleanup cycle. Probably needs CONFIG_NF_CONNTRACK_PROCFS or netlink-glue tweaks.
  • Several pods restart 12 times on first boot due to a PLEG / runtime-probe race in the kubelet startup path. Pods stabilise.

[0.3.0] - 2026-05-14

The main themes: generic ARM64 (not just Raspberry Pi), an honest update lifecycle with state file + metrics, OCI multi-arch distribution via ghcr.io, and policy gates (channels, maintenance windows, version stepping-stones, pre-flight checks, auto-rollback).

Added

  • Generic ARM64 build track distinct from Raspberry Pi:
    • make kernel-arm64 builds a mainline kernel.org LTS kernel (6.12.10 by default) from arm64 defconfig + shared kernel-container.fragment + arm64 virt-host enables (VIRTIO_*, EFI_STUB, NVMe).
    • make disk-image-arm64 produces a UEFI-bootable raw GPT image with A/B system partitions and GRUB-EFI ARM64. Targets QEMU virt, Graviton, Ampere, or any UEFI ARM64 host.
    • hack/dev-vm-arm64.sh --disk boots the built image through QEMU UEFI for end-to-end testing.
    • test/qemu/test-boot-arm64-disk.sh automated boot smoke test.
  • Bumped KubeSolo to v1.1.5 (was v1.1.0). New cloud-init flags surfaced:
    • kubesolo.full (v1.1.4+) — disable edge-optimised overrides
    • kubesolo.disable-ipv6 (v1.1.5+)
    • kubesolo.db-wal-repair (v1.1.5+) — recover from unclean shutdowns
  • Per-arch supply-chain verification: KUBESOLO_SHA256_AMD64 and KUBESOLO_SHA256_ARM64 in versions.env, applied to the tarball before extract.
  • docs/arm64-architecture.md — defines the generic-vs-RPi two-track layout.
  • docs/arm64-status.md — Phase 3 status snapshot, known limitations, what's needed to ship.
  • docs/ci-runners.md — Gitea Actions runner setup (Odroid arm64-linux).
  • Update agent state machine and observability (update/pkg/state):
    • Persistent on-disk state.json at /var/lib/kubesolo/update/state.json (atomic write via tmp + rename). Records Phase (Idle / Checking / Downloading / Staged / Activated / Verifying / Success / RolledBack / Failed), FromVersion, ToVersion, StartedAt, UpdatedAt, LastError, AttemptCount, HealthCheckFailures.
    • apply, activate, healthcheck, rollback all transition state explicitly on entry / exit / failure. Errors land in LastError so status can show why.
    • kubesolo-update status --json emits the full state for orchestration tooling. Human-readable mode adds an "Update Lifecycle" section when not idle.
    • New Prometheus metrics: kubesolo_update_phase{phase="..."} (all 9 phase labels always emitted), kubesolo_update_attempts_total, kubesolo_update_last_attempt_timestamp_seconds.
  • Channels, maintenance windows, version policy (update/pkg/config):
    • /etc/kubesolo/update.conf (key=value, comments, missing-OK) configures server, channel, maintenance_window, pubkey, healthcheck_url, auto_rollback_after.
    • cloud-init top-level updates: block writes update.conf on first boot. Empty block leaves any existing file alone.
    • apply enforces four gates before download: maintenance window, channel match, runtime architecture match, min_compatible_version stepping-stone. All gate failures land in the state machine as Failed with a clear LastError. --force bypasses window + node-block-label.
    • UpdateMetadata JSON gains channel, min_compatible_version, architecture (all optional, omitempty).
  • OCI registry distribution (update/pkg/oci, ~280 LOC, 9 tests):
    • kubesolo-update apply --registry ghcr.io/<org>/kubesolo-os --tag stable pulls update artifacts from any OCI-compliant registry. Multi-arch indexes resolve to the runtime.GOARCH-matching manifest automatically.
    • Custom media types: application/vnd.kubesolo.os.kernel.v1+octet-stream and application/vnd.kubesolo.os.initramfs.v1+gzip. Annotations: io.kubesolo.os.{version,channel,architecture,min_compatible_version, release_notes,release_date}.
    • End-to-end digest verification from manifest to blobs via oras-go/v2.
    • build/scripts/push-oci-artifact.sh publishes per-arch artifacts via oras. Multi-arch index composition documented inline.
    • Dependencies added (update module only): oras.land/oras-go/v2 and transitive opencontainers/{go-digest,image-spec} + golang.org/x/sync.
  • Pre-flight gates and deeper healthcheck (update/pkg/health extended, update/pkg/partition extended):
    • Free-space pre-flight on the passive partition (image + 10% headroom) via partition.FreeBytes / HasFreeSpaceFor.
    • Node-block-label pre-flight: refuses if the local K8s node carries updates.kubesolo.io/block=true. Silently allowed when no kubeconfig (air-gap). Skipped by --force.
    • CheckKubeSystemReady waits until every kube-system pod has held Running for ≥ N seconds (configurable via --kube-system-settle).
    • CheckProbeURL GETs an operator-supplied URL; 200 = pass. Configurable via --healthcheck-url or healthcheck_url= in update.conf.
    • CheckDiskWritable writes / fsyncs / reads / deletes a probe file under /var/lib/kubesolo to catch a wedged data partition.
    • --auto-rollback-after N (also auto_rollback_after= in update.conf): after N consecutive post-activation healthcheck failures, the agent calls ForceRollback() and the operator/init reboots. Reset to 0 on a clean pass.
  • .gitea/workflows/build-arm64.yaml — full ARM64 build on the Odroid self-hosted runner. Triggers on push to main, tags, and workflow_dispatch. Boot smoke test marked continue-on-error pending KVM or real-hardware validation.

Changed

  • build/scripts/build-kernel-arm64.sh is now the generic ARM64 kernel build (mainline kernel.org LTS, generic UEFI/virtio).
  • Renamed build/scripts/build-kernel-rpi.sh (was build-kernel-arm64.sh). RPi kernel build (raspberrypi/linux fork, bcm2711_defconfig) lives here now.
  • Renamed build/config/kernel-container.fragment (was rpi-kernel-config.fragment). Misnomer: contents are arch-agnostic and now shared across x86, ARM64-generic, and RPi kernels.
  • build/scripts/build-kernel.sh (x86) refactored to consume the shared fragment via a generic apply_fragment function. ~50 lines of duplication killed.
  • KUBESOLO_VERSION moved out of fetch-components.sh defaults into versions.env. Bumping is now a one-line PR.

Fixed

  • Native ARM64 build hosts (e.g. an Odroid runner) no longer require the x86 cross-compiler. Both build-kernel-arm64.sh and build-kernel-rpi.sh detect uname -m and use the host's gcc directly when arch matches.
  • ARM64 grub.cfg console ordering: ttyAMA0 is now the primary console (console=ttyS0,... console=ttyAMA0,...). Init output is now visible on QEMU virt and most ARM64 SBCs without further configuration.
  • ARM64 boot: replaced piCore64's /init with our staged init at /init and /sbin/init. Previously the kernel ran piCore's TCE handler which segfaulted in our environment.
  • ARM64 boot: replaced piCore64's broken dynamic BusyBox with the build host's busybox-static. piCore's binary triggered EL0 instruction-abort panics on QEMU virt under both -cpu cortex-a72 and -cpu max.
  • POSIX-character-class portability: tr -d '[:space:]' in 30-kernel-modules.sh and 40-sysctl.sh replaced with explicit ' \t\r\n'. Ubuntu's busybox-static 1.30.1 doesn't parse [:space:] and instead deletes the literal characters [ : s p a c e ], which truncated module names (virtio_netvirtio_nt, etc.) and sysctl keys.
  • inject-kubesolo.sh no longer copies init/lib/functions.sh into init.d/. Previously the main init loop tried to run it as a stage after stage 90 and panicked with "Init completed without exec'ing KubeSolo".
  • ARM64 disk image: TARGET_ARCH=arm64 create-disk-image.sh produces BOOTAA64.EFI via grub-mkimage -O arm64-efi (not bootx64.efi). Skips the BIOS-only grub-install --target=i386-pc step.
  • build/Dockerfile.builder: added grub-efi-amd64-bin, grub-efi-arm64-bin, grub-pc-bin, grub-common, grub2-common, and busybox-static so the Docker-based build flow can produce ARM64 disk images and gets the same BusyBox swap behaviour as native builds.

Known limitations (deferred to follow-up)

  • ARM64 LABEL= resolution doesn't work yet — piCore's blkid/findfs crash in QEMU and our static busybox lacks the applets. Hardcoded /dev/vda4 as a workaround in build/grub/grub-arm64.cfg. Production fix: ship static blkid/findfs or replace LABEL resolution with a sysfs walk.
  • AppArmor profile load fails on ARM64 (apparmor_parser ABI mismatch). Init reports it; boot continues without enforcement.
  • OCI signature verification is deferred. The HTTP transport still honours --pubkey for .sig files; the OCI transport is digest-verified end-to-end via oras-go but does not yet consume cosign-style referrer attestations. Targeted for v0.3.1.
  • Real-hardware validation of the generic ARM64 image is still pending. Builds and boots end-to-end under QEMU virt; production certification waits on a Graviton / Ampere run.
  • QEMU TCG performance can trigger KubeSolo's first-boot image-import deadline. Not a defect in the OS itself; real hardware and KVM-accelerated QEMU complete the import in seconds.

[0.2.0] - 2026-02-12

Added

  • Cloud-init: support all documented KubeSolo CLI flags (--local-storage-shared-path, --debug, --pprof-server, --portainer-edge-id, --portainer-edge-key, --portainer-edge-async)
  • Cloud-init: full-config.yaml example showing all supported parameters
  • Cloud-init: KubeSolo configuration reference table in docs/cloud-init.md
  • Security hardening: mount hardening, sysctl, kernel module lock, AppArmor profiles
  • ARM64 Raspberry Pi support with A/B boot via tryboot
  • BootEnv abstraction for GRUB and RPi boot environments
  • Go 1.25.5 installed on host for native builds

[0.1.0] - 2026-02-12

First release with all 5 design-doc phases complete. ISO boots and runs K8s pods.

Added

Custom Kernel

  • Custom kernel build (6.18.2-tinycore64) with container-critical configs
  • Added CONFIG_CGROUP_BPF, CONFIG_DEVTMPFS, CONFIG_DEVTMPFS_MOUNT, CONFIG_MEMCG, CONFIG_CFS_BANDWIDTH
  • Stripped unnecessary subsystems (sound, GPU, wireless, Bluetooth, etc.)
  • Selective kernel module install — only modules.list + transitive deps in initramfs

Init System (Phase 1)

  • POSIX sh init system with staged boot (00-early-mount through 90-kubesolo)
  • switch_root from initramfs to SquashFS root
  • Persistent data partition mount with bind-mounts for K8s state
  • Kernel module loading, sysctl tuning, network, hostname, NTP
  • Emergency shell fallback on boot failure
  • Device node creation via mknod fallback from sysfs

Cloud-Init (Phase 2)

  • Go-based cloud-init parser (~2.7 MB static binary)
  • Network configuration: DHCP and static IP modes
  • Hostname and machine-id generation
  • KubeSolo configuration (node-name, extra flags)
  • Portainer Edge Agent integration via K8s manifest injection
  • Persistent config saved to /mnt/data/ for next-boot fast path
  • 22 Go tests

A/B Atomic Updates (Phase 3)

  • 4-partition GPT disk image: EFI + System A + System B + Data
  • GRUB 2 bootloader with A/B slot selection and boot counter rollback
  • Go update agent (~6.0 MB static binary) with check, apply, activate, rollback commands
  • Health check: containerd + K8s API + node Ready verification
  • Update server protocol: HTTP serving latest.json + image files
  • K8s CronJob for automated update checks (every 6 hours)
  • Zero external Go dependencies — uses kubectl/ctr exec commands

Production Hardening (Phase 4)

  • Ed25519 image signing with pure Go stdlib (zero external deps)
  • Key generation, signing, and verification CLI commands
  • Portainer Edge Agent deployment via cloud-init
  • SSH extension injection for debugging (hack/inject-ssh.sh)
  • Boot time and resource usage benchmarks
  • Deployment guide documentation

Distribution & Fleet Management (Phase 5)

  • Gitea Actions CI/CD (test + build + shellcheck on push, release on tags)
  • OCI container image packaging (scratch-based)
  • Prometheus metrics endpoint (zero-dependency text exposition format)
  • USB provisioning script with cloud-init injection
  • ARM64 cross-compilation support

Build System

  • Makefile with full build orchestration
  • Dockerized reproducible builds (build/Dockerfile.builder)
  • Component fetching with version pinning
  • ISO and raw disk image creation
  • Fast rebuild path (make quick)

Documentation

  • Architecture design document
  • Boot flow reference
  • A/B update flow reference
  • Cloud-init configuration reference
  • Deployment and operations guide

Fixed

  • Replaced grep -oP with POSIX-safe sed in functions.sh (BusyBox compatibility)
  • Replaced grep -qiE with grep -qi -e pattern (POSIX compliance)
  • Fixed KVM flag handling in dev-vm.sh (bash array context)
  • Added iptables table pre-initialization before kube-proxy start (nf_tables issue)
  • Added /dev/kmsg and /etc/machine-id creation for kubelet
  • Added CA certificates bundle to initramfs (containerd TLS verification for Docker Hub)
  • Added DNS fallback (10.0.2.3 + 8.8.8.8) when DHCP client doesn't populate resolv.conf
  • Added headless Service to Portainer Edge Agent manifest (agent peer discovery DNS)
  • Added kubesolo.edge_id/edge_key kernel boot parameters for Portainer Edge
  • Added auto-format of unformatted data disks on first boot
  • Rewrote dev-vm.sh for macOS: bsdtar ISO extraction, Homebrew mkfs.ext4 detection, direct kernel boot, TCG acceleration, port 8080 forwarding
  • Kubeconfig now served via HTTP on port 8080 (serial console truncates base64 lines)
  • Added 127.0.0.1 and 10.0.2.15 to API server SANs for QEMU port forwarding
  • dev-vm.sh now works on Linux: fallback ISO extraction via isoinfo or loop mount, KVM auto-detection, platform-aware error messages