Changelog
All notable changes to KubeSolo OS are documented in this file.
Format based on Keep a Changelog, versioning follows Semantic Versioning.
[Unreleased]
Pure CI / repository housekeeping; no runtime changes since v0.3.1. All items below shake out workflow-loop bugs exposed by the v0.3.1 release flow on Gitea Actions.
Fixed (CI)
- `build-arm64.yaml` no longer triggers on tag pushes. `release.yaml` already produces the ARM64 disk image as part of the release flow, so triggering both on the same tag wasted an hour of Odroid runner time on a duplicate kernel build. (04a5cd2)
- The gated `build-iso-amd64` job in `release.yaml` (`if: false` until an amd64-linux runner exists) used to advertise `runs-on: amd64-linux`. With no matching runner, Gitea left the job queued forever and the overall workflow run never transitioned to `success`, even though every load-bearing job had finished and the release was already published. Now uses `runs-on: ubuntu-latest` so any runner picks the job up just long enough to evaluate `if: false` and mark it `skipped`. (fb24e64)
- `build-arm64.yaml` now ignores workflow-file, docs, and `*.md` changes via `paths-ignore` (`.gitea/workflows/**`, `.github/workflows/**`, `docs/**`, top-level `*.md`, `.gitignore`). Workflow-only and docs-only commits no longer kick off a 60-minute kernel rebuild on the Odroid. Any change to a kernel fragment, init script, or build script still triggers the full build, as intended. (e1b8a69)
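The effect of the `paths-ignore` list can be approximated outside the CI system. The sketch below is a rough shell model of the rule, not Gitea's actual glob matcher; `path_is_ignored` and `needs_kernel_build` are hypothetical helper names, and the pattern list is copied from the entry above.

```sh
# Rough model of the paths-ignore list (assumption: not Gitea's real matcher).
path_is_ignored() {
    case "$1" in
        .gitea/workflows/*|.github/workflows/*|docs/*) return 0 ;;
        .gitignore) return 0 ;;
        */*) return 1 ;;   # any other file in a subdirectory counts
        *.md) return 0 ;;  # only top-level markdown reaches this branch
    esac
    return 1
}

# Succeeds (exit 0) when at least one changed path is NOT ignored.
needs_kernel_build() {
    for path in "$@"; do
        path_is_ignored "$path" || return 0
    done
    return 1
}
```

A commit touching only `README.md` and `docs/ci-runners.md` would skip the build under this model, while one that also touches `build/config/kernel-container.fragment` would trigger it.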
Changed
- `.gitignore` now excludes `.env`, `.env.*`, `*.token`, `*.pat` to keep Gitea PATs and other credentials used during release ops from being accidentally committed. (48267e1)
[0.3.1] - 2026-05-15
First fully-functional generic ARM64 release. v0.3.0 shipped the build
scaffold; v0.3.1 makes it actually boot a Kubernetes cluster end-to-end
on QEMU virt under HVF acceleration. Validated by deploying CoreDNS,
local-path-provisioner, and an nginx:alpine workload: all reach
Running, and `kubectl get nodes` reports Ready.
Fixed
- Dual-glibc loading on ARM64: piCore64's `/lib/libc.so.6` and the build host's `/lib/$LIB_ARCH/libc.so.6` could both be resolved into the same process by the dynamic linker, triggering `*** stack smashing detected ***` aborts when stack frames crossed between functions linked against different libcs. Fix: bundle the full glibc family (libc + libpthread + libdl + libm + libresolv + librt + libanl + libgcc_s + ld.so), delete piCore's duplicates in `/lib/`, and write `/etc/ld.so.conf` and run `ldconfig -r` so the runtime linker has a deterministic search order. (76ed2ff)
- `nft` binary not bundled: KubeSolo v1.1.4+ runs `nft add table ip kubesolo-masq` for pod-masquerade setup, but `inject-kubesolo.sh` only bundled `xtables-nft-multi`. Without standalone `nft` in `$PATH`, KubeSolo FATAL'd at startup. Fix: copy `/usr/sbin/nft` + its non-shared libs (libnftables, libedit, libjansson, libgmp, libtinfo, libbsd, libmd) into the rootfs. (51c1f78)
- nftables address-family handlers: `nf_tables` core was loaded but no address families were registered, so `nft add table ip ...` returned `EOPNOTSUPP`. The bool Kconfigs `CONFIG_NF_TABLES_IPV4`, `CONFIG_NF_TABLES_IPV6`, `CONFIG_NF_TABLES_INET`, `CONFIG_NF_TABLES_NETDEV` are required and weren't in the fragment. Fix: add to `kernel-container.fragment` as `=y`. (7e46f8f)
- kube-proxy nftables-backend expression modules: Kubernetes 1.34's kube-proxy nft backend uses `numgen`, `hash`, `limit`, `log` expressions. The corresponding kernel modules (`CONFIG_NFT_NUMGEN`, etc.) were missing from the fragment AND the runtime module list, so even after a kernel rebuild stage 30 didn't load them and stage 85's `kernel.modules_disabled=1` lockdown prevented on-demand loads. Fix: add to both `kernel-container.fragment` (as `=m`) and `modules.list` / `modules-arm64.list`. (31eee77, 3bcf2e1)
- `modules.list` inline-comment parser bug: the inject script's comment-strip only matched lines starting with `#`, not lines with inline `# comment` tails. So `nft_numgen # foo` was passed verbatim to modprobe, resolved to nothing, and the .ko never made it into the initramfs. Fix: parse with `mod="${mod%%#*}"` to strip inline tails. (bc3300e)
- Banner only printed on kubeconfig success: `90-kubesolo.sh` gated the host-access banner behind `if [ -f $KUBECONFIG_PATH ]`. When KubeSolo crashed early (the missing `nft` binary, second item above) or the wait loop timed out, the user never saw the connection instructions. Fix: write the banner to `/etc/motd` AND print it unconditionally after the wait loop. (51c1f78)
- `dev-vm-arm64.sh` missing port-8080 hostfwd: the in-VM HTTP server that serves the kubeconfig listens on port 8080, but the QEMU `-net user` line only forwarded 6443 and 2222, so `curl http://localhost:8080` from the host machine connected to nothing. Fix: add the third hostfwd. (fbe2d0b)
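The inline-comment fix in the `modules.list` entry above comes down to one parameter expansion. A minimal sketch, assuming one module name per line; `module_name` is a hypothetical helper, not the function used in `inject-kubesolo.sh`:

```sh
# Hypothetical helper: reduce one modules.list line to a bare module name.
module_name() {
    mod="${1%%#*}"   # drop everything from the first '#' (full-line or inline)
    set -f           # disable globbing for the unquoted expansion below
    set -- $mod      # word-splitting discards leading/trailing whitespace
    set +f
    printf '%s' "${1:-}"
}

# Demo: comment-only lines yield an empty name and are skipped.
printf '%s\n' 'nft_numgen # foo' '# comment only' 'nft_hash' |
while IFS= read -r line; do
    name=$(module_name "$line")
    if [ -n "$name" ]; then
        echo "$name"
    fi
done
```

With the old behaviour the first line would have reached modprobe as `nft_numgen # foo`; here it is reduced to `nft_numgen` before any lookup.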
Fixed (CI)
- `release.yaml` workflow rewritten so v0.3.1+ tag pushes auto-publish a complete release page on Gitea: `actions/upload-artifact` pinned to `@v3` for act_runner compatibility, the `softprops/action-gh-release@v2` step replaced with a direct `curl` against `/api/v1/repos/.../releases` (`softprops` hard-codes `api.github.com` so it silently no-ops on Gitea), added a `build-disk-arm64` job that builds on the `arm64-linux` runner. v0.3.0's manual-upload-only release was the canary that exposed all three bugs. (f8c308d)
Known issues carried forward to v0.3.2
These don't block normal operation but are tracked:
- `xt_comment` userspace extension load fails on the iptables-nft path, causing kubelet's KUBE-FIREWALL rule install to skip. Reported as `Couldn't load match 'comment'` in the boot log. kubelet continues without the localhost-drop rule.
- `containerd-shim-runc-v2 -info` probe reports `runc: executable file not found in $PATH`. Cosmetic: containerd uses the absolute path from its config when actually launching containers.
- `kube-proxy conntrack cleanup` logs `Failed to list conntrack entries: invalid argument` every cleanup cycle. Probably needs `CONFIG_NF_CONNTRACK_PROCFS` or netlink-glue tweaks.
- Several pods restart 1–2 times on first boot due to a PLEG / runtime-probe race in the kubelet startup path. Pods stabilise.
[0.3.0] - 2026-05-14
The main themes: generic ARM64 (not just Raspberry Pi), an honest update lifecycle with state file + metrics, OCI multi-arch distribution via ghcr.io, and policy gates (channels, maintenance windows, version stepping-stones, pre-flight checks, auto-rollback).
Added
- Generic ARM64 build track distinct from Raspberry Pi:
  - `make kernel-arm64` builds a mainline kernel.org LTS kernel (6.12.10 by default) from `arm64 defconfig` + shared `kernel-container.fragment` + arm64 virt-host enables (VIRTIO_*, EFI_STUB, NVMe).
  - `make disk-image-arm64` produces a UEFI-bootable raw GPT image with A/B system partitions and GRUB-EFI ARM64. Targets QEMU virt, Graviton, Ampere, or any UEFI ARM64 host.
  - `hack/dev-vm-arm64.sh --disk` boots the built image through QEMU UEFI for end-to-end testing.
  - `test/qemu/test-boot-arm64-disk.sh` automated boot smoke test.
- Bumped KubeSolo to v1.1.5 (was v1.1.0). New cloud-init flags surfaced:
  - `kubesolo.full` (v1.1.4+): disable edge-optimised overrides
  - `kubesolo.disable-ipv6` (v1.1.5+)
  - `kubesolo.db-wal-repair` (v1.1.5+): recover from unclean shutdowns
- Per-arch supply-chain verification: `KUBESOLO_SHA256_AMD64` and `KUBESOLO_SHA256_ARM64` in `versions.env`, applied to the tarball before extract.
- `docs/arm64-architecture.md`: defines the generic-vs-RPi two-track layout.
- `docs/arm64-status.md`: Phase 3 status snapshot, known limitations, what's needed to ship.
- `docs/ci-runners.md`: Gitea Actions runner setup (Odroid arm64-linux).
- Update agent state machine and observability (`update/pkg/state`):
  - Persistent on-disk `state.json` at `/var/lib/kubesolo/update/state.json` (atomic write via tmp + rename). Records Phase (Idle / Checking / Downloading / Staged / Activated / Verifying / Success / RolledBack / Failed), FromVersion, ToVersion, StartedAt, UpdatedAt, LastError, AttemptCount, HealthCheckFailures.
  - `apply`, `activate`, `healthcheck`, `rollback` all transition state explicitly on entry / exit / failure. Errors land in LastError so `status` can show why.
  - `kubesolo-update status --json` emits the full state for orchestration tooling. Human-readable mode adds an "Update Lifecycle" section when not idle.
  - New Prometheus metrics: `kubesolo_update_phase{phase="..."}` (all 9 phase labels always emitted), `kubesolo_update_attempts_total`, `kubesolo_update_last_attempt_timestamp_seconds`.
- Channels, maintenance windows, version policy (`update/pkg/config`):
  - `/etc/kubesolo/update.conf` (key=value, comments, missing-OK) configures server, channel, maintenance_window, pubkey, healthcheck_url, auto_rollback_after.
  - `cloud-init` top-level `updates:` block writes `update.conf` on first boot. Empty block leaves any existing file alone.
  - `apply` enforces four gates before download: maintenance window, channel match, runtime architecture match, min_compatible_version stepping-stone. All gate failures land in the state machine as Failed with a clear LastError. `--force` bypasses window + node-block-label.
  - `UpdateMetadata` JSON gains `channel`, `min_compatible_version`, `architecture` (all optional, omitempty).
- OCI registry distribution (`update/pkg/oci`, ~280 LOC, 9 tests):
  - `kubesolo-update apply --registry ghcr.io/<org>/kubesolo-os --tag stable` pulls update artifacts from any OCI-compliant registry. Multi-arch indexes resolve to the runtime.GOARCH-matching manifest automatically.
  - Custom media types: `application/vnd.kubesolo.os.kernel.v1+octet-stream` and `application/vnd.kubesolo.os.initramfs.v1+gzip`. Annotations: `io.kubesolo.os.{version,channel,architecture,min_compatible_version,release_notes,release_date}`.
  - End-to-end digest verification from manifest to blobs via oras-go/v2.
  - `build/scripts/push-oci-artifact.sh` publishes per-arch artifacts via `oras`. Multi-arch index composition documented inline.
  - Dependencies added (update module only): oras.land/oras-go/v2 and transitive opencontainers/{go-digest,image-spec} + golang.org/x/sync.
- Pre-flight gates and deeper healthcheck (`update/pkg/health` extended, `update/pkg/partition` extended):
  - Free-space pre-flight on the passive partition (image + 10% headroom) via `partition.FreeBytes` / `HasFreeSpaceFor`.
  - Node-block-label pre-flight: refuses if the local K8s node carries `updates.kubesolo.io/block=true`. Silently allowed when no kubeconfig (air-gap). Skipped by `--force`.
  - `CheckKubeSystemReady` waits until every kube-system pod has held Running for ≥ N seconds (configurable via `--kube-system-settle`).
  - `CheckProbeURL` GETs an operator-supplied URL; 200 = pass. Configurable via `--healthcheck-url` or `healthcheck_url=` in update.conf.
  - `CheckDiskWritable` writes / fsyncs / reads / deletes a probe file under `/var/lib/kubesolo` to catch a wedged data partition.
  - `--auto-rollback-after N` (also `auto_rollback_after=` in update.conf): after N consecutive post-activation healthcheck failures, the agent calls `ForceRollback()` and the operator/init reboots. Reset to 0 on a clean pass.
- `.gitea/workflows/build-arm64.yaml`: full ARM64 build on the Odroid self-hosted runner. Triggers on push to main, tags, and workflow_dispatch. Boot smoke test marked continue-on-error pending KVM or real-hardware validation.
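The `update.conf` format described above (key=value, comments, missing-OK) is simple enough to read from a shell script. The following is only a sketch of that format, not the Go parser in `update/pkg/config`: `conf_get` is a hypothetical helper, and it assumes full-line `#` comments and first-match-wins semantics, neither of which is confirmed by this changelog.

```sh
# Hypothetical reader for a key=value file such as /etc/kubesolo/update.conf.
conf_get() {
    key="$1"
    file="${2:-/etc/kubesolo/update.conf}"
    [ -f "$file" ] || return 0   # a missing file is not an error
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;   # skip blank lines and full-line comments
            "$key="*)
                printf '%s\n' "${line#*=}"   # text after the first '='
                return 0
                ;;
        esac
    done < "$file"
    return 0
}
```

For example, `channel=$(conf_get channel)` yields an empty string both when the key is absent and when the file does not exist, which mirrors the missing-OK behaviour.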
Changed
- `build/scripts/build-kernel-arm64.sh` is now the generic ARM64 kernel build (mainline kernel.org LTS, generic UEFI/virtio).
- Renamed `build/scripts/build-kernel-rpi.sh` (was `build-kernel-arm64.sh`). RPi kernel build (raspberrypi/linux fork, bcm2711_defconfig) lives here now.
- Renamed `build/config/kernel-container.fragment` (was `rpi-kernel-config.fragment`). The old name was a misnomer: contents are arch-agnostic and now shared across x86, ARM64-generic, and RPi kernels.
- `build/scripts/build-kernel.sh` (x86) refactored to consume the shared fragment via a generic `apply_fragment` function. ~50 lines of duplication killed.
- `KUBESOLO_VERSION` moved out of `fetch-components.sh` defaults into `versions.env`. Bumping is now a one-line PR.
Fixed
- Native ARM64 build hosts (e.g. an Odroid runner) no longer require the x86 cross-compiler. Both `build-kernel-arm64.sh` and `build-kernel-rpi.sh` detect `uname -m` and use the host's gcc directly when arch matches.
- ARM64 grub.cfg console ordering: `ttyAMA0` is now the primary console (`console=ttyS0,... console=ttyAMA0,...`). Init output is now visible on QEMU virt and most ARM64 SBCs without further configuration.
- ARM64 boot: replaced piCore64's `/init` with our staged init at `/init` and `/sbin/init`. Previously the kernel ran piCore's TCE handler which segfaulted in our environment.
- ARM64 boot: replaced piCore64's broken dynamic BusyBox with the build host's `busybox-static`. piCore's binary triggered EL0 instruction-abort panics on QEMU virt under both `-cpu cortex-a72` and `-cpu max`.
- POSIX-character-class portability: `tr -d '[:space:]'` in `30-kernel-modules.sh` and `40-sysctl.sh` replaced with explicit `' \t\r\n'`. Ubuntu's busybox-static 1.30.1 doesn't parse `[:space:]` and instead deletes the literal characters `[ : s p a c e ]`, which truncated module names (`virtio_net` → `virtio_nt`, etc.) and sysctl keys.
- `inject-kubesolo.sh` no longer copies `init/lib/functions.sh` into `init.d/`. Previously the main init loop tried to run it as a stage after stage 90 and panicked with "Init completed without exec'ing KubeSolo".
- ARM64 disk image: `TARGET_ARCH=arm64 create-disk-image.sh` produces `BOOTAA64.EFI` via `grub-mkimage -O arm64-efi` (not `bootx64.efi`). Skips the BIOS-only `grub-install --target=i386-pc` step.
- `build/Dockerfile.builder`: added `grub-efi-amd64-bin`, `grub-efi-arm64-bin`, `grub-pc-bin`, `grub-common`, `grub2-common`, and `busybox-static` so the Docker-based build flow can produce ARM64 disk images and gets the same BusyBox swap behaviour as native builds.
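The whitespace-stripping fix in the character-class entry above can be sketched as a small helper. `strip_ws` is a hypothetical name; the only thing taken from the changelog is the explicit `' \t\r\n'` set, which avoids depending on `[:space:]` support in the `tr` implementation.

```sh
# Hypothetical helper: delete spaces, tabs, carriage returns and newlines
# using an explicit character list instead of the [:space:] class.
strip_ws() {
    printf '%s' "$1" | tr -d ' \t\r\n'
}

# Example: a module name with stray whitespace and a CRLF line ending.
raw=$(printf '  virtio_net \r')
strip_ws "$raw"
echo
```

Because no letters appear in the delete set, a `tr` that treats its argument as plain characters cannot remove `s`, `p`, `a`, `c` or `e` from the input.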
Known limitations (deferred to follow-up)
- ARM64 LABEL= resolution doesn't work yet: piCore's `blkid` / `findfs` crash in QEMU and our static busybox lacks the applets. Hardcoded `/dev/vda4` as a workaround in `build/grub/grub-arm64.cfg`. Production fix: ship static `blkid` / `findfs` or replace LABEL resolution with a sysfs walk.
- AppArmor profile load fails on ARM64 (apparmor_parser ABI mismatch). Init reports it; boot continues without enforcement.
- OCI signature verification is deferred. The HTTP transport still honours `--pubkey` for `.sig` files; the OCI transport is digest-verified end-to-end via oras-go but does not yet consume cosign-style referrer attestations. Targeted for v0.3.1.
- Real-hardware validation of the generic ARM64 image is still pending. Builds and boots end-to-end under QEMU virt; production certification waits on a Graviton / Ampere run.
- QEMU TCG performance can trigger KubeSolo's first-boot image-import deadline. Not a defect in the OS itself; real hardware and KVM-accelerated QEMU complete the import in seconds.
[0.2.0] - 2026-02-12
Added
- Cloud-init: support all documented KubeSolo CLI flags (`--local-storage-shared-path`, `--debug`, `--pprof-server`, `--portainer-edge-id`, `--portainer-edge-key`, `--portainer-edge-async`)
- Cloud-init: `full-config.yaml` example showing all supported parameters
- Cloud-init: KubeSolo configuration reference table in docs/cloud-init.md
- Security hardening: mount hardening, sysctl, kernel module lock, AppArmor profiles
- ARM64 Raspberry Pi support with A/B boot via tryboot
- BootEnv abstraction for GRUB and RPi boot environments
- Go 1.25.5 installed on host for native builds
[0.1.0] - 2026-02-12
First release with all 5 design-doc phases complete. ISO boots and runs K8s pods.
Added
Custom Kernel
- Custom kernel build (6.18.2-tinycore64) with container-critical configs
- Added CONFIG_CGROUP_BPF, CONFIG_DEVTMPFS, CONFIG_DEVTMPFS_MOUNT, CONFIG_MEMCG, CONFIG_CFS_BANDWIDTH
- Stripped unnecessary subsystems (sound, GPU, wireless, Bluetooth, etc.)
- Selective kernel module install — only modules.list + transitive deps in initramfs
Init System (Phase 1)
- POSIX sh init system with staged boot (00-early-mount through 90-kubesolo)
- switch_root from initramfs to SquashFS root
- Persistent data partition mount with bind-mounts for K8s state
- Kernel module loading, sysctl tuning, network, hostname, NTP
- Emergency shell fallback on boot failure
- Device node creation via mknod fallback from sysfs
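The mknod fallback listed above relies on the kernel exposing each block device's `major:minor` pair in sysfs. A minimal sketch of that idea follows; `make_block_nodes` and the `MKNOD` override are hypothetical (the override exists only so the logic can be dry-run without root), and the real init stage may differ.

```sh
# Hypothetical sketch: create missing block device nodes from sysfs data.
#   $1 = sysfs directory to scan (default /sys/class/block)
#   $2 = target device directory  (default /dev)
make_block_nodes() {
    sys_root="${1:-/sys/class/block}"
    dev_dir="${2:-/dev}"
    for entry in "$sys_root"/*; do
        [ -r "$entry/dev" ] || continue        # no major:minor file, skip
        name="${entry##*/}"
        [ -e "$dev_dir/$name" ] && continue    # node already present
        IFS=: read -r major minor < "$entry/dev"
        # Set MKNOD=echo to print the commands instead of running them.
        ${MKNOD:-mknod} "$dev_dir/$name" b "$major" "$minor"
    done
}
```

Running it as `MKNOD=echo make_block_nodes` on a live system lists the nodes that would be created without touching `/dev`.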
Cloud-Init (Phase 2)
- Go-based cloud-init parser (~2.7 MB static binary)
- Network configuration: DHCP and static IP modes
- Hostname and machine-id generation
- KubeSolo configuration (node-name, extra flags)
- Portainer Edge Agent integration via K8s manifest injection
- Persistent config saved to /mnt/data/ for next-boot fast path
- 22 Go tests
A/B Atomic Updates (Phase 3)
- 4-partition GPT disk image: EFI + System A + System B + Data
- GRUB 2 bootloader with A/B slot selection and boot counter rollback
- Go update agent (~6.0 MB static binary) with check, apply, activate, rollback commands
- Health check: containerd + K8s API + node Ready verification
- Update server protocol: HTTP serving latest.json + image files
- K8s CronJob for automated update checks (every 6 hours)
- Zero external Go dependencies — uses kubectl/ctr exec commands
Production Hardening (Phase 4)
- Ed25519 image signing with pure Go stdlib (zero external deps)
- Key generation, signing, and verification CLI commands
- Portainer Edge Agent deployment via cloud-init
- SSH extension injection for debugging (hack/inject-ssh.sh)
- Boot time and resource usage benchmarks
- Deployment guide documentation
Distribution & Fleet Management (Phase 5)
- Gitea Actions CI/CD (test + build + shellcheck on push, release on tags)
- OCI container image packaging (scratch-based)
- Prometheus metrics endpoint (zero-dependency text exposition format)
- USB provisioning script with cloud-init injection
- ARM64 cross-compilation support
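The Prometheus text exposition format named in the list above needs no client library, which is what makes a zero-dependency endpoint practical. As an illustration, the sketch below prints the `kubesolo_update_phase` gauge from the 0.3.0 entry with all nine phase labels. The function name, the HELP text, the 1/0 value convention, and the exact label spelling are assumptions; the agent itself is written in Go.

```sh
# Hypothetical sketch: print one gauge sample per phase, 1 for the current one.
emit_phase_metric() {
    current="$1"
    echo '# HELP kubesolo_update_phase Update lifecycle phase (1 = current).'
    echo '# TYPE kubesolo_update_phase gauge'
    for phase in Idle Checking Downloading Staged Activated \
                 Verifying Success RolledBack Failed; do
        if [ "$phase" = "$current" ]; then
            value=1
        else
            value=0
        fi
        printf 'kubesolo_update_phase{phase="%s"} %s\n' "$phase" "$value"
    done
}
```

Emitting every label on each scrape, rather than only the active one, keeps the series set stable so dashboards do not see phases appear and disappear.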
Build System
- Makefile with full build orchestration
- Dockerized reproducible builds (build/Dockerfile.builder)
- Component fetching with version pinning
- ISO and raw disk image creation
- Fast rebuild path (`make quick`)
Documentation
- Architecture design document
- Boot flow reference
- A/B update flow reference
- Cloud-init configuration reference
- Deployment and operations guide
Fixed
- Replaced `grep -oP` with POSIX-safe `sed` in functions.sh (BusyBox compatibility)
- Replaced `grep -qiE` with `grep -qi -e` pattern (POSIX compliance)
- Fixed KVM flag handling in dev-vm.sh (bash array context)
- Added iptables table pre-initialization before kube-proxy start (nf_tables issue)
- Added /dev/kmsg and /etc/machine-id creation for kubelet
- Added CA certificates bundle to initramfs (containerd TLS verification for Docker Hub)
- Added DNS fallback (10.0.2.3 + 8.8.8.8) when DHCP client doesn't populate resolv.conf
- Added headless Service to Portainer Edge Agent manifest (agent peer discovery DNS)
- Added kubesolo.edge_id/edge_key kernel boot parameters for Portainer Edge
- Added auto-format of unformatted data disks on first boot
- Rewrote dev-vm.sh for macOS: bsdtar ISO extraction, Homebrew mkfs.ext4 detection, direct kernel boot, TCG acceleration, port 8080 forwarding
- Kubeconfig now served via HTTP on port 8080 (serial console truncates base64 lines)
- Added 127.0.0.1 and 10.0.2.15 to API server SANs for QEMU port forwarding
- dev-vm.sh now works on Linux: fallback ISO extraction via isoinfo or loop mount, KVM auto-detection, platform-aware error messages
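The `grep -oP` to `sed` replacement in the list above follows a common pattern: Perl-style matching is swapped for a basic-regex substitution that prints only the captured value. The sketch below shows one way to read a boot parameter such as `kubesolo.edge_id` that way. `cmdline_param` is a hypothetical helper rather than the code in functions.sh, and dots in the key are treated as regex wildcards, which is acceptable for a sketch.

```sh
# Hypothetical helper: print the value of key=value from a kernel command line.
#   $1 = key, $2 = command line (defaults to /proc/cmdline)
cmdline_param() {
    key="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    # The leading space padding lets " key=" also match the first parameter.
    printf ' %s\n' "$cmdline" | sed -n "s/.* $key=\([^ ]*\).*/\1/p"
}

cmdline_param kubesolo.edge_id 'console=ttyAMA0 kubesolo.edge_id=abc123 quiet'
```

When the key is absent, `sed -n` prints nothing, so callers can test the result with `[ -n "$value" ]`.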