kubesolo-os/CHANGELOG.md
Adolfo Delorenzo 81b29fd237
release: v0.3.1
VERSION 0.3.0 -> 0.3.1. Append CHANGELOG entry covering the eight fix
commits since v0.3.0 (dual-glibc, nft binary, NF_TABLES_IPV4 family,
NFT_NUMGEN expressions, modules.list parser, banner+motd, port 8080
hostfwd, and the release.yaml workflow rewrite).

End-to-end validated on Apple Silicon Mac under QEMU virt + HVF:
  - kubectl get nodes -> kubesolo-XXXXXX  Ready
  - kube-system/coredns                   1/1 Running
  - local-path-storage/local-path-prov    1/1 Running
  - default/nginx-test (user workload)    1/1 Running (pulled+started 11s)

Tagging this release is also the first real exercise of the rewritten
release.yaml workflow. If it works as designed, the v0.3.1 release page
should populate automatically with: x86 ISO + .img.xz, ARM64 .arm64.img.xz,
Go binaries (cloudinit + update, amd64 + arm64), and SHA256SUMS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 16:29:06 -06:00

# Changelog
All notable changes to KubeSolo OS are documented in this file.
Format based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.3.1] - 2026-05-15
First fully-functional generic ARM64 release. v0.3.0 shipped the build
scaffold; v0.3.1 makes it actually boot a Kubernetes cluster end-to-end
on QEMU virt under HVF acceleration. Validated by deploying CoreDNS,
local-path-provisioner, and an `nginx:alpine` workload. All three reach
Running, and `kubectl get nodes` reports `Ready`.
### Fixed
- **Dual-glibc loading on ARM64** — piCore64's `/lib/libc.so.6` and the
build host's `/lib/$LIB_ARCH/libc.so.6` could both be resolved into the
same process by the dynamic linker, triggering
`*** stack smashing detected ***` aborts when stack frames crossed
between functions linked against different libcs. Fix: bundle the full
glibc family (libc + libpthread + libdl + libm + libresolv + librt +
libanl + libgcc_s + ld.so), delete piCore's duplicates in `/lib/`,
write `/etc/ld.so.conf`, and run `ldconfig -r` so the runtime linker has
a deterministic search order. (`76ed2ff`)
- **`nft` binary not bundled** — KubeSolo v1.1.4+ runs `nft add table ip
kubesolo-masq` for pod-masquerade setup, but `inject-kubesolo.sh` only
bundled `xtables-nft-multi`. Without a standalone `nft` in `$PATH`,
KubeSolo exited with a fatal error at startup. Fix: copy `/usr/sbin/nft`
and its library dependencies (libnftables, libedit, libjansson, libgmp,
libtinfo, libbsd, libmd) into the rootfs. (`51c1f78`)
- **nftables address-family handlers** — `nf_tables` core was loaded but
no address families were registered, so `nft add table ip ...`
returned `EOPNOTSUPP`. The bool Kconfigs `CONFIG_NF_TABLES_IPV4`,
`CONFIG_NF_TABLES_IPV6`, `CONFIG_NF_TABLES_INET`, and
`CONFIG_NF_TABLES_NETDEV` are required but were missing from the
fragment. Fix: add them to `kernel-container.fragment` as `=y`. (`7e46f8f`)
- **kube-proxy nftables-backend expression modules** — Kubernetes 1.34's
kube-proxy nft backend uses the `numgen`, `hash`, `limit`, and `log`
expressions. The corresponding kernel modules (`CONFIG_NFT_NUMGEN`,
etc.) were missing from both the fragment and the runtime module list, so
even after a kernel rebuild, stage 30 didn't load them, and stage 85's
`kernel.modules_disabled=1` lockdown prevented on-demand loads. Fix:
add them to both `kernel-container.fragment` (as `=m`) and
`modules.list` / `modules-arm64.list`. (`31eee77`, `3bcf2e1`)
- **`modules.list` inline-comment parser bug** — the inject script's
comment-strip only matched lines starting with `#`, not lines with
inline `# comment` tails. So `nft_numgen # foo` was passed
verbatim to modprobe, resolved to nothing, and the .ko never made it
into the initramfs. Fix: parse with `mod="${mod%%#*}"` to strip
inline tails. (`bc3300e`)
- **Banner only printed on kubeconfig success** —
`90-kubesolo.sh` gated the host-access banner behind
`if [ -f $KUBECONFIG_PATH ]`. When KubeSolo crashed early (as with the
missing `nft` binary above) or the wait loop timed out, the user never saw
the connection instructions. Fix: write the banner to `/etc/motd` and print
it unconditionally after the wait loop. (`51c1f78`)
- **`dev-vm-arm64.sh` missing port-8080 hostfwd** — the in-VM HTTP
server that serves the kubeconfig listens on port 8080, but the
QEMU `-net user` line only forwarded 6443 and 2222, so
`curl http://localhost:8080` from the host machine connected to
nothing. Fix: add the third hostfwd. (`fbe2d0b`)
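
The `modules.list` parser fix above can be illustrated with a minimal sketch. This is not the code from `inject-kubesolo.sh`: the function name `parse_module_list` is hypothetical, and only the `${mod%%#*}` expansion is taken from the fix itself. Whitespace is stripped with an explicit character list rather than a POSIX character class, on the assumption that the parser should stay portable across BusyBox `tr` builds.

```shell
#!/bin/sh
# Hypothetical helper: read a modules.list-style stream on stdin and print
# one clean module name per line.
parse_module_list() {
    while IFS= read -r mod || [ -n "$mod" ]; do
        mod="${mod%%#*}"                               # drop full-line and inline "# comment" tails
        mod="$(printf '%s' "$mod" | tr -d ' \t\r\n')"  # strip whitespace with an explicit list
        [ -n "$mod" ] || continue                      # skip blank and comment-only lines
        printf '%s\n' "$mod"
    done
}

# Example input: a comment line, an entry with an inline comment, a blank
# line, and a plain entry.
printf '%s\n' '# netfilter' 'nft_numgen   # foo' '' 'virtio_net' | parse_module_list
```

With this approach `nft_numgen # foo` yields `nft_numgen`, so the name passed on to module resolution no longer carries the comment tail.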
### Fixed (CI)
- **`release.yaml` workflow** rewritten so that v0.3.1+ tag pushes
auto-publish a complete release page on Gitea. Three changes:
`actions/upload-artifact` is pinned to `@v3` for act_runner compatibility; the
`softprops/action-gh-release@v2` step is replaced with a direct `curl`
against `/api/v1/repos/.../releases` (`softprops` hard-codes
`api.github.com`, so it silently no-ops on Gitea); and a new
`build-disk-arm64` job builds on the `arm64-linux` runner.
v0.3.0's manual-upload-only release was the canary that exposed all
three bugs. (`f8c308d`)
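
A direct `curl` call against the Gitea releases endpoint, as described above, might look roughly like the sketch below. It is not the actual workflow step: the function names, the base URL, and the owner/repository arguments are all hypothetical placeholders. The sketch assumes Gitea's `POST /api/v1/repos/{owner}/{repo}/releases` endpoint with token authentication, and a tag that needs no JSON escaping.

```shell
#!/bin/sh
# All function and argument names here are illustrative, not taken from release.yaml.

# $1 = Gitea base URL, $2 = owner, $3 = repository
release_endpoint() {
    printf '%s/api/v1/repos/%s/%s/releases\n' "${1%/}" "$2" "$3"
}

# $1 = tag; assumes the tag contains no characters that need JSON escaping
release_payload() {
    printf '{"tag_name":"%s","name":"%s","draft":false,"prerelease":false}\n' "$1" "$1"
}

# $1 = base URL, $2 = owner, $3 = repository, $4 = tag, $5 = API token
create_release() {
    curl --fail --silent --show-error \
        -X POST \
        -H "Authorization: token $5" \
        -H "Content-Type: application/json" \
        -d "$(release_payload "$4")" \
        "$(release_endpoint "$1" "$2" "$3")"
}
```

Using `--fail` makes a non-2xx response exit non-zero, so a failed publish fails the job instead of passing silently.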
### Known issues carried forward to v0.3.2
These don't block normal operation but are tracked:
- `xt_comment` userspace extension load fails on the iptables-nft path,
causing kubelet's KUBE-FIREWALL rule install to skip. Reported as
`Couldn't load match 'comment'` in the boot log. kubelet continues
without the localhost-drop rule.
- `containerd-shim-runc-v2 -info` probe reports `runc: executable file
not found in $PATH`. Cosmetic — containerd uses the absolute path
from its config when actually launching containers.
- `kube-proxy conntrack cleanup` logs `Failed to list conntrack entries:
invalid argument` every cleanup cycle. Probably needs
`CONFIG_NF_CONNTRACK_PROCFS` or netlink-glue tweaks.
- Several pods restart 12 times on first boot due to a PLEG /
runtime-probe race in the kubelet startup path. Pods stabilise.
## [0.3.0] - 2026-05-14
The main themes: generic ARM64 (not just Raspberry Pi), an honest update
lifecycle with state file + metrics, OCI multi-arch distribution via ghcr.io,
and policy gates (channels, maintenance windows, version stepping-stones,
pre-flight checks, auto-rollback).
### Added
- Generic ARM64 build track distinct from Raspberry Pi:
  - `make kernel-arm64` builds a mainline kernel.org LTS kernel (6.12.10 by
    default) from `arm64 defconfig` + shared `kernel-container.fragment` +
    arm64 virt-host enables (VIRTIO_*, EFI_STUB, NVMe).
  - `make disk-image-arm64` produces a UEFI-bootable raw GPT image with A/B
    system partitions and GRUB-EFI ARM64. Targets QEMU virt, Graviton, Ampere,
    or any UEFI ARM64 host.
  - `hack/dev-vm-arm64.sh --disk` boots the built image through QEMU UEFI for
    end-to-end testing.
  - `test/qemu/test-boot-arm64-disk.sh` automated boot smoke test.
- Bumped KubeSolo to v1.1.5 (was v1.1.0). New cloud-init flags surfaced:
  - `kubesolo.full` (v1.1.4+) — disable edge-optimised overrides
  - `kubesolo.disable-ipv6` (v1.1.5+)
  - `kubesolo.db-wal-repair` (v1.1.5+) — recover from unclean shutdowns
- Per-arch supply-chain verification: `KUBESOLO_SHA256_AMD64` and
`KUBESOLO_SHA256_ARM64` in `versions.env`, applied to the tarball before
extract.
- `docs/arm64-architecture.md` — defines the generic-vs-RPi two-track layout.
- `docs/arm64-status.md` — Phase 3 status snapshot, known limitations, what's
needed to ship.
- `docs/ci-runners.md` — Gitea Actions runner setup (Odroid arm64-linux).
- Update agent state machine and observability (`update/pkg/state`):
  - Persistent on-disk `state.json` at `/var/lib/kubesolo/update/state.json`
    (atomic write via tmp + rename). Records Phase (Idle / Checking /
    Downloading / Staged / Activated / Verifying / Success / RolledBack /
    Failed), FromVersion, ToVersion, StartedAt, UpdatedAt, LastError,
    AttemptCount, HealthCheckFailures.
  - `apply`, `activate`, `healthcheck`, `rollback` all transition state
    explicitly on entry / exit / failure. Errors land in LastError so
    `status` can show why.
  - `kubesolo-update status --json` emits the full state for
    orchestration tooling. Human-readable mode adds an "Update Lifecycle"
    section when not idle.
  - New Prometheus metrics: `kubesolo_update_phase{phase="..."}` (all 9
    phase labels always emitted), `kubesolo_update_attempts_total`,
    `kubesolo_update_last_attempt_timestamp_seconds`.
- Channels, maintenance windows, version policy (`update/pkg/config`):
  - `/etc/kubesolo/update.conf` (key=value, comments, missing-OK) configures
    server, channel, maintenance_window, pubkey, healthcheck_url,
    auto_rollback_after.
  - `cloud-init` top-level `updates:` block writes `update.conf` on first
    boot. Empty block leaves any existing file alone.
  - `apply` enforces four gates before download: maintenance window,
    channel match, runtime architecture match, min_compatible_version
    stepping-stone. All gate failures land in the state machine as Failed
    with a clear LastError. `--force` bypasses window + node-block-label.
  - `UpdateMetadata` JSON gains `channel`, `min_compatible_version`,
    `architecture` (all optional, omitempty).
- OCI registry distribution (`update/pkg/oci`, ~280 LOC, 9 tests):
  - `kubesolo-update apply --registry ghcr.io/<org>/kubesolo-os --tag stable`
    pulls update artifacts from any OCI-compliant registry. Multi-arch
    indexes resolve to the runtime.GOARCH-matching manifest automatically.
  - Custom media types: `application/vnd.kubesolo.os.kernel.v1+octet-stream`
    and `application/vnd.kubesolo.os.initramfs.v1+gzip`. Annotations:
    `io.kubesolo.os.{version,channel,architecture,min_compatible_version,
    release_notes,release_date}`.
  - End-to-end digest verification from manifest to blobs via oras-go/v2.
  - `build/scripts/push-oci-artifact.sh` publishes per-arch artifacts via
    `oras`. Multi-arch index composition documented inline.
  - Dependencies added (update module only): oras.land/oras-go/v2 and
    transitive opencontainers/{go-digest,image-spec} + golang.org/x/sync.
- Pre-flight gates and deeper healthcheck (`update/pkg/health` extended,
`update/pkg/partition` extended):
  - Free-space pre-flight on the passive partition (image + 10% headroom)
    via `partition.FreeBytes` / `HasFreeSpaceFor`.
  - Node-block-label pre-flight: refuses if the local K8s node carries
    `updates.kubesolo.io/block=true`. Silently allowed when no kubeconfig
    (air-gap). Skipped by `--force`.
  - `CheckKubeSystemReady` waits until every kube-system pod has held
    Running for ≥ N seconds (configurable via
    `--kube-system-settle`).
  - `CheckProbeURL` GETs an operator-supplied URL; 200 = pass. Configurable
    via `--healthcheck-url` or `healthcheck_url=` in update.conf.
  - `CheckDiskWritable` writes / fsyncs / reads / deletes a probe file
    under `/var/lib/kubesolo` to catch a wedged data partition.
  - `--auto-rollback-after N` (also `auto_rollback_after=` in update.conf):
    after N consecutive post-activation healthcheck failures, the agent
    calls `ForceRollback()` and the operator/init reboots. Reset to 0 on
    a clean pass.
- `.gitea/workflows/build-arm64.yaml` — full ARM64 build on the Odroid
self-hosted runner. Triggers on push to main, tags, and workflow_dispatch.
Boot smoke test marked continue-on-error pending KVM or real-hardware
validation.
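
The "atomic write via tmp + rename" used for `state.json` above is a general pattern, sketched here in shell for illustration. The update agent itself is written in Go, so this is not its implementation; the helper name `write_atomic` is hypothetical, and the plain `sync` stands in for a per-file fsync.

```shell
#!/bin/sh
# Hypothetical helper: write_atomic PATH, with the new contents on stdin.
write_atomic() {
    target="$1"
    tmp="${target}.tmp.$$"   # same directory, so the rename stays on one filesystem

    # Write the full new contents to the temporary file first.
    cat > "$tmp" || { rm -f "$tmp"; return 1; }

    # Flush to disk before the rename (coarse; a Go implementation would fsync the file).
    sync

    # rename(2) replaces the target in one step: readers see either the old
    # file or the new one, never a half-written file.
    mv -f "$tmp" "$target" || { rm -f "$tmp"; return 1; }
}
```

For example, `printf '%s\n' "$new_state" | write_atomic /var/lib/kubesolo/update/state.json` would replace the state file in one step, where `new_state` is a placeholder for the serialized state.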
### Changed
- `build/scripts/build-kernel-arm64.sh` is now the **generic ARM64** kernel
build (mainline kernel.org LTS, generic UEFI/virtio).
- Renamed `build/scripts/build-kernel-rpi.sh` (was `build-kernel-arm64.sh`).
RPi kernel build (raspberrypi/linux fork, bcm2711_defconfig) lives here now.
- Renamed `build/config/kernel-container.fragment` (was
`rpi-kernel-config.fragment`). The old name was a misnomer: the contents are
arch-agnostic and are now shared across x86, ARM64-generic, and RPi kernels.
- `build/scripts/build-kernel.sh` (x86) refactored to consume the shared
fragment via a generic `apply_fragment` function, removing ~50 lines of
duplication.
- `KUBESOLO_VERSION` moved out of `fetch-components.sh` defaults into
`versions.env`. Bumping is now a one-line PR.
### Fixed
- Native ARM64 build hosts (e.g. an Odroid runner) no longer require the x86
cross-compiler. Both `build-kernel-arm64.sh` and `build-kernel-rpi.sh` detect
`uname -m` and use the host's gcc directly when arch matches.
- ARM64 grub.cfg console ordering: `ttyAMA0` is now listed last
(`console=ttyS0,... console=ttyAMA0,...`), which makes it the primary console.
Init output is now visible on QEMU virt and most ARM64 SBCs without further
configuration.
- ARM64 boot: replaced piCore64's `/init` with our staged init at `/init` and
`/sbin/init`. Previously the kernel ran piCore's TCE handler which
segfaulted in our environment.
- ARM64 boot: replaced piCore64's broken dynamic BusyBox with the build
host's `busybox-static`. piCore's binary triggered EL0 instruction-abort
panics on QEMU virt under both `-cpu cortex-a72` and `-cpu max`.
- POSIX-character-class portability: `tr -d '[:space:]'` in
`30-kernel-modules.sh` and `40-sysctl.sh` replaced with explicit
`' \t\r\n'`. Ubuntu's busybox-static 1.30.1 doesn't parse `[:space:]` and
instead deletes the literal characters `[ : s p a c e ]`, which truncated
module names (`virtio_net` → `virtio_nt`, etc.) and sysctl keys.
- `inject-kubesolo.sh` no longer copies `init/lib/functions.sh` into
`init.d/`. Previously the main init loop tried to run it as a stage after
stage 90 and panicked with "Init completed without exec'ing KubeSolo".
- ARM64 disk image: `TARGET_ARCH=arm64 create-disk-image.sh` produces
`BOOTAA64.EFI` via `grub-mkimage -O arm64-efi` (not `bootx64.efi`). Skips
the BIOS-only `grub-install --target=i386-pc` step.
- `build/Dockerfile.builder`: added `grub-efi-amd64-bin`, `grub-efi-arm64-bin`,
`grub-pc-bin`, `grub-common`, `grub2-common`, and `busybox-static` so the
Docker-based build flow can produce ARM64 disk images and gets the same
BusyBox swap behaviour as native builds.
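
The native-versus-cross decision described in the first item of this list can be sketched as follows. The helper name `arm64_cross_prefix` is hypothetical, and `aarch64-linux-gnu-` is an assumed toolchain prefix (the usual Debian/Ubuntu name); the real build scripts may use different names.

```shell
#!/bin/sh
# Hypothetical helper: print the CROSS_COMPILE prefix for an ARM64 kernel build.
# $1 = host machine string (defaults to the output of `uname -m`)
arm64_cross_prefix() {
    host="${1:-$(uname -m)}"
    case "$host" in
        aarch64|arm64)
            printf ''                     # native ARM64 host: use the host's gcc directly
            ;;
        *)
            printf 'aarch64-linux-gnu-'   # any other host: assumed cross-toolchain prefix
            ;;
    esac
}

CROSS_COMPILE="$(arm64_cross_prefix)"
echo "CROSS_COMPILE='${CROSS_COMPILE}'"
```

The result could then be handed to the kernel build in the usual way, for example `make ARCH=arm64 CROSS_COMPILE="$CROSS_COMPILE"`.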
### Known limitations (deferred to follow-up)
- **ARM64 LABEL= resolution** doesn't work yet — piCore's `blkid`/`findfs`
crash in QEMU and our static busybox lacks the applets. Hardcoded
`/dev/vda4` as a workaround in `build/grub/grub-arm64.cfg`. Production
fix: ship static `blkid`/`findfs` or replace LABEL resolution with a
sysfs walk.
- **AppArmor profile load fails on ARM64** (apparmor_parser ABI mismatch).
Init reports it; boot continues without enforcement.
- **OCI signature verification** is deferred. The HTTP transport still
honours `--pubkey` for `.sig` files; the OCI transport is digest-verified
end-to-end via oras-go but does not yet consume cosign-style referrer
attestations. Targeted for v0.3.1.
- **Real-hardware validation** of the generic ARM64 image is still
pending. Builds and boots end-to-end under QEMU virt; production
certification waits on a Graviton / Ampere run.
- **QEMU TCG performance** can trigger KubeSolo's first-boot image-import
deadline. Not a defect in the OS itself; real hardware and KVM-accelerated
QEMU complete the import in seconds.
## [0.2.0] - 2026-02-12
### Added
- Cloud-init: support all documented KubeSolo CLI flags (`--local-storage-shared-path`, `--debug`, `--pprof-server`, `--portainer-edge-id`, `--portainer-edge-key`, `--portainer-edge-async`)
- Cloud-init: `full-config.yaml` example showing all supported parameters
- Cloud-init: KubeSolo configuration reference table in docs/cloud-init.md
- Security hardening: mount hardening, sysctl, kernel module lock, AppArmor profiles
- ARM64 Raspberry Pi support with A/B boot via tryboot
- BootEnv abstraction for GRUB and RPi boot environments
- Go 1.25.5 installed on host for native builds
## [0.1.0] - 2026-02-12
First release with all 5 design-doc phases complete. ISO boots and runs K8s pods.
### Added
#### Custom Kernel
- Custom kernel build (6.18.2-tinycore64) with container-critical configs
- Added CONFIG_CGROUP_BPF, CONFIG_DEVTMPFS, CONFIG_DEVTMPFS_MOUNT, CONFIG_MEMCG, CONFIG_CFS_BANDWIDTH
- Stripped unnecessary subsystems (sound, GPU, wireless, Bluetooth, etc.)
- Selective kernel module install — only modules.list + transitive deps in initramfs
#### Init System (Phase 1)
- POSIX sh init system with staged boot (00-early-mount through 90-kubesolo)
- switch_root from initramfs to SquashFS root
- Persistent data partition mount with bind-mounts for K8s state
- Kernel module loading, sysctl tuning, network, hostname, NTP
- Emergency shell fallback on boot failure
- Device node creation via mknod fallback from sysfs
#### Cloud-Init (Phase 2)
- Go-based cloud-init parser (~2.7 MB static binary)
- Network configuration: DHCP and static IP modes
- Hostname and machine-id generation
- KubeSolo configuration (node-name, extra flags)
- Portainer Edge Agent integration via K8s manifest injection
- Persistent config saved to /mnt/data/ for next-boot fast path
- 22 Go tests
#### A/B Atomic Updates (Phase 3)
- 4-partition GPT disk image: EFI + System A + System B + Data
- GRUB 2 bootloader with A/B slot selection and boot counter rollback
- Go update agent (~6.0 MB static binary) with check, apply, activate, rollback commands
- Health check: containerd + K8s API + node Ready verification
- Update server protocol: HTTP serving latest.json + image files
- K8s CronJob for automated update checks (every 6 hours)
- Zero external Go dependencies — uses kubectl/ctr exec commands
#### Production Hardening (Phase 4)
- Ed25519 image signing with pure Go stdlib (zero external deps)
- Key generation, signing, and verification CLI commands
- Portainer Edge Agent deployment via cloud-init
- SSH extension injection for debugging (hack/inject-ssh.sh)
- Boot time and resource usage benchmarks
- Deployment guide documentation
#### Distribution & Fleet Management (Phase 5)
- Gitea Actions CI/CD (test + build + shellcheck on push, release on tags)
- OCI container image packaging (scratch-based)
- Prometheus metrics endpoint (zero-dependency text exposition format)
- USB provisioning script with cloud-init injection
- ARM64 cross-compilation support
#### Build System
- Makefile with full build orchestration
- Dockerized reproducible builds (build/Dockerfile.builder)
- Component fetching with version pinning
- ISO and raw disk image creation
- Fast rebuild path (`make quick`)
#### Documentation
- Architecture design document
- Boot flow reference
- A/B update flow reference
- Cloud-init configuration reference
- Deployment and operations guide
### Fixed
- Replaced `grep -oP` with POSIX-safe `sed` in functions.sh (BusyBox compatibility)
- Replaced `grep -qiE` with `grep -qi -e` pattern (POSIX compliance)
- Fixed KVM flag handling in dev-vm.sh (bash array context)
- Added iptables table pre-initialization before kube-proxy start (nf_tables issue)
- Added /dev/kmsg and /etc/machine-id creation for kubelet
- Added CA certificates bundle to initramfs (containerd TLS verification for Docker Hub)
- Added DNS fallback (10.0.2.3 + 8.8.8.8) when DHCP client doesn't populate resolv.conf
- Added headless Service to Portainer Edge Agent manifest (agent peer discovery DNS)
- Added kubesolo.edge_id/edge_key kernel boot parameters for Portainer Edge
- Added auto-format of unformatted data disks on first boot
- Rewrote dev-vm.sh for macOS: bsdtar ISO extraction, Homebrew mkfs.ext4 detection, direct kernel boot, TCG acceleration, port 8080 forwarding
- Kubeconfig now served via HTTP on port 8080 (serial console truncates base64 lines)
- Added 127.0.0.1 and 10.0.2.15 to API server SANs for QEMU port forwarding
- dev-vm.sh now works on Linux: fallback ISO extraction via isoinfo or loop mount, KVM auto-detection, platform-aware error messages
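
As an illustration of the `grep -oP` to `sed` replacement in the first item of this list, here is a minimal sketch of a POSIX-safe `key=value` extraction from a kernel command line. What `functions.sh` actually extracts is not stated here, so the use case, the helper name `cmdline_value`, and the sample values are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a kernel command line string.
# $1 = key (e.g. kubesolo.edge_id), $2 = command line text
cmdline_value() {
    key="$1"
    cmdline="$2"
    # Escape regex metacharacters in the key so "." matches only a literal dot.
    esc="$(printf '%s' "$key" | sed 's|[][\.*^$/]|\\&|g')"
    # One word per line, then print what follows "key=" on the first match.
    printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n "s/^${esc}=//p" | head -n 1
}

# Example with a made-up command line; prints abc123.
cmdline_value kubesolo.edge_id 'console=ttyS0 kubesolo.edge_id=abc123 quiet'
```

This relies only on basic regular expressions plus `tr` and `head`, so it avoids the Perl-regex mode that BusyBox `grep` lacks.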