kubesolo-os/docs/arm64-status.md
Adolfo Delorenzo de10de0ef3
chore(arm64): clean up debug logging + document Phase 3 status
Remove [KSOLO-DBG] per-step echos from init.sh. The /dev/console redirect
stays — it's load-bearing for early-boot visibility on QEMU virt.

Add docs/arm64-status.md capturing the end-of-Phase-3 state:
  - What works (full boot through 14 stages, KubeSolo + containerd start)
  - Known limitations of the dev setup (QEMU TCG perf, /dev/vda4 hardcode,
    busybox-static gaps)
  - What's needed to ship v0.3 ARM64 as production-ready

Real-hardware validation (Graviton, Ampere, or similar) is the next gating
step before we can call ARM64 generic done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:19:16 -06:00

ARM64 Generic Status (v0.3 in-progress)

End-of-Phase-3 snapshot of the generic ARM64 build track.

What works

End-to-end boot through QEMU on an Odroid (aarch64 Ubuntu 22.04 build host):

  1. make kernel-arm64 produces a mainline 6.12.10 LTS kernel (44 MB Image, 868 modules)
  2. make rootfs-arm64 extracts the piCore64 userland, replaces its BusyBox with Ubuntu's busybox-static binary, and injects KubeSolo, the Go agents, and the init scripts
  3. make disk-image-arm64 produces a UEFI-bootable 4 GB GPT image with GRUB A/B slots
  4. hack/dev-vm-arm64.sh --disk boots the image:
    • UEFI firmware loads GRUB
    • GRUB loads kernel + initramfs
    • Custom init runs all 14 stages (early-mount, parse-cmdline, persistent-mount, kernel-modules, apparmor, sysctl, cloud-init, network, hostname, clock, containerd, security-lockdown, kubesolo)
    • Data partition mounts (ext4 on vda4)
    • Network configured (DHCP on virtio eth0)
    • KubeSolo starts; containerd boots successfully; CoreDNS + pause images register
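
The four steps above can be chained in a small wrapper. This is only a sketch: the make targets and the hack/dev-vm-arm64.sh --disk invocation come from the list above, while the run and build_and_boot helpers and the DRY_RUN variable are hypothetical names introduced here for illustration.

```shell
# Sketch: run the ARM64 build steps in order, then boot the disk image.
# run, build_and_boot, and DRY_RUN are hypothetical; only the commands
# passed to run are taken from this document.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Print the command instead of executing it.
        echo "+ $*"
    else
        "$@"
    fi
}

build_and_boot() {
    run make kernel-arm64 &&
    run make rootfs-arm64 &&
    run make disk-image-arm64 &&
    run hack/dev-vm-arm64.sh --disk
}
```

Calling build_and_boot with DRY_RUN=1 set only prints the sequence, which is a cheap way to check the order before committing to a long kernel build.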

Known limitations of the current dev setup

These are debugging-environment issues, not production blockers:

1. QEMU TCG performance hits KubeSolo's image-import deadline

KubeSolo bundles its essential container images and imports them into containerd on first boot. Under QEMU TCG (software emulation on the Odroid's 1.8 GB / 6-core ARM64), the import takes longer than KubeSolo's internal deadline, so we see:

failed to import images: ... context deadline exceeded
shutdown requested before containerd was ready

On real ARM64 hardware (Graviton, Ampere, RPi 5, etc.) this import completes in seconds. KVM acceleration on the Odroid would also fix it, but the Odroid's vendor kernel (4.9.337-38) doesn't ship the KVM module — fixing that requires a host-kernel upgrade outside this project's scope.
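
One way to make a dev-VM launcher use KVM when the host offers it and fall back to TCG otherwise is sketched below. It is an assumption about how such a check could look, not what hack/dev-vm-arm64.sh currently does; pick_accel and its optional device-path argument are hypothetical. The QEMU options themselves (-accel kvm, -accel tcg, -cpu host, -cpu max) are standard.

```shell
# Sketch: choose QEMU acceleration flags based on whether KVM is usable.
# pick_accel and its argument are hypothetical names for illustration.
pick_accel() {
    kvm_node=${1:-/dev/kvm}
    # KVM needs a character device that the current user can read and write.
    if [ -c "$kvm_node" ] && [ -r "$kvm_node" ] && [ -w "$kvm_node" ]; then
        echo "-accel kvm -cpu host"
    else
        # Software emulation: slow enough that the first-boot image
        # import may exceed KubeSolo's deadline.
        echo "-accel tcg -cpu max"
    fi
}
```

On the Odroid's vendor kernel this would always take the TCG branch, since no KVM device node is present.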

2. Hardcoded /dev/vda4 data partition path

Stage 20 currently expects kubesolo.data=/dev/vda4 rather than LABEL=KSOLODATA. The LABEL= form is preferred because it works regardless of how disks are named on a given host, but resolving a label depends on blkid and findfs, and neither available source of those tools works here:

  • piCore64 ships them as dynamically linked util-linux binaries, which crash in QEMU virt
  • Ubuntu's busybox-static 1.30.1 does not include either applet

Production fix options (deferred to next phase):

  • Build a more comprehensive static BusyBox (Alpine's, or upstream + custom config)
  • Ship statically-linked blkid and findfs from util-linux
  • Replace LABEL resolution with a sysfs walk that reads /sys/class/block/*/holders and /dev/<n> device numbers
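
As a rough illustration of the third option, the sketch below enumerates block devices through /sys/class/block and reads each candidate's label straight from the ext4 superblock, using only dd and tr. This goes slightly beyond a pure sysfs walk, because sysfs itself does not expose filesystem labels. The sketch relies on the ext4 on-disk layout (superblock at byte 1024, 16-byte volume name at offset 0x78 within it, so byte 1144 of the partition). The function names and the optional directory arguments are hypothetical, and the code assumes device nodes exist under /dev with the same names as the sysfs entries.

```shell
# Sketch: resolve an ext4 volume label without blkid or findfs.
# read_ext4_label and find_by_label are hypothetical helpers.

# Print the ext4 volume name of a device or image file.
# 1144 = 1024 (superblock start) + 120 (s_volume_name offset).
# No superblock magic check is done; a non-ext4 device just yields junk
# that will not equal the wanted label.
read_ext4_label() {
    dd if="$1" bs=1 skip=1144 count=16 2>/dev/null | tr -d '\0'
}

# Print the first device whose label equals $1; return 1 if none matches.
# $2 and $3 override the sysfs and /dev directories (useful for testing).
find_by_label() {
    want=$1
    sys_dir=${2:-/sys/class/block}
    dev_dir=${3:-/dev}
    for entry in "$sys_dir"/*; do
        node=$dev_dir/${entry##*/}
        [ -r "$node" ] || continue
        if [ "$(read_ext4_label "$node")" = "$want" ]; then
            echo "$node"
            return 0
        fi
    done
    return 1
}
```

A stage script could then call find_by_label KSOLODATA and fall back to the explicit kubesolo.data= path if nothing is found.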

3. AppArmor profiles fail to load

apparmor_parser errors on the containerd and kubelet profiles, probably because the parser binary or libraries copied from the build host don't match the rootfs's libc layout. Boot proceeds without AppArmor enforcement. Same fix path as #2 (better static binaries).

4. piCore64 BusyBox swap is a build-host dependency

inject-kubesolo.sh replaces piCore's /bin/busybox with the build host's /bin/busybox (Ubuntu's busybox-static package). That binary must exist on the build host or in the builder Docker image. Documented; works in CI because the Dockerfile installs busybox-static.

A more reproducible approach (future work): ship a known-good ARM64 BusyBox binary as a tracked artifact rather than depending on the host package.
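
If a BusyBox binary were tracked as an artifact, the injection step could refuse to use it unless its checksum matches a pinned value. The following is a minimal sketch of that idea, not existing project code: verify_artifact is a hypothetical helper, and where the pinned hash would be stored is left open.

```shell
# Sketch: verify a tracked binary against a pinned SHA-256 before using it.
# verify_artifact is a hypothetical helper name.
verify_artifact() {
    file=$1
    expected=$2
    if [ ! -f "$file" ]; then
        echo "missing artifact: $file" >&2
        return 1
    fi
    # sha256sum prints "<hash>  <path>"; keep only the hash field.
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" != "$expected" ]; then
        echo "checksum mismatch for $file" >&2
        return 1
    fi
}
```

Failing the build on a mismatch keeps the rootfs contents independent of whichever busybox-static version the build host happens to have installed.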

5. busybox-static 1.30.1 has its own bugs

Even after the swap, some applets misbehave inside QEMU:

  • modprobe aborts with "stack smashing detected" (kernel modules still load via direct write to /sys/... in stage 30, so this isn't fatal)
  • tr doesn't parse POSIX character classes like [:space:]; our scripts already work around this by passing the explicit set ' \t\r\n'
  • blkid and findfs applets are missing, and --version is not supported

These won't necessarily manifest on real hardware (different CPU, different glibc interaction) but they confirm that 1.30.1 isn't the right long-term BusyBox.
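
The tr workaround can be illustrated as follows. The tr -d ' \t\r\n' call is the form used in our scripts; the strip_ws and list_modules helpers and the one-name-per-line list format with # comments are assumptions made for this sketch.

```shell
# Sketch: strip whitespace without relying on [:space:], which
# busybox-static 1.30.1's tr does not parse.
# strip_ws and list_modules are hypothetical helpers.
strip_ws() {
    printf '%s' "$1" | tr -d ' \t\r\n'
}

# Print the cleaned module names from a list file, skipping blank
# lines and lines that start with '#'.
list_modules() {
    # The "|| [ -n ... ]" clause handles a final line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        name=$(strip_ws "$line")
        case $name in
            '' | '#'*) continue ;;
        esac
        echo "$name"
    done < "$1"
}
```

Spelling out the characters keeps the behaviour identical across GNU tr and the BusyBox applet, including CRLF line endings in the list file.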

What's needed to ship v0.3 ARM64 as production-ready

In order of priority:

  1. Validate on real ARM64 hardware — boot the image on a Graviton EC2 instance, Ampere VPS, RPi 5 (when hardware available), or any UEFI-capable ARM64 board. Confirm full KubeSolo bring-up: node Ready, pods schedule.
  2. Fix LABEL=KSOLODATA resolution — see option list in #2 above.
  3. Replace busybox-static with a curated build — see #4 and #5.
  4. Add a Gitea workflow that runs make kernel-arm64 + disk-image-arm64 on the Odroid runner and the QEMU boot-test as a smoke test (with the expectation that KubeSolo doesn't finish first-boot under TCG).
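
The pass criterion for such a smoke test might look like the sketch below: require evidence that init reached the last stage, and treat the image-import deadline as an expected note rather than a failure. check_boot_log is a hypothetical helper, and the default marker string is a placeholder, since this document does not specify what init prints per stage. The "context deadline exceeded" text is taken from the error shown under limitation 1.

```shell
# Sketch: decide whether a captured serial console log counts as a pass.
# check_boot_log and the default marker are hypothetical.
check_boot_log() {
    log=$1
    marker=${2:-"stage: kubesolo"}   # placeholder; use the real init output
    if ! grep -q -F "$marker" "$log"; then
        echo "FAIL: marker '$marker' not found in $log" >&2
        return 1
    fi
    # Under TCG the first-boot image import is expected to time out.
    if grep -q -F "context deadline exceeded" "$log"; then
        echo "NOTE: image import hit its deadline (expected under TCG)"
    fi
    echo "PASS"
}
```

On real hardware or under KVM the same check could be tightened to fail when the deadline message appears.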

Files exercised by the Phase 3 work

| Path | Status |
| --- | --- |
| build/scripts/build-kernel-arm64.sh | New: mainline 6.12.10 kernel build, native or cross |
| build/scripts/build-kernel-rpi.sh | Renamed from old build-kernel-arm64.sh (RPi path) |
| build/config/kernel-container.fragment | Renamed from rpi-kernel-config.fragment |
| build/scripts/create-disk-image.sh | Refactored: accepts TARGET_ARCH=arm64 |
| build/grub/grub-arm64.cfg | New: ARM64 console + init=/sbin/init |
| build/scripts/inject-kubesolo.sh | Updated: BusyBox swap, /init install, variant routing |
| init/init.sh | Updated: output to /dev/console for early-boot visibility |
| init/lib/30-kernel-modules.sh | Fixed: tr -d ' \t\r\n' instead of [:space:] |
| init/lib/40-sysctl.sh | Same fix |
| hack/dev-vm-arm64.sh | Updated: -cpu max, UEFI --disk mode |
| test/qemu/test-boot-arm64-disk.sh | New: CI test for UEFI boot |
| Makefile | New targets: kernel-arm64, kernel-rpi, disk-image-arm64, test-boot-arm64-disk, rootfs-arm64-rpi |
| build/config/versions.env | Pinned MAINLINE_KERNEL_VERSION=6.12.10, KUBESOLO_VERSION=v1.1.0 |
| build/Dockerfile.builder | Added grub-efi-amd64-bin, grub-efi-arm64-bin, busybox-static |