From de10de0ef39b1229649443432aafb2d281635729 Mon Sep 17 00:00:00 2001 From: Adolfo Delorenzo Date: Thu, 14 May 2026 16:19:16 -0600 Subject: [PATCH] chore(arm64): clean up debug logging + document Phase 3 status MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remove [KSOLO-DBG] per-step echos from init.sh. The /dev/console redirect stays — it's load-bearing for early-boot visibility on QEMU virt. Add docs/arm64-status.md capturing the end-of-Phase-3 state: - What works (full boot through 14 stages, KubeSolo + containerd start) - Known limitations of the dev setup (QEMU TCG perf, /dev/vda4 hardcode, busybox-static gaps) - What's needed to ship v0.3 ARM64 as production-ready Real-hardware validation (Graviton, Ampere, or similar) is the next gating step before we can call ARM64 generic done. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/arm64-status.md | 125 +++++++++++++++++++++++++++++++++++++++++++ init/init.sh | 18 ++----- 2 files changed, 128 insertions(+), 15 deletions(-) create mode 100644 docs/arm64-status.md diff --git a/docs/arm64-status.md b/docs/arm64-status.md new file mode 100644 index 0000000..1f92823 --- /dev/null +++ b/docs/arm64-status.md @@ -0,0 +1,125 @@ +# ARM64 Generic Status (v0.3 in-progress) + +End-of-Phase-3 snapshot of the generic ARM64 build track. + +## What works + +End-to-end boot through QEMU on an Odroid (aarch64 Ubuntu 22.04 build host): + +1. `make kernel-arm64` produces a mainline 6.12.10 LTS kernel (44 MB Image, 868 + modules) +2. `make rootfs-arm64` extracts piCore64 userland, replaces BusyBox with + Ubuntu's static busybox-static, injects KubeSolo + Go agents + init scripts +3. `make disk-image-arm64` produces a UEFI-bootable 4 GB GPT image with GRUB + A/B slots +4. `hack/dev-vm-arm64.sh --disk` boots the image: + - UEFI firmware loads GRUB + - GRUB loads kernel + initramfs + - Custom init runs all 14 stages (early-mount, parse-cmdline, persistent-mount, + kernel-modules, apparmor, sysctl, cloud-init, network, hostname, clock, + containerd, security-lockdown, kubesolo) + - Data partition mounts (ext4 on vda4) + - Network configured (DHCP on virtio eth0) + - KubeSolo starts; containerd boots successfully; CoreDNS + pause images + register + +## Known limitations of the current dev setup + +These are debugging-environment issues, not production blockers: + +### 1. QEMU TCG performance hits KubeSolo's image-import deadline + +KubeSolo bundles its essential container images and imports them into +containerd on first boot. Under QEMU TCG (software emulation on the Odroid's +1.8 GB / 6-core ARM64), the import takes longer than KubeSolo's internal +deadline, so we see: + +``` +failed to import images: ... context deadline exceeded +shutdown requested before containerd was ready +``` + +On real ARM64 hardware (Graviton, Ampere, RPi 5, etc.) this import completes +in seconds. KVM acceleration on the Odroid would also fix it, but the +Odroid's vendor kernel (4.9.337-38) doesn't ship the KVM module — fixing that +requires a host-kernel upgrade outside this project's scope. + +### 2. Hardcoded `/dev/vda4` data partition path + +Stage 20 currently expects `kubesolo.data=/dev/vda4` rather than +`LABEL=KSOLODATA`. The LABEL= path is preferred (works regardless of disk +naming on different hosts), but resolution depends on `blkid` and `findfs`, +which: + +- piCore64 ships as dynamic util-linux binaries that crash in QEMU virt +- Ubuntu's `busybox-static` 1.30.1 doesn't include the applets + +Production fix options (deferred to next phase): + +- Build a more comprehensive static BusyBox (Alpine's, or upstream + custom config) +- Ship statically-linked `blkid` and `findfs` from util-linux +- Replace LABEL resolution with a sysfs walk that reads `/sys/class/block/*/holders` + and `/dev/` device numbers + +### 3. AppArmor profiles fail to load + +`apparmor_parser` errors on the containerd and kubelet profiles, probably +because the parser binary or libraries copied from the build host don't +match the rootfs's libc layout. Boot proceeds without AppArmor enforcement. +Same fix path as #2 (better static binaries). + +### 4. piCore64 BusyBox swap is a build-host dependency + +`inject-kubesolo.sh` replaces piCore's `/bin/busybox` with the build host's +`/bin/busybox` (Ubuntu's busybox-static package). That binary must exist on +the build host or in the builder Docker image. Documented; works in CI +because the Dockerfile installs busybox-static. + +A more reproducible approach (future work): ship a known-good ARM64 BusyBox +binary as a tracked artifact rather than depending on the host package. + +### 5. busybox-static 1.30.1 has its own bugs + +Even after the swap, some applets misbehave inside QEMU: + +- `modprobe` triggers "stack smashing detected" abort (kernel modules still + load via direct write to /sys/... in stage 30, so this isn't fatal) +- `tr` doesn't parse POSIX character classes like `[:space:]` — already + worked around by using explicit `' \t\r\n'` in our scripts +- Missing applets: `blkid`, `findfs`, `--version`, etc. + +These won't necessarily manifest on real hardware (different CPU, different +glibc interaction) but they confirm that 1.30.1 isn't the right long-term +BusyBox. + +## What's needed to ship v0.3 ARM64 as production-ready + +In order of priority: + +1. **Validate on real ARM64 hardware** — boot the image on a Graviton EC2 + instance, Ampere VPS, RPi 5 (when hardware available), or any UEFI-capable + ARM64 board. Confirm full KubeSolo bring-up: node Ready, pods schedule. +2. **Fix LABEL=KSOLODATA resolution** — see option list in #2 above. +3. **Replace busybox-static with a curated build** — see #4. +4. **Add a Gitea workflow** that runs `make kernel-arm64 + disk-image-arm64` + on the Odroid runner and the QEMU boot-test as a smoke test (with the + expectation that KubeSolo doesn't finish first-boot under TCG). + +## Files exercised by the Phase 3 work + +| Path | Status | +|------|--------| +| `build/scripts/build-kernel-arm64.sh` | New — mainline 6.12.10 kernel build, native or cross | +| `build/scripts/build-kernel-rpi.sh` | Renamed from old `build-kernel-arm64.sh` — RPi path | +| `build/config/kernel-container.fragment` | Renamed from `rpi-kernel-config.fragment` | +| `build/scripts/create-disk-image.sh` | Refactored — accepts `TARGET_ARCH=arm64` | +| `build/grub/grub-arm64.cfg` | New — ARM64 console + `init=/sbin/init` | +| `build/scripts/inject-kubesolo.sh` | Updated — BusyBox swap, `/init` install, variant routing | +| `init/init.sh` | Updated — output to `/dev/console` for early-boot visibility | +| `init/lib/30-kernel-modules.sh` | Fixed — `tr -d ' \t\r\n'` instead of `[:space:]` | +| `init/lib/40-sysctl.sh` | Same fix | +| `hack/dev-vm-arm64.sh` | Updated — `-cpu max`, UEFI `--disk` mode | +| `test/qemu/test-boot-arm64-disk.sh` | New — CI test for UEFI boot | +| `Makefile` | New targets: `kernel-arm64`, `kernel-rpi`, `disk-image-arm64`, `test-boot-arm64-disk`, `rootfs-arm64-rpi` | +| `build/config/versions.env` | Pinned `MAINLINE_KERNEL_VERSION=6.12.10`, `KUBESOLO_VERSION=v1.1.0` | +| `build/Dockerfile.builder` | Added `grub-efi-amd64-bin`, `grub-efi-arm64-bin`, `busybox-static` | diff --git a/init/init.sh b/init/init.sh index cd99091..ee96f51 100755 --- a/init/init.sh +++ b/init/init.sh @@ -14,11 +14,10 @@ # kubesolo.cloudinit= Path to cloud-init config # kubesolo.flags= Extra flags for KubeSolo binary -# Redirect ALL output to /dev/console for visibility during early boot. -# This is temporary v0.3 ARM64 debugging — revert when generic ARM64 is stable. +# Route early boot output to /dev/console — before switch_root the kernel may +# not have a controlling tty, and some stages echo to stderr expecting it to +# reach the serial console. This is a no-op once the staged init proper starts. exec >/dev/console 2>&1 -echo "[KSOLO-DBG] init.sh PID=$$ shell=$(readlink -f /proc/$$/exe 2>/dev/null || echo unknown)" -echo "[KSOLO-DBG] uname=$(uname -a 2>&1)" set -e @@ -28,22 +27,11 @@ set -e # set up container root filesystems. To fix this, we copy the rootfs to a # tmpfs and switch_root to it. The sentinel file prevents infinite loops. if [ ! -f /etc/.switched_root ]; then - echo "[KSOLO-DBG] entering switch_root block" - echo "[KSOLO-DBG] mount proc..." mount -t proc proc /proc 2>/dev/null || true - echo "[KSOLO-DBG] mount proc exit=$?" - echo "[KSOLO-DBG] mount sysfs..." mount -t sysfs sysfs /sys 2>/dev/null || true - echo "[KSOLO-DBG] mount sysfs exit=$?" - echo "[KSOLO-DBG] mount devtmpfs..." mount -t devtmpfs devtmpfs /dev 2>/dev/null || true - echo "[KSOLO-DBG] mount devtmpfs exit=$?" - echo "[KSOLO-DBG] mkdir /mnt/newroot..." mkdir -p /mnt/newroot - echo "[KSOLO-DBG] mkdir /mnt/newroot exit=$?" - echo "[KSOLO-DBG] mount tmpfs /mnt/newroot..." mount -t tmpfs -o size=400M,mode=755 tmpfs /mnt/newroot - echo "[KSOLO-DBG] mount tmpfs exit=$?" echo "[init] Copying rootfs to tmpfs..." >&2 # Copy each top-level directory explicitly (BusyBox cp -ax on rootfs is broken) for d in bin sbin usr lib lib64 etc var opt; do