chore(arm64): clean up debug logging + document Phase 3 status

Remove [KSOLO-DBG] per-step echos from init.sh. The /dev/console redirect
stays — it's load-bearing for early-boot visibility on QEMU virt.

Add docs/arm64-status.md capturing the end-of-Phase-3 state:
  - What works (full boot through 14 stages, KubeSolo + containerd start)
  - Known limitations of the dev setup (QEMU TCG perf, /dev/vda4 hardcode,
    busybox-static gaps)
  - What's needed to ship v0.3 ARM64 as production-ready

Real-hardware validation (Graviton, Ampere, or similar) is the next gating
step before we can call the generic ARM64 track done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/arm64-status.md

@@ -0,0 +1,125 @@
# ARM64 Generic Status (v0.3 in-progress)
End-of-Phase-3 snapshot of the generic ARM64 build track.
## What works
End-to-end boot through QEMU on an Odroid (aarch64 Ubuntu 22.04 build host):
1. `make kernel-arm64` produces a mainline 6.12.10 LTS kernel (44 MB Image, 868
modules)
2. `make rootfs-arm64` extracts piCore64 userland, replaces BusyBox with
Ubuntu's static busybox-static, injects KubeSolo + Go agents + init scripts
3. `make disk-image-arm64` produces a UEFI-bootable 4 GB GPT image with GRUB
A/B slots
4. `hack/dev-vm-arm64.sh --disk` boots the image:
- UEFI firmware loads GRUB
- GRUB loads kernel + initramfs
- Custom init runs all 14 stages (early-mount, parse-cmdline, persistent-mount,
kernel-modules, apparmor, sysctl, cloud-init, network, hostname, clock,
containerd, security-lockdown, kubesolo)
- Data partition mounts (ext4 on vda4)
- Network configured (DHCP on virtio eth0)
- KubeSolo starts; containerd boots successfully; CoreDNS + pause images
register
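The whole sequence is reproducible from a clean checkout with the targets
described above:

```sh
# Build kernel, userland, and UEFI disk image, then boot the result in QEMU.
make kernel-arm64
make rootfs-arm64
make disk-image-arm64
hack/dev-vm-arm64.sh --disk
```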
## Known limitations of the current dev setup
These are debugging-environment issues, not production blockers:
### 1. QEMU TCG performance hits KubeSolo's image-import deadline
KubeSolo bundles its essential container images and imports them into
containerd on first boot. Under QEMU TCG (software emulation on the Odroid's
1.8 GB / 6-core ARM64), the import takes longer than KubeSolo's internal
deadline, so we see:
```
failed to import images: ... context deadline exceeded
shutdown requested before containerd was ready
```
On real ARM64 hardware (Graviton, Ampere, RPi 5, etc.) this import completes
in seconds. KVM acceleration on the Odroid would also fix it, but the
Odroid's vendor kernel (4.9.337-38) doesn't ship the KVM module — fixing that
requires a host-kernel upgrade outside this project's scope.
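On dev hosts that do expose KVM, one cheap mitigation is to probe for it in
`hack/dev-vm-arm64.sh` rather than always falling back to TCG. A sketch (the
`ACCEL_ARGS` variable and argument layout are assumptions, not the script's
current structure):

```sh
# Prefer hardware virtualization when the host kernel exposes /dev/kvm;
# otherwise use multi-threaded TCG with -cpu max, as the script does today.
if [ -c /dev/kvm ]; then
    ACCEL_ARGS="-accel kvm -cpu host"
else
    ACCEL_ARGS="-accel tcg,thread=multi -cpu max"
fi
exec qemu-system-aarch64 -M virt $ACCEL_ARGS "$@"   # remaining args elided
```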
### 2. Hardcoded `/dev/vda4` data partition path
Stage 20 currently expects `kubesolo.data=/dev/vda4` rather than
`LABEL=KSOLODATA`. The `LABEL=` path is preferred (it works regardless of disk
naming across hosts), but resolving it depends on `blkid` and `findfs`, neither
of which is usable here:
- piCore64 ships them only as dynamically linked util-linux binaries, which
  crash in QEMU virt
- Ubuntu's `busybox-static` 1.30.1 doesn't include them as applets
Production fix options (deferred to the next phase):
- Build a more comprehensive static BusyBox (Alpine's, or upstream with a custom config)
- Ship statically linked `blkid` and `findfs` from util-linux
- Replace LABEL resolution with a sysfs walk that reads `/sys/class/block/*/holders`
  and `/dev/<n>` device numbers (a minimal sketch follows this list)
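A minimal sketch of that last option, with the twist that it reads the ext4
volume label straight out of each candidate's superblock (s_volume_name, 16
bytes at offset 1024 + 120) instead of relying on holders, using only `dd` and
`tr` applets that BusyBox does provide. The function name and fallback are
illustrative:

```sh
# Resolve LABEL=KSOLODATA without blkid/findfs: walk sysfs block devices and
# read the 16-byte ext4 volume label from each candidate's superblock.
find_by_label() {
    want="$1"
    for sysdev in /sys/class/block/*; do
        name=${sysdev##*/}
        [ -b "/dev/$name" ] || continue
        label=$(dd if="/dev/$name" bs=1 skip=1144 count=16 2>/dev/null | tr -d '\000')
        [ "$label" = "$want" ] && { printf '/dev/%s\n' "$name"; return 0; }
    done
    return 1
}
DATA_DEV=$(find_by_label KSOLODATA) || DATA_DEV=/dev/vda4   # today's hardcode as fallback
```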
### 3. AppArmor profiles fail to load
`apparmor_parser` errors on the containerd and kubelet profiles, probably
because the parser binary or libraries copied from the build host don't
match the rootfs's libc layout. Boot proceeds without AppArmor enforcement.
Same fix path as #2 (better static binaries).
### 4. piCore64 BusyBox swap is a build-host dependency
`inject-kubesolo.sh` replaces piCore's `/bin/busybox` with the build host's
`/bin/busybox` (Ubuntu's busybox-static package). That binary must exist on
the build host or in the builder Docker image. This is documented, and it
works in CI because the Dockerfile installs `busybox-static`.
A more reproducible approach (future work): ship a known-good ARM64 BusyBox
binary as a tracked artifact rather than depending on the host package.
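In the meantime, a cheap guard in `inject-kubesolo.sh` would catch the worst
failure mode (a dynamically linked host BusyBox ending up in the initramfs,
where no loader exists). Variable names here are illustrative:

```sh
# Refuse to inject a BusyBox that needs a dynamic loader -- it would die on
# the very first exec inside the initramfs.
BB=/bin/busybox                     # host binary used today (busybox-static)
if ! file "$BB" | grep -q 'statically linked'; then
    echo "error: $BB is not statically linked; install busybox-static" >&2
    exit 1
fi
install -m 0755 "$BB" "$ROOTFS_DIR/bin/busybox"   # ROOTFS_DIR: illustrative
```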
### 5. busybox-static 1.30.1 has its own bugs
Even after the swap, some applets misbehave inside QEMU:
- `modprobe` aborts with "stack smashing detected" (kernel modules still load
  via direct writes under /sys/... in stage 30, so this isn't fatal)
- `tr` doesn't parse POSIX character classes like `[:space:]`; already worked
  around by spelling out `' \t\r\n'` in our scripts (example after this list)
- Missing functionality: no `blkid` or `findfs` applets, no `--version` flag, etc.
These won't necessarily manifest on real hardware (different CPU, different
glibc interaction) but they confirm that 1.30.1 isn't the right long-term
BusyBox.
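The `tr` workaround referenced above, as applied in `30-kernel-modules.sh`
and `40-sysctl.sh` (the variable and value here are illustrative):

```sh
# busybox-static 1.30.1 misparses tr -d '[:space:]', so spell the whitespace
# set out explicitly; tr still interprets \t, \r, and \n escapes correctly.
mod=$(echo "  nf_conntrack  " | tr -d ' \t\r\n')   # -> "nf_conntrack"
```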
## What's needed to ship v0.3 ARM64 as production-ready
In order of priority:
1. **Validate on real ARM64 hardware** — boot the image on a Graviton EC2
   instance, an Ampere VPS, an RPi 5 (once hardware is available), or any
   UEFI-capable ARM64 board. Confirm full KubeSolo bring-up: node Ready,
   pods schedule (see the check sketched after this list).
2. **Fix LABEL=KSOLODATA resolution** — see option list in #2 above.
3. **Replace busybox-static with a curated build** — see #4.
4. **Add a Gitea workflow** that runs `make kernel-arm64 disk-image-arm64`
   on the Odroid runner, plus the QEMU boot test as a smoke test (with the
   expectation that KubeSolo won't finish first boot under TCG).
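For step 1, a minimal acceptance check once the image is up on real hardware
might look like this (the kubeconfig path is an assumption; use wherever
KubeSolo actually writes it):

```sh
export KUBECONFIG=/var/lib/kubesolo/kubeconfig    # assumed path
kubectl get nodes -o wide                         # expect STATUS=Ready
kubectl run smoke --image=registry.k8s.io/pause:3.9 --restart=Never
kubectl wait pod/smoke --for=condition=Ready --timeout=120s
kubectl delete pod smoke
```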
## Files exercised by the Phase 3 work
| Path | Status |
|------|--------|
| `build/scripts/build-kernel-arm64.sh` | New — mainline 6.12.10 kernel build, native or cross |
| `build/scripts/build-kernel-rpi.sh` | Renamed from old `build-kernel-arm64.sh` — RPi path |
| `build/config/kernel-container.fragment` | Renamed from `rpi-kernel-config.fragment` |
| `build/scripts/create-disk-image.sh` | Refactored — accepts `TARGET_ARCH=arm64` |
| `build/grub/grub-arm64.cfg` | New — ARM64 console + `init=/sbin/init` |
| `build/scripts/inject-kubesolo.sh` | Updated — BusyBox swap, `/init` install, variant routing |
| `init/init.sh` | Updated — output to `/dev/console` for early-boot visibility |
| `init/lib/30-kernel-modules.sh` | Fixed — `tr -d ' \t\r\n'` instead of `[:space:]` |
| `init/lib/40-sysctl.sh` | Same fix |
| `hack/dev-vm-arm64.sh` | Updated — `-cpu max`, UEFI `--disk` mode |
| `test/qemu/test-boot-arm64-disk.sh` | New — CI test for UEFI boot |
| `Makefile` | New targets: `kernel-arm64`, `kernel-rpi`, `disk-image-arm64`, `test-boot-arm64-disk`, `rootfs-arm64-rpi` |
| `build/config/versions.env` | Pinned `MAINLINE_KERNEL_VERSION=6.12.10`, `KUBESOLO_VERSION=v1.1.0` |
| `build/Dockerfile.builder` | Added `grub-efi-amd64-bin`, `grub-efi-arm64-bin`, `busybox-static` |

init/init.sh

@@ -14,11 +14,10 @@
 # kubesolo.cloudinit=<path>   Path to cloud-init config
 # kubesolo.flags=<flags>      Extra flags for KubeSolo binary
-# Redirect ALL output to /dev/console for visibility during early boot.
-# This is temporary v0.3 ARM64 debugging — revert when generic ARM64 is stable.
+# Route early boot output to /dev/console — before switch_root the kernel may
+# not have a controlling tty, and some stages echo to stderr expecting it to
+# reach the serial console. This is a no-op once the staged init proper starts.
 exec >/dev/console 2>&1
-echo "[KSOLO-DBG] init.sh PID=$$ shell=$(readlink -f /proc/$$/exe 2>/dev/null || echo unknown)"
-echo "[KSOLO-DBG] uname=$(uname -a 2>&1)"
 set -e
@@ -28,22 +27,11 @@ set -e
 # set up container root filesystems. To fix this, we copy the rootfs to a
 # tmpfs and switch_root to it. The sentinel file prevents infinite loops.
 if [ ! -f /etc/.switched_root ]; then
-    echo "[KSOLO-DBG] entering switch_root block"
-    echo "[KSOLO-DBG] mount proc..."
     mount -t proc proc /proc 2>/dev/null || true
-    echo "[KSOLO-DBG] mount proc exit=$?"
-    echo "[KSOLO-DBG] mount sysfs..."
     mount -t sysfs sysfs /sys 2>/dev/null || true
-    echo "[KSOLO-DBG] mount sysfs exit=$?"
-    echo "[KSOLO-DBG] mount devtmpfs..."
     mount -t devtmpfs devtmpfs /dev 2>/dev/null || true
-    echo "[KSOLO-DBG] mount devtmpfs exit=$?"
-    echo "[KSOLO-DBG] mkdir /mnt/newroot..."
     mkdir -p /mnt/newroot
-    echo "[KSOLO-DBG] mkdir /mnt/newroot exit=$?"
-    echo "[KSOLO-DBG] mount tmpfs /mnt/newroot..."
     mount -t tmpfs -o size=400M,mode=755 tmpfs /mnt/newroot
-    echo "[KSOLO-DBG] mount tmpfs exit=$?"
     echo "[init] Copying rootfs to tmpfs..." >&2
     # Copy each top-level directory explicitly (BusyBox cp -ax on rootfs is broken)
     for d in bin sbin usr lib lib64 etc var opt; do