chore(arm64): clean up debug logging + document Phase 3 status

Remove [KSOLO-DBG] per-step echoes from init.sh. The /dev/console redirect
stays; it's load-bearing for early-boot visibility on QEMU virt.

Add docs/arm64-status.md capturing the end-of-Phase-3 state:
- What works (full boot through 14 stages, KubeSolo + containerd start)
- Known limitations of the dev setup (QEMU TCG perf, /dev/vda4 hardcode,
  busybox-static gaps)
- What's needed to ship v0.3 ARM64 as production-ready

Real-hardware validation (Graviton, Ampere, or similar) is the next gating
step before we can call ARM64 generic done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

125
docs/arm64-status.md
Normal file
@@ -0,0 +1,125 @@
# ARM64 Generic Status (v0.3 in-progress)

End-of-Phase-3 snapshot of the generic ARM64 build track.

## What works

End-to-end boot through QEMU on an Odroid (aarch64 Ubuntu 22.04 build host):

1. `make kernel-arm64` produces a mainline 6.12.10 LTS kernel (44 MB Image, 868
   modules)
2. `make rootfs-arm64` extracts piCore64 userland, replaces BusyBox with
   Ubuntu's static busybox-static, injects KubeSolo + Go agents + init scripts
3. `make disk-image-arm64` produces a UEFI-bootable 4 GB GPT image with GRUB
   A/B slots
4. `hack/dev-vm-arm64.sh --disk` boots the image:
   - UEFI firmware loads GRUB
   - GRUB loads kernel + initramfs
   - Custom init runs all 14 stages (early-mount, parse-cmdline, persistent-mount,
     kernel-modules, apparmor, sysctl, cloud-init, network, hostname, clock,
     containerd, security-lockdown, kubesolo)
   - Data partition mounts (ext4 on vda4)
   - Network configured (DHCP on virtio eth0)
   - KubeSolo starts; containerd boots successfully; CoreDNS + pause images
     register

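
The four steps above can be chained into one run. The wrapper below is a minimal sketch rather than a script from the repository: the function name `build_and_boot_arm64` and the `DRY_RUN` switch are hypothetical, and only the make targets and the `hack/dev-vm-arm64.sh --disk` invocation are taken from the list.

```shell
# Hypothetical helper: run the documented steps in order and stop at the
# first failure. With DRY_RUN=1 it only prints the commands.
build_and_boot_arm64() {
    for step in \
        "make kernel-arm64" \
        "make rootfs-arm64" \
        "make disk-image-arm64" \
        "hack/dev-vm-arm64.sh --disk"
    do
        echo "+ $step"
        if [ "${DRY_RUN:-0}" != "1" ]; then
            # Intentional word splitting: each step is a plain command line.
            $step || return 1
        fi
    done
}
```

Calling `DRY_RUN=1 build_and_boot_arm64` from the repository root lists the sequence without building anything.
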
## Known limitations of the current dev setup

These are debugging-environment issues, not production blockers:

### 1. QEMU TCG performance hits KubeSolo's image-import deadline

KubeSolo bundles its essential container images and imports them into
containerd on first boot. Under QEMU TCG (software emulation on the Odroid's
1.8 GB / 6-core ARM64), the import takes longer than KubeSolo's internal
deadline, so we see:

```
failed to import images: ... context deadline exceeded
shutdown requested before containerd was ready
```

On real ARM64 hardware (Graviton, Ampere, RPi 5, etc.) this import should
complete in seconds. KVM acceleration on the Odroid would also fix it, but the
Odroid's vendor kernel (4.9.337-38) doesn't ship the KVM module; fixing that
requires a host-kernel upgrade outside this project's scope.

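
Whether a host can avoid TCG comes down to whether it exposes a usable KVM device. The following is a minimal sketch of that check, assuming the conventional `/dev/kvm` node; the function name `pick_qemu_accel` is hypothetical, and this is not necessarily how `hack/dev-vm-arm64.sh` chooses its flags.

```shell
# Pick QEMU acceleration flags based on whether a usable KVM device exists.
# $1: path to the KVM device node (defaults to /dev/kvm).
pick_qemu_accel() {
    kvm_dev=${1:-/dev/kvm}
    if [ -c "$kvm_dev" ] && [ -r "$kvm_dev" ] && [ -w "$kvm_dev" ]; then
        # Hardware virtualization: the guest runs at near-native speed.
        echo "-accel kvm -cpu host"
    else
        # Software emulation: works on any host, but is slow.
        echo "-accel tcg -cpu max"
    fi
}
```

The result can be spliced into a command line, for example `qemu-system-aarch64 -machine virt $(pick_qemu_accel) ...`. On the Odroid described here the function would fall through to the TCG branch, since the vendor kernel provides no KVM module.
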
### 2. Hardcoded `/dev/vda4` data partition path

Stage 20 currently expects `kubesolo.data=/dev/vda4` rather than
`LABEL=KSOLODATA`. The `LABEL=` form is preferred (it works regardless of disk
naming on different hosts), but resolving it depends on `blkid` and `findfs`,
neither of which is usable in the current rootfs:

- piCore64 ships them as dynamically linked util-linux binaries that crash in
  QEMU virt
- Ubuntu's `busybox-static` 1.30.1 doesn't include the applets

Production fix options (deferred to next phase):

- Build a more comprehensive static BusyBox (Alpine's, or upstream + custom config)
- Ship statically-linked `blkid` and `findfs` from util-linux
- Replace LABEL resolution with a sysfs walk that reads `/sys/class/block/*/holders`
  and `/dev/<n>` device numbers

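
To make the two forms of `kubesolo.data=` concrete, here is a minimal sketch of parsing the parameter and resolving it, assuming a working `findfs` is on the PATH for the `LABEL=` case. Both function names are hypothetical; the actual stage 20 code is not shown in this document and may differ.

```shell
# Print the value of key=value from a kernel command line string.
# $1: key (e.g. kubesolo.data), $2: command line text.
get_cmdline_value() {
    key=$1
    cmdline=$2
    set -f                      # no globbing while the string is word-split
    for tok in $cmdline; do
        case $tok in
            "$key"=*)
                set +f
                printf '%s\n' "${tok#"$key"=}"
                return 0
                ;;
        esac
    done
    set +f
    return 1
}

# Turn a device spec into a device path. LABEL=... needs findfs; anything
# else is assumed to already be a path such as /dev/vda4.
resolve_data_device() {
    spec=$1
    case $spec in
        LABEL=*)
            if command -v findfs >/dev/null 2>&1; then
                findfs "$spec"
            else
                echo "findfs not available; cannot resolve $spec" >&2
                return 1
            fi
            ;;
        *)
            printf '%s\n' "$spec"
            ;;
    esac
}
```

In an init stage this might be used as `resolve_data_device "$(get_cmdline_value kubesolo.data "$(cat /proc/cmdline)")"`. With the current rootfs only the plain-path branch can succeed, which is why the hardcoded device is still in place.
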
### 3. AppArmor profiles fail to load

`apparmor_parser` errors on the containerd and kubelet profiles, probably
because the parser binary or libraries copied from the build host don't
match the rootfs's libc layout. Boot proceeds without AppArmor enforcement.
Same fix path as #2 (better static binaries).

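
The "boot proceeds without enforcement" behaviour amounts to treating profile loading as best-effort. A minimal sketch of that pattern follows; the function name, the `APPARMOR_PARSER` override, and the warning text are all hypothetical, and the real apparmor stage may be structured differently. Only `apparmor_parser -r` (replace a profile) is the tool's actual interface.

```shell
# Try to load every profile in a directory; never abort the boot on failure.
# APPARMOR_PARSER can be overridden (handy for testing); the default is the
# real apparmor_parser binary. "-r" replaces an already-loaded profile.
load_apparmor_profiles() {
    profile_dir=$1
    parser=${APPARMOR_PARSER:-apparmor_parser}
    loaded=0
    failed=0
    for profile in "$profile_dir"/*; do
        [ -f "$profile" ] || continue
        if "$parser" -r "$profile" >/dev/null 2>&1; then
            loaded=$((loaded + 1))
        else
            failed=$((failed + 1))
            echo "[init] WARNING: could not load AppArmor profile $profile" >&2
        fi
    done
    echo "loaded=$loaded failed=$failed"
    # Always succeed: enforcement is best-effort in this sketch.
    return 0
}
```

Printing the counts makes a silent loss of enforcement visible in the boot log, which matters if this fallback ever reaches a production image.
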
### 4. piCore64 BusyBox swap is a build-host dependency

`inject-kubesolo.sh` replaces piCore's `/bin/busybox` with the build host's
`/bin/busybox` (Ubuntu's busybox-static package). That binary must exist on
the build host or in the builder Docker image. The requirement is documented,
and it holds in CI because the Dockerfile installs busybox-static.

A more reproducible approach (future work): ship a known-good ARM64 BusyBox
binary as a tracked artifact rather than depending on the host package.

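
If a tracked BusyBox artifact is adopted, the build can pin it by digest. The helper below is a minimal sketch of that check using `sha256sum`; the function name is hypothetical and nothing like it exists in `inject-kubesolo.sh` as far as this document shows.

```shell
# Verify a file against an expected SHA-256 digest before installing it.
# $1: path to the binary, $2: expected digest (lowercase hex).
verify_sha256() {
    file=$1
    expected=$2
    actual=$(sha256sum "$file" | cut -d' ' -f1) || return 1
    [ "$actual" = "$expected" ]
}
```

A build step could then call `verify_sha256 "$BUSYBOX_BIN" "$BUSYBOX_SHA256"` (both variable names are hypothetical) and refuse to copy the binary into the rootfs when the check fails.
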
### 5. busybox-static 1.30.1 has its own bugs

Even after the swap, some applets misbehave inside QEMU:

- `modprobe` aborts with "stack smashing detected" (kernel modules still
  load via direct write to /sys/... in stage 30, so this isn't fatal)
- `tr` doesn't parse POSIX character classes like `[:space:]`; this is already
  worked around by using explicit `' \t\r\n'` in our scripts
- Missing applets and options: `blkid`, `findfs`, `--version`, etc.

These won't necessarily manifest on real hardware (different CPU, different
glibc interaction), but they confirm that 1.30.1 isn't the right long-term
BusyBox.

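
The `tr` workaround can be wrapped in a small helper so the explicit character set lives in one place. This is a minimal sketch; the function name `strip_whitespace` is hypothetical, while the `tr -d ' \t\r\n'` call is the form the scripts already use.

```shell
# Strip whitespace from stdin without relying on POSIX character classes.
# The set is spelled out as a space plus backslash escapes.
strip_whitespace() {
    tr -d ' \t\r\n'
}
```

For example, `printf ' overlay \r\n' | strip_whitespace` prints `overlay`. The explicit set is slightly narrower than `[:space:]`, which also covers vertical tab and form feed, so input containing those characters would need them added.
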
## What's needed to ship v0.3 ARM64 as production-ready

In order of priority:

1. **Validate on real ARM64 hardware**: boot the image on a Graviton EC2
   instance, Ampere VPS, RPi 5 (when hardware is available), or any UEFI-capable
   ARM64 board. Confirm full KubeSolo bring-up: node Ready, pods schedule.
2. **Fix LABEL=KSOLODATA resolution**: see the option list in #2 above.
3. **Replace busybox-static with a curated build**: see #4 and #5 above.
4. **Add a Gitea workflow** that runs `make kernel-arm64` and
   `make disk-image-arm64` on the Odroid runner, plus the QEMU boot-test as a
   smoke test (with the expectation that KubeSolo doesn't finish first-boot
   under TCG).

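
For the smoke test in item 4, the pass criterion has to tolerate the TCG image-import timeout. One way to express that is to classify the captured serial log. The sketch below assumes the log has been saved to a file (for instance via QEMU's `-serial file:...` option); the function name and the three result labels are hypothetical, while the matched strings are the messages quoted elsewhere in this document.

```shell
# Classify a captured serial log from a QEMU boot.
# Prints one of: booted, tcg-timeout, no-boot.
classify_boot_log() {
    log=$1
    if ! grep -q 'Copying rootfs to tmpfs' "$log"; then
        echo "no-boot"        # init never reached the switch_root copy
        return 1
    fi
    if grep -q 'context deadline exceeded' "$log"; then
        echo "tcg-timeout"    # expected under TCG: image import ran out of time
        return 0
    fi
    echo "booted"
    return 0
}
```

A workflow step would fail only on `no-boot`. Note that `booted` here means no more than "init started and no import timeout was logged"; confirming a Ready node remains a job for the real-hardware validation in item 1.
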
## Files exercised by the Phase 3 work

| Path | Status |
|------|--------|
| `build/scripts/build-kernel-arm64.sh` | New: mainline 6.12.10 kernel build, native or cross |
| `build/scripts/build-kernel-rpi.sh` | Renamed from old `build-kernel-arm64.sh`: RPi path |
| `build/config/kernel-container.fragment` | Renamed from `rpi-kernel-config.fragment` |
| `build/scripts/create-disk-image.sh` | Refactored: accepts `TARGET_ARCH=arm64` |
| `build/grub/grub-arm64.cfg` | New: ARM64 console + `init=/sbin/init` |
| `build/scripts/inject-kubesolo.sh` | Updated: BusyBox swap, `/init` install, variant routing |
| `init/init.sh` | Updated: output to `/dev/console` for early-boot visibility |
| `init/lib/30-kernel-modules.sh` | Fixed: `tr -d ' \t\r\n'` instead of `[:space:]` |
| `init/lib/40-sysctl.sh` | Same fix |
| `hack/dev-vm-arm64.sh` | Updated: `-cpu max`, UEFI `--disk` mode |
| `test/qemu/test-boot-arm64-disk.sh` | New: CI test for UEFI boot |
| `Makefile` | New targets: `kernel-arm64`, `kernel-rpi`, `disk-image-arm64`, `test-boot-arm64-disk`, `rootfs-arm64-rpi` |
| `build/config/versions.env` | Pinned `MAINLINE_KERNEL_VERSION=6.12.10`, `KUBESOLO_VERSION=v1.1.0` |
| `build/Dockerfile.builder` | Added `grub-efi-amd64-bin`, `grub-efi-arm64-bin`, `busybox-static` |

18
init/init.sh
@@ -14,11 +14,10 @@
 # kubesolo.cloudinit=<path> Path to cloud-init config
 # kubesolo.flags=<flags> Extra flags for KubeSolo binary

-# Redirect ALL output to /dev/console for visibility during early boot.
-# This is temporary v0.3 ARM64 debugging — revert when generic ARM64 is stable.
+# Route early boot output to /dev/console — before switch_root the kernel may
+# not have a controlling tty, and some stages echo to stderr expecting it to
+# reach the serial console. This is a no-op once the staged init proper starts.
 exec >/dev/console 2>&1
-echo "[KSOLO-DBG] init.sh PID=$$ shell=$(readlink -f /proc/$$/exe 2>/dev/null || echo unknown)"
-echo "[KSOLO-DBG] uname=$(uname -a 2>&1)"

 set -e

@@ -28,22 +27,11 @@ set -e
 # set up container root filesystems. To fix this, we copy the rootfs to a
 # tmpfs and switch_root to it. The sentinel file prevents infinite loops.
 if [ ! -f /etc/.switched_root ]; then
-    echo "[KSOLO-DBG] entering switch_root block"
-    echo "[KSOLO-DBG] mount proc..."
     mount -t proc proc /proc 2>/dev/null || true
-    echo "[KSOLO-DBG] mount proc exit=$?"
-    echo "[KSOLO-DBG] mount sysfs..."
     mount -t sysfs sysfs /sys 2>/dev/null || true
-    echo "[KSOLO-DBG] mount sysfs exit=$?"
-    echo "[KSOLO-DBG] mount devtmpfs..."
     mount -t devtmpfs devtmpfs /dev 2>/dev/null || true
-    echo "[KSOLO-DBG] mount devtmpfs exit=$?"
-    echo "[KSOLO-DBG] mkdir /mnt/newroot..."
     mkdir -p /mnt/newroot
-    echo "[KSOLO-DBG] mkdir /mnt/newroot exit=$?"
-    echo "[KSOLO-DBG] mount tmpfs /mnt/newroot..."
     mount -t tmpfs -o size=400M,mode=755 tmpfs /mnt/newroot
-    echo "[KSOLO-DBG] mount tmpfs exit=$?"
     echo "[init] Copying rootfs to tmpfs..." >&2
     # Copy each top-level directory explicitly (BusyBox cp -ax on rootfs is broken)
     for d in bin sbin usr lib lib64 etc var opt; do