chore(arm64): clean up debug logging + document Phase 3 status

Remove [KSOLO-DBG] per-step echoes from init.sh. The /dev/console redirect
stays: it is load-bearing for early-boot visibility on QEMU virt.

Add docs/arm64-status.md capturing the end-of-Phase-3 state:
- What works (full boot through 14 stages, KubeSolo + containerd start)
- Known limitations of the dev setup (QEMU TCG perf, /dev/vda4 hardcode,
  busybox-static gaps)
- What's needed to ship v0.3 ARM64 as production-ready

Real-hardware validation (Graviton, Ampere, or similar) is the next gating
step before we can call ARM64 generic done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
125
docs/arm64-status.md
Normal file
@@ -0,0 +1,125 @@
# ARM64 Generic Status (v0.3 in-progress)

End-of-Phase-3 snapshot of the generic ARM64 build track.

## What works

End-to-end boot through QEMU on an Odroid (aarch64 Ubuntu 22.04 build host):

1. `make kernel-arm64` produces a mainline 6.12.10 LTS kernel (44 MB Image, 868
   modules)
2. `make rootfs-arm64` extracts piCore64 userland, replaces BusyBox with
   Ubuntu's `busybox-static`, injects KubeSolo + Go agents + init scripts
3. `make disk-image-arm64` produces a UEFI-bootable 4 GB GPT image with GRUB
   A/B slots
4. `hack/dev-vm-arm64.sh --disk` boots the image:
   - UEFI firmware loads GRUB
   - GRUB loads kernel + initramfs
   - Custom init runs all 14 stages (early-mount, parse-cmdline, persistent-mount,
     kernel-modules, apparmor, sysctl, cloud-init, network, hostname, clock,
     containerd, security-lockdown, kubesolo)
   - Data partition mounts (ext4 on vda4)
   - Network configured (DHCP on virtio eth0)
   - KubeSolo starts; containerd boots successfully; CoreDNS + pause images
     register

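The four steps above could be chained in a small wrapper. This is only a sketch: the `run` helper, the `DRY_RUN` variable, and the function name are hypothetical and not part of the repository. It assumes the script is sourced from the repository root and that the Makefile targets do not already depend on each other.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Build the ARM64 artifacts in the documented order, then boot the image.
# The && chain stops at the first step that fails.
build_and_boot_arm64() {
    run make kernel-arm64 &&
    run make rootfs-arm64 &&
    run make disk-image-arm64 &&
    run hack/dev-vm-arm64.sh --disk
}
```

Calling `DRY_RUN=1 build_and_boot_arm64` prints the four commands without running them, which is a cheap way to check the order before a long kernel build.
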
## Known limitations of the current dev setup

These are debugging-environment issues, not production blockers:

### 1. QEMU TCG performance hits KubeSolo's image-import deadline

KubeSolo bundles its essential container images and imports them into
containerd on first boot. Under QEMU TCG (software emulation on the Odroid's
1.8 GB / 6-core ARM64), the import takes longer than KubeSolo's internal
deadline, so we see:

```
failed to import images: ... context deadline exceeded
shutdown requested before containerd was ready
```

On real ARM64 hardware (Graviton, Ampere, RPi 5, etc.) this import should
complete in seconds; that still has to be confirmed by the hardware validation
listed below. KVM acceleration on the Odroid would also fix it, but the
Odroid's vendor kernel (4.9.337-38) doesn't ship the KVM module, and fixing
that requires a host-kernel upgrade outside this project's scope.

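A dev script could choose the accelerator at run time, so the same invocation uses KVM on hosts that expose it and falls back to TCG elsewhere. The sketch below is an assumption, not what `hack/dev-vm-arm64.sh` currently does; `pick_accel` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the QEMU accelerator to use: "kvm" when the KVM device node is
# readable and writable by the current user, otherwise "tcg".
# The device path is a parameter so the check can be exercised with a stand-in.
pick_accel() {
    kvm_dev=${1:-/dev/kvm}
    if [ -r "$kvm_dev" ] && [ -w "$kvm_dev" ]; then
        echo kvm
    else
        echo tcg
    fi
}
```

The result can be passed to QEMU's standard `-accel` option, for example `-accel "$(pick_accel)"`, next to whatever machine and CPU options the script already sets. On the Odroid described above this would print `tcg`, since its vendor kernel has no KVM module.
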
### 2. Hardcoded `/dev/vda4` data partition path

Stage 20 currently expects `kubesolo.data=/dev/vda4` rather than
`LABEL=KSOLODATA`. The LABEL= path is preferred (works regardless of disk
naming on different hosts), but resolution depends on `blkid` and `findfs`,
which:

- piCore64 ships as dynamic util-linux binaries that crash in QEMU virt
- Ubuntu's `busybox-static` 1.30.1 doesn't include the applets

Production fix options (deferred to next phase):

- Build a more comprehensive static BusyBox (Alpine's, or upstream + custom config)
- Ship statically-linked `blkid` and `findfs` from util-linux
- Replace LABEL resolution with a sysfs walk that reads `/sys/class/block/*/holders`
  and `/dev/<n>` device numbers

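To make the two parameter forms concrete, here is a minimal sketch of how a stage could read `kubesolo.data=` and accept either a device path or a `LABEL=` value. It is not the actual stage 20 code, which is not shown here; `parse_data_param` and `resolve_data_dev` are hypothetical names. The `LABEL=` branch assumes a working `findfs` (util-linux syntax `findfs LABEL=<label>`), which is exactly what the current rootfs lacks, so it fails cleanly instead of guessing a device.

```shell
#!/bin/sh
# Extract the value of kubesolo.data= from a kernel command line string.
# The last occurrence wins; prints nothing if the parameter is absent.
# The body runs in a subshell so "set -f" does not leak to the caller.
parse_data_param() (
    set -f   # disable pathname expansion while word-splitting the cmdline
    value=
    for word in $1; do
        case "$word" in
            kubesolo.data=*) value=${word#kubesolo.data=} ;;
        esac
    done
    if [ -n "$value" ]; then
        echo "$value"
    fi
)

# Map the parameter value to a block device path.
resolve_data_dev() {
    case "$1" in
        /dev/*)
            echo "$1" ;;
        LABEL=*)
            if command -v findfs >/dev/null 2>&1; then
                findfs "$1"
            else
                echo "findfs not available; cannot resolve $1" >&2
                return 1
            fi ;;
        *)
            echo "unsupported kubesolo.data value: $1" >&2
            return 1 ;;
    esac
}
```

In an init stage this might be used as `dev=$(resolve_data_dev "$(parse_data_param "$(cat /proc/cmdline)")")`, with the caller deciding what to do when resolution fails.
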
### 3. AppArmor profiles fail to load

`apparmor_parser` errors on the containerd and kubelet profiles, probably
because the parser binary or libraries copied from the build host don't
match the rootfs's libc layout. Boot proceeds without AppArmor enforcement.
Same fix path as #2 (better static binaries).

### 4. piCore64 BusyBox swap is a build-host dependency

`inject-kubesolo.sh` replaces piCore's `/bin/busybox` with the build host's
`/bin/busybox` (Ubuntu's busybox-static package). That binary must exist on
the build host or in the builder Docker image. Documented; works in CI
because the Dockerfile installs busybox-static.

A more reproducible approach (future work): ship a known-good ARM64 BusyBox
binary as a tracked artifact rather than depending on the host package.

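Until a tracked binary exists, a pre-flight check before the swap could at least fail the build early when the host BusyBox is missing or lacks applets the init scripts rely on. This is a sketch under assumptions: the function names, the `HOST_BUSYBOX` variable, and any applet list passed in are illustrative and not taken from `inject-kubesolo.sh`. It relies on `busybox --list`, which prints one applet name per line.

```shell
#!/bin/sh
# Print every required applet that is absent from the applet list on stdin.
# Usage: busybox --list | missing_applets mount tr blkid findfs
missing_applets() {
    list=$(cat)
    for applet in "$@"; do
        # -x matches whole lines only, so "tr" does not match "traceroute";
        # -F treats the name literally (some applets, like "[", are regex chars).
        if ! printf '%s\n' "$list" | grep -qxF -- "$applet"; then
            echo "$applet"
        fi
    done
}

# Fail early if the host BusyBox is absent or lacks a required applet.
# HOST_BUSYBOX is a hypothetical override; it defaults to /bin/busybox.
preflight_busybox() {
    bb=${HOST_BUSYBOX:-/bin/busybox}
    if [ ! -x "$bb" ]; then
        echo "missing $bb (install busybox-static?)" >&2
        return 1
    fi
    missing=$("$bb" --list | missing_applets "$@")
    if [ -n "$missing" ]; then
        echo "applets missing from $bb:" $missing >&2
        return 1
    fi
}
```

For example, `preflight_busybox mount tr switch_root` would return non-zero and name the gaps before any rootfs is modified.
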
### 5. busybox-static 1.30.1 has its own bugs

Even after the swap, some applets misbehave inside QEMU:

- `modprobe` triggers a "stack smashing detected" abort (kernel modules still
  load via direct write to /sys/... in stage 30, so this isn't fatal)
- `tr` doesn't parse POSIX character classes like `[:space:]`; already
  worked around by using explicit `' \t\r\n'` in our scripts
- Missing features: the `blkid` and `findfs` applets, the `--version` flag, etc.

These won't necessarily manifest on real hardware (different CPU, different
glibc interaction) but they confirm that 1.30.1 isn't the right long-term
BusyBox.

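The `tr` workaround mentioned above amounts to spelling out the characters to delete instead of naming a class. A minimal sketch, with `strip_ws` as a hypothetical helper name (the scripts may inline the call instead):

```shell
#!/bin/sh
# Delete spaces, tabs, carriage returns and newlines from stdin.
# tr expands the \t, \r and \n escapes itself, so no [:space:] class is needed.
strip_ws() {
    tr -d ' \t\r\n'
}
```

`printf ' virtio_blk \r\n' | strip_ws` prints `virtio_blk`. Note that the explicit set is slightly narrower than `[:space:]`, which also covers vertical tab and form feed; for module names and sysctl keys that difference is unlikely to matter.
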
## What's needed to ship v0.3 ARM64 as production-ready

In order of priority:

1. **Validate on real ARM64 hardware**: boot the image on a Graviton EC2
   instance, Ampere VPS, RPi 5 (when hardware available), or any UEFI-capable
   ARM64 board. Confirm full KubeSolo bring-up: node Ready, pods schedule.
2. **Fix LABEL=KSOLODATA resolution**: see option list in #2 above.
3. **Replace busybox-static with a curated build**: see #4 and #5.
4. **Add a Gitea workflow** that runs `make kernel-arm64` and
   `make disk-image-arm64` on the Odroid runner, then the QEMU boot-test as a
   smoke test (with the expectation that KubeSolo doesn't finish first-boot
   under TCG).

## Files exercised by the Phase 3 work

| Path | Status |
|------|--------|
| `build/scripts/build-kernel-arm64.sh` | New: mainline 6.12.10 kernel build, native or cross |
| `build/scripts/build-kernel-rpi.sh` | Renamed from old `build-kernel-arm64.sh` (RPi path) |
| `build/config/kernel-container.fragment` | Renamed from `rpi-kernel-config.fragment` |
| `build/scripts/create-disk-image.sh` | Refactored: accepts `TARGET_ARCH=arm64` |
| `build/grub/grub-arm64.cfg` | New: ARM64 console + `init=/sbin/init` |
| `build/scripts/inject-kubesolo.sh` | Updated: BusyBox swap, `/init` install, variant routing |
| `init/init.sh` | Updated: output to `/dev/console` for early-boot visibility |
| `init/lib/30-kernel-modules.sh` | Fixed: `tr -d ' \t\r\n'` instead of `[:space:]` |
| `init/lib/40-sysctl.sh` | Same fix |
| `hack/dev-vm-arm64.sh` | Updated: `-cpu max`, UEFI `--disk` mode |
| `test/qemu/test-boot-arm64-disk.sh` | New: CI test for UEFI boot |
| `Makefile` | New targets: `kernel-arm64`, `kernel-rpi`, `disk-image-arm64`, `test-boot-arm64-disk`, `rootfs-arm64-rpi` |
| `build/config/versions.env` | Pinned `MAINLINE_KERNEL_VERSION=6.12.10`, `KUBESOLO_VERSION=v1.1.0` |
| `build/Dockerfile.builder` | Added `grub-efi-amd64-bin`, `grub-efi-arm64-bin`, `busybox-static` |

18
init/init.sh
@@ -14,11 +14,10 @@
 # kubesolo.cloudinit=<path> Path to cloud-init config
 # kubesolo.flags=<flags> Extra flags for KubeSolo binary

-# Redirect ALL output to /dev/console for visibility during early boot.
-# This is temporary v0.3 ARM64 debugging — revert when generic ARM64 is stable.
+# Route early boot output to /dev/console — before switch_root the kernel may
+# not have a controlling tty, and some stages echo to stderr expecting it to
+# reach the serial console. This is a no-op once the staged init proper starts.
 exec >/dev/console 2>&1
-echo "[KSOLO-DBG] init.sh PID=$$ shell=$(readlink -f /proc/$$/exe 2>/dev/null || echo unknown)"
-echo "[KSOLO-DBG] uname=$(uname -a 2>&1)"

 set -e

@@ -28,22 +27,11 @@ set -e
 # set up container root filesystems. To fix this, we copy the rootfs to a
 # tmpfs and switch_root to it. The sentinel file prevents infinite loops.
 if [ ! -f /etc/.switched_root ]; then
-    echo "[KSOLO-DBG] entering switch_root block"
-    echo "[KSOLO-DBG] mount proc..."
     mount -t proc proc /proc 2>/dev/null || true
-    echo "[KSOLO-DBG] mount proc exit=$?"
-    echo "[KSOLO-DBG] mount sysfs..."
     mount -t sysfs sysfs /sys 2>/dev/null || true
-    echo "[KSOLO-DBG] mount sysfs exit=$?"
-    echo "[KSOLO-DBG] mount devtmpfs..."
     mount -t devtmpfs devtmpfs /dev 2>/dev/null || true
-    echo "[KSOLO-DBG] mount devtmpfs exit=$?"
-    echo "[KSOLO-DBG] mkdir /mnt/newroot..."
     mkdir -p /mnt/newroot
-    echo "[KSOLO-DBG] mkdir /mnt/newroot exit=$?"
-    echo "[KSOLO-DBG] mount tmpfs /mnt/newroot..."
     mount -t tmpfs -o size=400M,mode=755 tmpfs /mnt/newroot
-    echo "[KSOLO-DBG] mount tmpfs exit=$?"
     echo "[init] Copying rootfs to tmpfs..." >&2
     # Copy each top-level directory explicitly (BusyBox cp -ax on rootfs is broken)
     for d in bin sbin usr lib lib64 etc var opt; do