Phase 8 of v0.3. Tightens the update lifecycle on both ends.
Pre-flight (apply.go, before any download):
- Free-space check on the passive partition: image size + 10% headroom must
be available. Uses statfs(2) via the new pkg/partition.FreeBytes /
HasFreeSpaceFor helpers (tests cover happy path, tiny request, huge
request, missing path). Catches corrupted-FS and shrunk-partition cases
before we destroy the existing slot data.
- Node-block-label check: refuses if the local K8s node carries the
updates.kubesolo.io/block=true label. New pkg/health.CheckNodeBlocked
shells out to kubectl per the project's zero-deps stance. Silently bypassed
when no kubeconfig is reachable (air-gap case). Skipped by --force.
Healthcheck (extended via new pkg/health/extended.go + preflight.go):
- CheckKubeSystemReady waits until every kube-system pod has held the Running
phase for >= N seconds (default 30). Catches "started ok, will crash-loop"
bugs that a single-shot phase check misses.
- CheckProbeURL fetches an operator-supplied URL; 200 = pass. Wired through
update.conf as healthcheck_url= and cloud-init updates.healthcheck_url.
- CheckDiskWritable writes/fsyncs/reads a 1-KiB probe under /var/lib/kubesolo.
Always runs in healthcheck so a wedged data partition fails fast.
- pkg/health.Status grows KubeSystemReady, ProbeURL, DiskWritable booleans.
Optional checks default to true in RunAll() so they don't block when
unconfigured. health_test.go updated to the new 6-field shape.
Auto-rollback (healthcheck.go):
- state.UpdateState gains HealthCheckFailures (consecutive post-Activated
failures). Reset on a clean pass.
- --auto-rollback-after N (also auto_rollback_after= in update.conf) triggers
env.ForceRollback() when the failure count reaches the threshold. State
transitions to RolledBack with a descriptive LastError. The command still
exits with the healthcheck error; the operator/init is expected to reboot.
- Only fires while Phase == Activated. Doesn't second-guess a long-stable
system that happens to fail one healthcheck.
config / opts / cloud-init plumbing:
- update.conf gains healthcheck_url= and auto_rollback_after= keys.
- New CLI flags: --healthcheck-url, --auto-rollback-after, --kube-system-settle.
- cloud-init full-config.yaml documents the new updates: subfields.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 6 of v0.3. The update agent now refuses to apply artifacts whose
channel doesn't match local policy, whose architecture differs from the
running host, or whose min_compatible_version is above the current
version. It also refuses to apply outside a configured maintenance window
unless --force is given.
New package update/pkg/config:
- config.Load parses /etc/kubesolo/update.conf (key=value, # comments,
unknown keys ignored). Missing file is fine — fresh systems before
cloud-init has run.
- ParseWindow handles "HH:MM-HH:MM" plus the wrapping midnight case
(e.g. "23:00-01:00"). Empty input -> AlwaysOpen (no constraint).
Degenerate zero-length windows never match.
- CompareVersions does a simple 3-component semver compare with the 'v'
prefix optional and pre-release suffix ignored.
- 14 unit tests total.
update/pkg/image/image.UpdateMetadata gains three optional fields:
- channel ("stable", "beta", ...)
- min_compatible_version (refuse upgrade if current < this)
- architecture ("amd64", "arm64", ...)
update/cmd/opts.go reads update.conf and merges it into opts; explicit
--server / --channel / --pubkey / --maintenance-window CLI flags override
the file. New --force, --conf, --channel, --maintenance-window flags.
Precedence: CLI > config file > package defaults.
update/cmd/apply.go gains four gates in order:
1. Maintenance window — checked locally before any HTTP work; skipped
with --force.
2. Channel — refused if metadata.channel doesn't match opts.Channel.
3. Architecture — refused if metadata.architecture != runtime.GOARCH.
4. Min compatible version — refused if FromVersion < min_compatible.
All gate failures transition state to Failed with a clear LastError.
cloud-init gains a top-level updates: block (Server, Channel,
MaintenanceWindow, PubKey). cloud-init.ApplyUpdates writes
/etc/kubesolo/update.conf from those fields on first boot. Empty block
leaves any existing file alone (so hand-edited update.conf survives a
reboot without cloud-init re-applying). 4 new tests cover empty / all /
partial / parent-dir-creation cases. full-config.yaml example updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of v0.3 — KubeSolo version bump and CI gating.
KubeSolo v1.1.0 → v1.1.5 brings:
- New flag --disable-ipv6 (v1.1.5)
- New flag --db-wal-repair (v1.1.5) — important for power-loss resilience
on edge appliances; surfaced as kubesolo.db-wal-repair in cloud-init
- New flag --full (v1.1.4) — disables edge-optimised k8s overrides
- Pod egress connectivity fix after reboot (v1.1.4)
- Registry config persistence fix (v1.1.5)
- k8s 1.34.7, CoreDNS 1.14.3, Go 1.26.2
All three new flags wired into cloud-init: config.go fields, kubesolo.go
extra-flag emission, full-config.yaml example.
Supply-chain hygiene:
- Per-arch checksums: KUBESOLO_SHA256_AMD64 and KUBESOLO_SHA256_ARM64 in
versions.env. Replaces the single shared KUBESOLO_SHA256 that couldn't
meaningfully verify both binaries at once.
- Checksum now applied to the tarball (the immutable upstream artifact)
rather than the post-extract binary.
CI:
- New .gitea/workflows/build-arm64.yaml routes the full kernel + rootfs +
disk-image build to the Odroid arm64-linux runner. Triggers on push to
main, tags, and manual workflow_dispatch. The boot smoke test is
continue-on-error because KubeSolo's first-boot image import deadline
fires under QEMU TCG on the Odroid.
VERSION bumped to 0.3.0-dev. CHANGELOG entry under [0.3.0-dev] captures all
Phase 1-4 work + the known limitations documented in arm64-status.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add missing flags (--local-storage-shared-path, --debug, --pprof-server,
--portainer-edge-id, --portainer-edge-key, --portainer-edge-async) so all
10 documented KubeSolo parameters can be configured via cloud-init YAML.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- dev-vm.sh: rewrite for macOS (bsdtar ISO extraction, Homebrew mkfs.ext4
detection, direct kernel boot, TCG acceleration, port 8080 forwarding)
- inject-kubesolo.sh: add CA certificates bundle from builder so containerd
can verify TLS when pulling from registries (Docker Hub, etc.)
- 50-network.sh: add DNS fallback (10.0.2.3 + 8.8.8.8) when DHCP client
doesn't populate /etc/resolv.conf
- 90-kubesolo.sh: serve kubeconfig via HTTP on port 8080 for reliable
retrieval from host, add 127.0.0.1 and 10.0.2.15 to API server SANs
- portainer.go: add headless Service to Edge Agent manifest (required for
agent peer discovery DNS lookup)
- 10-parse-cmdline.sh + init.sh: add kubesolo.edge_id/edge_key boot params
- 20-persistent-mount.sh: auto-format unformatted data disks on first boot
- hack/fix-portainer-service.sh: helper to patch running cluster
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement a lightweight cloud-init system for first-boot configuration:
- Go parser for YAML config (hostname, network, KubeSolo settings)
- Static/DHCP network modes with DNS override
- KubeSolo extra flags and API server SAN configuration
- Portainer Edge Agent and air-gapped deployment support
- New init stage 45-cloud-init.sh runs before network/hostname stages
- Stages 50/60 skip gracefully when cloud-init has already applied
- Build script compiles static Linux/amd64 binary (~2.7 MB)
- 17 unit tests covering parsing, validation, and example files
- Full documentation at docs/cloud-init.md
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>