7e46f8fdc25df21fa73ff138474ef4dcb66e8427
7 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
| 9fb894c5af |
feat(update): pre-flight gates + deeper healthcheck + auto-rollback
Some checks failed
ARM64 Build / Build generic ARM64 disk image (push) Failing after 4s
CI / Go Tests (push) Successful in 1m29s
CI / Shellcheck (push) Successful in 48s
CI / Build Go Binaries (amd64, linux, linux-amd64) (push) Successful in 1m12s
CI / Build Go Binaries (arm64, linux, linux-arm64) (push) Has been cancelled
Phase 8 of v0.3. Tightens the update lifecycle on both ends.
Pre-flight (apply.go, before any download):
- Free-space check on the passive partition: image size + 10% headroom
must be available. Uses statfs(2) via the new pkg/partition.FreeBytes /
HasFreeSpaceFor helpers (tests cover happy path, tiny request, huge
request, missing path). Catches corrupted-FS and shrunk-partition cases
before we destroy the existing slot data.
- Node-block-label check: refuses if the local K8s node carries the
updates.kubesolo.io/block=true label. New pkg/health.CheckNodeBlocked
shells out to kubectl per the project's zero-deps stance. Silently
bypassed when no kubeconfig is reachable (air-gap case). Skipped by
--force.
Healthcheck (extended via new pkg/health/extended.go + preflight.go):
- CheckKubeSystemReady waits until every kube-system pod has held the
Running phase for >= N seconds (default 30). Catches "started ok, will
crash-loop" bugs that a single-shot phase check misses.
- CheckProbeURL fetches an operator-supplied URL; 200 = pass. Wired
through update.conf as healthcheck_url= and cloud-init
updates.healthcheck_url.
- CheckDiskWritable writes/fsyncs/reads a 1-KiB probe under
/var/lib/kubesolo. Always runs in healthcheck so a wedged data partition
fails fast.
- pkg/health.Status grows KubeSystemReady, ProbeURL, DiskWritable
booleans. Optional checks default to true in RunAll() so they don't
block when unconfigured. health_test.go updated to the new 6-field shape.
Auto-rollback (healthcheck.go):
- state.UpdateState gains HealthCheckFailures (consecutive post-Activated
failures). Reset on a clean pass.
- --auto-rollback-after N (also auto_rollback_after= in update.conf)
triggers env.ForceRollback() when the failure count reaches the
threshold. State transitions to RolledBack with a descriptive LastError.
The command still exits with the healthcheck error; the operator/init is
expected to reboot.
- Only fires while Phase == Activated. Doesn't second-guess a long-stable
system that happens to fail one healthcheck.
config / opts / cloud-init plumbing:
- update.conf gains healthcheck_url= and auto_rollback_after= keys.
- New CLI flags: --healthcheck-url, --auto-rollback-after,
--kube-system-settle.
- cloud-init full-config.yaml documents the new updates: subfields.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
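The free-space pre-flight described in this commit (image size plus 10% headroom, measured with statfs(2)) could look roughly like the sketch below. The function names `requiredBytes`, `freeBytes` and `hasFreeSpaceFor` are illustrative stand-ins, not the actual pkg/partition API, and the integer rounding of the headroom is an assumption.

```go
package main

import (
	"fmt"
	"syscall"
)

// requiredBytes returns the image size plus 10% headroom.
// Integer division rounds the headroom down (an assumption of this sketch).
func requiredBytes(imageSize uint64) uint64 {
	return imageSize + imageSize/10
}

// freeBytes reports the bytes available to unprivileged callers on the
// filesystem that contains path, using statfs(2).
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, fmt.Errorf("statfs %s: %w", path, err)
	}
	return uint64(st.Bavail) * uint64(st.Bsize), nil
}

// hasFreeSpaceFor reports whether path has room for the image plus headroom.
func hasFreeSpaceFor(path string, imageSize uint64) (bool, error) {
	free, err := freeBytes(path)
	if err != nil {
		return false, err
	}
	return free >= requiredBytes(imageSize), nil
}

func main() {
	// 64 MiB is an arbitrary illustrative image size.
	ok, err := hasFreeSpaceFor("/", 64<<20)
	fmt.Println(ok, err)
}
```

A missing mount point surfaces as an error rather than as "no space", which keeps the two failure modes distinguishable in the caller.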
|||
| 28de656b97 |
feat(update): OCI registry distribution for update artifacts
Some checks failed
ARM64 Build / Build generic ARM64 disk image (push) Failing after 4s
CI / Go Tests (push) Successful in 1m28s
CI / Shellcheck (push) Successful in 45s
CI / Build Go Binaries (amd64, linux, linux-amd64) (push) Successful in 1m17s
CI / Build Go Binaries (arm64, linux, linux-arm64) (push) Successful in 1m13s
Phase 7 of v0.3. The update agent can now pull update artifacts from any
OCI-compliant registry (ghcr.io, quay.io, harbor, zot, etc.) alongside the
existing HTTP latest.json protocol. Multi-arch artifacts are resolved
through manifest indexes so the same tag (e.g. "stable") yields the
right kernel + initramfs for runtime.GOARCH.
New package update/pkg/oci (~280 LOC, 9 tests):
- Client wraps oras-go/v2's remote.Repository. NewClient parses
host/path references; WithPlainHTTP toggle for httptest.
- FetchMetadata resolves a tag and returns image.UpdateMetadata from
manifest annotations (io.kubesolo.os.{version,channel,architecture,
min_compatible_version,release_notes,release_date}). No blobs fetched.
- Pull resolves the tag, walks index → arch-specific manifest, downloads
kernel + initramfs layers identified by their custom media types
(application/vnd.kubesolo.os.kernel.v1+octet-stream and
application/vnd.kubesolo.os.initramfs.v1+gzip), verifies their digests
against the manifest, returns the same image.StagedImage shape the
HTTP client produces.
- Cross-arch single-arch manifests are refused via the AnnotArch check
(defense in depth on top of the gates in cmd/apply.go).
- Tests use a hand-rolled httptest registry implementing /v2/probe,
manifest fetch by tag-or-digest, blob fetch by digest. Cover index
arch-selection, single-arch manifests, missing-arch error, tampered
blob rejection (digest mismatch), and reference parsing.
Dependencies added: oras.land/oras-go/v2 v2.6.0 plus its transitive
opencontainers/{go-digest,image-spec} and golang.org/x/sync. All small
and well-maintained; total binary size impact is negligible relative to
the existing 6.1 MB update agent.
cmd/apply.go:
- New --registry and --tag flags; mutually exclusive with --server.
- applyMetadataGates extracted as a helper, called from both transports
so channel/arch/min-version policy is enforced identically regardless
of how metadata was fetched.
- State transitions identical to the HTTP path: Checking → Downloading
→ Staged, with RecordError on any failure.
cmd/opts.go: --registry, --tag CLI flags. update.conf "server=" already
accepts either an HTTP URL or an OCI ref; the agent distinguishes by
which CLI/conf field carries the value.
build/scripts/push-oci-artifact.sh: new tool that publishes a single-arch
update artifact via the oras CLI with our custom media types and
annotations. After running for each arch, the operator composes the
multi-arch index with `oras manifest index create`. Documented inline.
build/Dockerfile.builder: installs oras 1.2.3 from upstream releases so
the Gitea Actions build container can run the new script.
Signature verification on the OCI path is intentionally deferred — the
artifact format is digest-verified end-to-end via oras-go, and Ed25519
signature consumption via OCI referrers is a follow-up. Plain HTTP
clients keep their existing signature path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| dfed6ddba8 |
feat(update): channels, maintenance windows, min-version gate
Some checks failed
ARM64 Build / Build generic ARM64 disk image (push) Failing after 3s
CI / Go Tests (push) Successful in 1m23s
CI / Shellcheck (push) Successful in 46s
CI / Build Go Binaries (amd64, linux, linux-amd64) (push) Successful in 1m32s
CI / Build Go Binaries (arm64, linux, linux-arm64) (push) Successful in 1m15s
Phase 6 of v0.3. The update agent now refuses to apply artifacts whose
channel doesn't match local policy, whose architecture differs from the
running host, or whose min_compatible_version is above the current
version. It also refuses to apply outside a configured maintenance window
unless --force is given.
New package update/pkg/config:
- config.Load parses /etc/kubesolo/update.conf (key=value, # comments,
unknown keys ignored). Missing file is fine — fresh systems before
cloud-init has run.
- ParseWindow handles "HH:MM-HH:MM" plus the wrapping midnight case
(e.g. "23:00-01:00"). Empty input -> AlwaysOpen (no constraint).
Degenerate zero-length windows never match.
- CompareVersions does a simple 3-component semver compare with the 'v'
prefix optional and pre-release suffix ignored.
- 14 unit tests total.
update/pkg/image/image.UpdateMetadata gains three optional fields:
- channel ("stable", "beta", ...)
- min_compatible_version (refuse upgrade if current < this)
- architecture ("amd64", "arm64", ...)
update/cmd/opts.go reads update.conf and merges it into opts; explicit
--server / --channel / --pubkey / --maintenance-window CLI flags override
the file. New --force, --conf, --channel, --maintenance-window flags.
Precedence: CLI > config file > package defaults.
update/cmd/apply.go gains four gates in order:
1. Maintenance window — checked locally before any HTTP work; skipped
with --force.
2. Channel — refused if metadata.channel doesn't match opts.Channel.
3. Architecture — refused if metadata.architecture != runtime.GOARCH.
4. Min compatible version — refused if FromVersion < min_compatible.
All gate failures transition state to Failed with a clear LastError.
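The min-compatible-version gate depends on the 3-component comparison described for CompareVersions earlier in this message. A sketch under those stated rules (optional "v" prefix, pre-release suffix ignored) follows; the error handling and the treatment of "+build" suffixes are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "vMAJOR.MINOR.PATCH[-suffix]" into three integers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i] // drop pre-release / build suffix
	}
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("version %q: want MAJOR.MINOR.PATCH", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("version %q: bad component %q", v, p)
		}
		out[i] = n
	}
	return out, nil
}

// CompareVersions returns -1, 0 or 1 when a is lower than, equal to,
// or higher than b.
func CompareVersions(a, b string) (int, error) {
	va, err := parseVersion(a)
	if err != nil {
		return 0, err
	}
	vb, err := parseVersion(b)
	if err != nil {
		return 0, err
	}
	for i := range va {
		if va[i] < vb[i] {
			return -1, nil
		}
		if va[i] > vb[i] {
			return 1, nil
		}
	}
	return 0, nil
}

func main() {
	// Gate 4: refuse when the running version is below min_compatible_version.
	cmp, _ := CompareVersions("0.2.9", "v0.3.0")
	fmt.Println("refuse:", cmp < 0)
}
```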
cloud-init gains a top-level updates: block (Server, Channel,
MaintenanceWindow, PubKey). cloud-init.ApplyUpdates writes
/etc/kubesolo/update.conf from those fields on first boot. Empty block
leaves any existing file alone (so hand-edited update.conf survives a
reboot without cloud-init re-applying). 4 new tests cover empty / all /
partial / parent-dir-creation cases. full-config.yaml example updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| bce565e2f7 |
feat(update): persistent state machine + lifecycle metrics
Some checks failed
ARM64 Build / Build generic ARM64 disk image (push) Failing after 4s
CI / Go Tests (push) Successful in 1m31s
CI / Shellcheck (push) Successful in 47s
CI / Build Go Binaries (amd64, linux, linux-amd64) (push) Failing after 10s
CI / Build Go Binaries (arm64, linux, linux-arm64) (push) Failing after 16s
Phase 5 of v0.3. Adds an explicit, on-disk state machine to the update agent
so the lifecycle of an attempt is observable end-to-end, instead of being
inferred from logs and side effects.
New package update/pkg/state:
- Phase enum (idle, checking, downloading, staged, activated, verifying,
success, rolled_back, failed)
- UpdateState struct persisted to /var/lib/kubesolo/update/state.json
(overridable via --state). Atomic write (.tmp + rename). Survives reboots
and slot switches because the file lives on the data partition.
- Transition helper that bumps AttemptCount when an attempt starts, resets
it when the target version changes, sets/clears LastError on
failed/success transitions, and stamps StartedAt + UpdatedAt.
- 13 unit tests cover the lifecycle, atomic write, version-change reset,
error recording, idempotent SetFromVersion, garbage-file handling.
Wired into the existing commands:
- apply.go transitions Idle -> Checking -> Downloading -> Staged, with
RecordError on any step failure. Reads the active slot's version file to
populate FromVersion.
- activate.go transitions to Activated.
- healthcheck.go transitions Activated -> Verifying -> Success on pass,
or to Failed on fail. Skips transitions if state isn't post-activation
(manual healthcheck on a stable system shouldn't churn the state).
- rollback.go transitions to RolledBack with LastError="manual rollback".
- check.go intentionally untouched — checks are passive queries, not
attempts; they shouldn't reset AttemptCount.
status.go gains a --json mode that emits the full state report (A/B slots,
boot counter, full UpdateState) for orchestration tooling. Human-readable
mode also prints an Update Lifecycle section when state.phase != idle.
pkg/metrics gains three new series, derived from state.json at scrape time:
- kubesolo_update_phase{phase="..."} — 1 for current, 0 for all others;
all nine phase values always emitted so dashboards see complete series
- kubesolo_update_attempts_total
- kubesolo_update_last_attempt_timestamp_seconds
Server.SetStatePath() configures the file location; defaults to absent
which emits Idle defaults. Three new tests cover the absent / active /
all-phases-emitted cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| efc7f80b65 |
feat: add security hardening, AppArmor, and ARM64 Raspberry Pi support (Phase 6)
Security hardening: bind kubeconfig server to localhost, mount hardening
(noexec/nosuid/nodev on tmpfs), sysctl network hardening, kernel module
loading lock after boot, SHA256 checksum verification for downloads,
kernel AppArmor + Audit support, complain-mode AppArmor profiles for
containerd and kubelet, and security integration test.
ARM64 Raspberry Pi support: piCore64 base extraction, RPi kernel build
from raspberrypi/linux fork, RPi firmware fetch, SD card image with
4-partition GPT and tryboot A/B mechanism, BootEnv Go interface
abstracting GRUB vs RPi boot environments, architecture-aware build
scripts, QEMU aarch64 dev VM and boot test.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|||
| 49a37e30e8 |
feat: add production hardening — Ed25519 signing, Portainer Edge, SSH extension (Phase 4)
Image signing:
- Ed25519 sign/verify package (pure Go stdlib, zero deps)
- genkey and sign CLI subcommands for build system
- Optional --pubkey flag for verifying updates on apply
- Signature URLs in update metadata (latest.json)
Portainer Edge Agent:
- cloud-init portainer.go module writes K8s manifest
- Auto-deploys Edge Agent when portainer.edge-agent.enabled
- Full RBAC (ServiceAccount, ClusterRoleBinding, Deployment)
- 5 Portainer tests in portainer_test.go
Production tooling:
- SSH debug extension builder (hack/build-ssh-extension.sh)
- Boot performance benchmark (test/benchmark/bench-boot.sh)
- Resource usage benchmark (test/benchmark/bench-resources.sh)
- Deployment guide (docs/deployment-guide.md)
Test results: 50 update agent tests + 22 cloud-init tests passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|||
| 8d25e1890e |
feat: add A/B partition updates with GRUB and Go update agent (Phase 3)
Implement atomic OS updates via A/B partition scheme with automatic
rollback. GRUB bootloader manages slot selection with a 3-attempt boot
counter that auto-rolls back on repeated health check failures.
GRUB boot config:
- A/B slot selection with boot_counter/boot_success env vars
- Automatic rollback when counter reaches 0 (3 failed boots)
- Debug, emergency shell, and manual slot-switch menu entries
Disk image (refactored):
- 4-partition GPT layout: EFI + System A + System B + Data
- GRUB EFI/BIOS installation with graceful fallbacks
- Both system partitions populated during image creation
Update agent (Go, zero external deps):
- pkg/grubenv: read/write GRUB env vars (grub-editenv + manual fallback)
- pkg/partition: find/mount/write system partitions by label
- pkg/image: HTTP download with SHA256 verification
- pkg/health: post-boot checks (containerd, API server, node Ready)
- 6 CLI commands: check, apply, activate, rollback, healthcheck, status
- 37 unit tests across all 4 packages
Deployment:
- K8s CronJob for automatic update checks (every 6 hours)
- ConfigMap for update server URL
- Health check Job for post-boot verification
Build pipeline:
- build-update-agent.sh compiles static Linux binary (~5.9 MB)
- inject-kubesolo.sh includes update agent in initramfs
- Makefile: build-update-agent, test-update-agent, test-update targets
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|