Files
kubesolo-os/update/pkg/state/state.go
Adolfo Delorenzo 9fb894c5af
Some checks failed
ARM64 Build / Build generic ARM64 disk image (push) Failing after 4s
CI / Go Tests (push) Successful in 1m29s
CI / Shellcheck (push) Successful in 48s
CI / Build Go Binaries (amd64, linux, linux-amd64) (push) Successful in 1m12s
CI / Build Go Binaries (arm64, linux, linux-arm64) (push) Has been cancelled
feat(update): pre-flight gates + deeper healthcheck + auto-rollback
Phase 8 of v0.3. Tightens the update lifecycle on both ends.

Pre-flight (apply.go, before any download):
- Free-space check on the passive partition: image size + 10% headroom must
  be available. Uses statfs(2) via the new pkg/partition.FreeBytes /
  HasFreeSpaceFor helpers (tests cover happy path, tiny request, huge
  request, missing path). Catches corrupted-FS and shrunk-partition cases
  before we destroy the existing slot data.
- Node-block-label check: refuses if the local K8s node carries the
  updates.kubesolo.io/block=true label. New pkg/health.CheckNodeBlocked
  shells out to kubectl per the project's zero-deps stance. Silently bypassed
  when no kubeconfig is reachable (air-gap case). Skipped by --force.

Healthcheck (extended via new pkg/health/extended.go + preflight.go):
- CheckKubeSystemReady waits until every kube-system pod has held the Running
  phase for >= N seconds (default 30). Catches "started ok, will crash-loop"
  bugs that a single-shot phase check misses.
- CheckProbeURL fetches an operator-supplied URL; 200 = pass. Wired through
  update.conf as healthcheck_url= and cloud-init updates.healthcheck_url.
- CheckDiskWritable writes/fsyncs/reads a 1-KiB probe under /var/lib/kubesolo.
  Always runs in healthcheck so a wedged data partition fails fast.
- pkg/health.Status grows KubeSystemReady, ProbeURL, DiskWritable booleans.
  Optional checks default to true in RunAll() so they don't block when
  unconfigured. health_test.go updated to the new 6-field shape.

Auto-rollback (healthcheck.go):
- state.UpdateState gains HealthCheckFailures (consecutive post-Activated
  failures). Reset on a clean pass.
- --auto-rollback-after N (also auto_rollback_after= in update.conf) triggers
  env.ForceRollback() when the failure count reaches the threshold. State
  transitions to RolledBack with a descriptive LastError. The command still
  exits with the healthcheck error; the operator/init is expected to reboot.
- Only fires while Phase == Activated. Doesn't second-guess a long-stable
  system that happens to fail one healthcheck.

config / opts / cloud-init plumbing:
- update.conf gains healthcheck_url= and auto_rollback_after= keys.
- New CLI flags: --healthcheck-url, --auto-rollback-after, --kube-system-settle.
- cloud-init full-config.yaml documents the new updates: subfields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 19:08:30 -06:00

207 lines
7.1 KiB
Go

// Package state tracks the lifecycle of an OS update on disk.
//
// The state file (default /var/lib/kubesolo/update/state.json) records which
// phase the agent is in, what versions are involved, when the attempt started,
// any error from the last operation, and how many attempts have been made.
// Updates are atomic via tmp+rename, so a crash mid-write doesn't corrupt the
// state.
//
// Consumers:
// - cmd/check, cmd/apply, cmd/activate, cmd/healthcheck, cmd/rollback —
// transition the phase as they enter / leave their operations.
// - cmd/status --json — emits the raw state for orchestration tooling.
// - pkg/metrics — reads the state at scrape time to expose phase and
// attempt-count gauges.
package state
import (
"encoding/json"
"fmt"
"os"
"path/filepath"
"time"
)
// DefaultPath is where state.json lives on a live system. The directory is on
// the persistent data partition so the file survives A/B slot switches.
const DefaultPath = "/var/lib/kubesolo/update/state.json"
// Phase represents the current step in the update lifecycle.
//
// Terminal phases (Success, RolledBack, Failed) describe the outcome of the
// most recent attempt; transient phases (Checking, Downloading, Staged,
// Activated, Verifying) describe in-progress work. Idle means no update has
// been attempted yet, or the previous attempt has been acknowledged.
type Phase string
const (
// PhaseIdle — no update in progress.
PhaseIdle Phase = "idle"
// PhaseChecking — querying the update server for new versions.
PhaseChecking Phase = "checking"
// PhaseDownloading — pulling artifacts from the server.
PhaseDownloading Phase = "downloading"
// PhaseStaged — artifacts written to the passive partition; not yet active.
PhaseStaged Phase = "staged"
// PhaseActivated — passive slot promoted; next boot will use the new version.
PhaseActivated Phase = "activated"
// PhaseVerifying — post-boot healthcheck in progress on the new version.
PhaseVerifying Phase = "verifying"
// PhaseSuccess — last attempt completed and verified.
PhaseSuccess Phase = "success"
// PhaseRolledBack — last attempt failed verification; reverted to prior slot.
PhaseRolledBack Phase = "rolled_back"
// PhaseFailed — last attempt failed before reaching activation (download,
// checksum, signature, etc.). System still on the original slot.
PhaseFailed Phase = "failed"
)
// validPhases lists every legal Phase value. Anything not in this set is
// rejected by Save() to catch typos.
var validPhases = map[Phase]struct{}{
PhaseIdle: {},
PhaseChecking: {},
PhaseDownloading: {},
PhaseStaged: {},
PhaseActivated: {},
PhaseVerifying: {},
PhaseSuccess: {},
PhaseRolledBack: {},
PhaseFailed: {},
}
// UpdateState is the on-disk representation. Fields use JSON tags so the
// file format is forward-compatible (extra fields ignored, missing fields
// default).
type UpdateState struct {
// Phase is the current lifecycle position.
Phase Phase `json:"phase"`
// FromVersion is the version the system was running before the attempt.
// Empty when no attempt has run.
FromVersion string `json:"from_version,omitempty"`
// ToVersion is the version the attempt is targeting.
// Empty when no attempt has run.
ToVersion string `json:"to_version,omitempty"`
// StartedAt is when the current attempt entered a non-Idle phase.
StartedAt time.Time `json:"started_at,omitempty"`
// UpdatedAt is the last time the file was written. Always set on Save().
UpdatedAt time.Time `json:"updated_at"`
// LastError carries the most recent operation error, populated when
// transitioning to PhaseFailed or PhaseRolledBack. Cleared on Success/Idle.
LastError string `json:"last_error,omitempty"`
// AttemptCount counts attempts at the current ToVersion. Reset when
// ToVersion changes or on successful completion.
AttemptCount int `json:"attempt_count"`
// HealthCheckFailures counts consecutive post-Activated healthcheck
// failures. Reset to 0 on a successful healthcheck or after a rollback.
// Used by `kubesolo-update healthcheck --auto-rollback-after N` to
// trigger automatic recovery on a wedged new boot.
HealthCheckFailures int `json:"health_check_failures,omitempty"`
}
// New returns a fresh Idle state with UpdatedAt set to now.
func New() *UpdateState {
return &UpdateState{
Phase: PhaseIdle,
UpdatedAt: time.Now().UTC(),
}
}
// Load reads the state from disk. If the file does not exist, returns a fresh
// Idle state — this is the normal first-run case, not an error.
func Load(path string) (*UpdateState, error) {
data, err := os.ReadFile(path)
if err != nil {
if os.IsNotExist(err) {
return New(), nil
}
return nil, fmt.Errorf("read state %s: %w", path, err)
}
var s UpdateState
if err := json.Unmarshal(data, &s); err != nil {
return nil, fmt.Errorf("parse state %s: %w", path, err)
}
return &s, nil
}
// Save writes the state to disk atomically (tmp file + rename), so an
// interrupted write never leaves a partial file at `path`.
func (s *UpdateState) Save(path string) error {
if _, ok := validPhases[s.Phase]; !ok {
return fmt.Errorf("invalid phase %q", s.Phase)
}
s.UpdatedAt = time.Now().UTC()
if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
return fmt.Errorf("creating state dir: %w", err)
}
data, err := json.MarshalIndent(s, "", " ")
if err != nil {
return fmt.Errorf("marshal state: %w", err)
}
data = append(data, '\n')
tmp := path + ".tmp"
if err := os.WriteFile(tmp, data, 0o644); err != nil {
return fmt.Errorf("write tmp state: %w", err)
}
if err := os.Rename(tmp, path); err != nil {
_ = os.Remove(tmp)
return fmt.Errorf("rename state: %w", err)
}
return nil
}
// Transition moves the state to phase `next` and persists it. If `next`
// targets a new ToVersion (different from the current one), AttemptCount is
// reset to 1; otherwise it is left untouched. StartedAt is set when
// transitioning out of Idle. LastError is cleared unless `next` is Failed or
// RolledBack.
func (s *UpdateState) Transition(path string, next Phase, toVersion, errMsg string) error {
now := time.Now().UTC()
// Reset attempt counter when targeting a new version.
if toVersion != "" && toVersion != s.ToVersion {
s.ToVersion = toVersion
s.AttemptCount = 0
}
// First non-Idle phase of an attempt: record start time and bump count.
if s.Phase == PhaseIdle && next != PhaseIdle {
s.StartedAt = now
s.AttemptCount++
}
s.Phase = next
switch next {
case PhaseFailed, PhaseRolledBack:
if errMsg != "" {
s.LastError = errMsg
}
case PhaseSuccess, PhaseIdle:
s.LastError = ""
}
return s.Save(path)
}
// RecordError marks the state as failed with the given error and saves.
// Convenience wrapper around Transition for the most common failure path.
func (s *UpdateState) RecordError(path string, err error) error {
msg := ""
if err != nil {
msg = err.Error()
}
return s.Transition(path, PhaseFailed, "", msg)
}
// SetFromVersion records the version the system was running when an attempt
// started. Idempotent; only takes effect when From is empty.
func (s *UpdateState) SetFromVersion(v string) {
if s.FromVersion == "" {
s.FromVersion = v
}
}