Phase 3: Operator Experience - Research

Researched: 2026-03-23
Domain: Slack OAuth V2, Stripe Subscriptions, BYO API Key Encryption, Cost Dashboard
Confidence: HIGH (core stack verified against official docs)


<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

  • Slack connection via standard OAuth2 "Add to Slack" flow — operator clicks button, authorizes, tokens stored automatically
  • WhatsApp connection: guided manual setup (Claude's discretion confirmed)
  • After connecting a channel, the wizard MUST include a "send test message" step — required, not optional
  • Test message verifies end-to-end connectivity before the agent goes live
  • Onboarding sequence: Connect Channel → Configure Agent → Send Test Message
  • Agent goes live automatically after the test message succeeds — no separate "Go Live" button
  • Pricing model: per-agent monthly (e.g., $49/agent/month)
  • 14-day free trial with full access, credit card required upfront
  • Subscription management via Stripe: subscribe, upgrade (add agents), downgrade (remove agents), cancel
  • LLM-03 resolved: BYO API keys IS in v1 scope (Phase 3)
  • Cost metrics: token usage per agent, cost breakdown by LLM provider, message volume per agent/channel, budget alerts
  • Budget alerts: visual indicator when approaching or exceeding per-agent budget limits (from AGNT-07)

Claude's Discretion

  • WhatsApp connection method (guided manual vs embedded signup)
  • Stepper UI for onboarding (yes/no, visual style)
  • Non-payment enforcement behavior
  • BYO key scope (tenant-level settings page vs per-agent)
  • Cost dashboard time range options
  • Dashboard chart library (recharts, nivo, etc.)
  • Stripe webhook event handling strategy (idempotency, retry)

Deferred Ideas (OUT OF SCOPE)

None: discussion stayed within phase scope

</user_constraints>


<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|---|---|---|
| AGNT-07 | Agent token usage tracked per-agent per-tenant with configurable budget limits | Audit event JSONB metadata must store prompt_tokens, completion_tokens, provider; budget limit stored on the Agent model (budget_limit_usd, see Open Question 4); alert threshold query pattern documented |
| LLM-03 | Tenant can provide their own API keys for supported LLM providers (BYO keys, encrypted at rest) | Fernet AES-128-CBC with HMAC-SHA256; envelope encryption pattern; new tenant_llm_keys table; LiteLLM routing integration |
| PRTA-03 | Operator can connect messaging channels (Slack, WhatsApp) via guided wizard | Slack OAuth V2 flow; required scopes; token storage in channel_connections.config; WhatsApp manual setup steps |
| PRTA-04 | New tenants are guided through structured onboarding (connect channel, configure agent, test message) | Stepper UI pattern; Next.js App Router multi-step page; test message endpoint |
| PRTA-05 | Operator can manage subscription plans and billing via Stripe integration | Stripe Checkout with per-seat quantity; Billing Portal for self-service; webhook event map; idempotency pattern |
| PRTA-06 | Portal displays agent cost tracking and usage metrics per tenant | SQL aggregate query on audit_events; JSONB path extraction; Recharts for visualization; time-range filtering |
</phase_requirements>

Summary

Phase 3 adds the commercial and operational layer to the Konstruct portal: Slack OAuth, subscription billing, BYO key encryption, and a cost dashboard. All four areas are well-trodden territory with mature libraries — the risks are in integration details, not algorithmic complexity.

The largest architectural gap is in the audit trail: the existing audit_events.metadata JSONB field stores model and iteration but NOT prompt_tokens, completion_tokens, or cost_usd. These fields must be added to the audit logger before the cost dashboard can function. This is a prerequisite for PRTA-06 and AGNT-07 and needs to be Wave 0 work.

The second important finding is that WhatsApp Embedded Signup (Meta OAuth flow) is now the standard for BSP-level onboarding in 2026, but it requires a registered Facebook Business Verification and a BSP/Tech Provider program account. For v1, "guided manual setup" is the correct choice: operators manually create a WhatsApp Business App, get their phone number token, and paste the credentials into the portal. This lets v1 ship without waiting on the multi-week Meta verification process.

Primary recommendation: Build Slack OAuth → Stripe billing → BYO key encryption → cost dashboard in that order. Each is independently deployable. Start with the audit trail metadata migration as Wave 0.


Standard Stack

Core

| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| stripe (Python) | >=12.0.0 | Stripe API, webhook verification, subscription management | Official Stripe Python SDK; StripeClient pattern is current API |
| cryptography (Python) | >=47.0.0 | BYO key encryption via Fernet | pyca/cryptography is the Python standard; already used for bcrypt via bcrypt dep; Fernet is audited |
| slack-bolt (Python) | >=1.22.0 | Slack OAuth installer, Events API | Already in CLAUDE.md tech stack; OAuthFlow handles token exchange |
| stripe (npm) | >=17.0.0 | Stripe.js for frontend Checkout redirect | Official JS client |
| recharts | >=2.15.0 | Cost dashboard charts | 17M weekly downloads vs Nivo's 2M; simpler JSX API; strong shadcn/ui alignment |

Supporting

| Library | Version | Purpose | When to Use |
|---|---|---|---|
| @stripe/stripe-js | >=5.0.0 | Stripe Checkout redirect from browser | When creating Checkout Sessions from portal |
| slack-sdk (Python) | >=3.35.0 | Lower-level Slack Web API calls (post test message) | For the "send test message" verification step |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
|---|---|---|
| Fernet (AES-128-CBC + HMAC) | AES-256-GCM via cryptography.hazmat | AES-256-GCM uses a longer key and authenticates on its own, but requires careful manual nonce management (nonce reuse breaks it); Fernet is audited, has MultiFernet key rotation, and AES-128-CBC + HMAC-SHA256 is sufficient for API key protection |
| Recharts | Nivo | Nivo has more chart types but 8x fewer downloads, worse documentation, and a verbose API; Recharts is recommended for SaaS admin dashboards |
| Stripe Billing Portal (hosted) | Custom billing UI | Custom UI requires full payment method management; Billing Portal handles card updates, invoice history, and cancellation in a Stripe-hosted page, so use it |

Installation:

# Python (add to packages/shared/pyproject.toml)
uv add stripe cryptography

# Node (in packages/portal)
npm install recharts @stripe/stripe-js stripe

Architecture Patterns

packages/
├── shared/
│   └── shared/
│       ├── models/
│       │   └── billing.py          # TenantBilling, TenantLlmKey models
│       └── api/
│           ├── billing.py          # Stripe webhooks + subscription endpoints
│           └── channels.py         # Slack OAuth callback, channel connection
├── portal/
│   └── app/
│       ├── api/
│       │   └── slack/
│       │       └── callback/
│       │           └── route.ts    # Slack OAuth redirect handler
│       └── (dashboard)/
│           ├── onboarding/
│           │   └── page.tsx        # Connect Channel → Configure Agent → Test
│           ├── billing/
│           │   └── page.tsx        # Subscription status + Billing Portal redirect
│           ├── usage/
│           │   └── [tenantId]/
│           │       └── page.tsx    # Cost dashboard per tenant
│           └── settings/
│               └── api-keys/
│                   └── page.tsx    # BYO key management
migrations/
└── versions/
    ├── xxxx_add_billing_fields.py  # stripe_customer_id, subscription_status, trial_ends_at on tenants
    ├── xxxx_add_tenant_llm_keys.py # tenant_llm_keys table
    └── xxxx_add_token_fields.py    # prompt_tokens, completion_tokens, cost_usd, provider on audit_events

Pattern 1: Slack OAuth V2 Flow

What: Operator clicks "Add to Slack" → Slack authorization page → redirect back to portal callback → exchange code for bot token → store in channel_connections

Scopes required (bot):

  • app_mentions:read — receive @mention events
  • channels:read — list public channels
  • channels:history — read channel message history
  • chat:write — post messages (required for test message + agent replies)
  • groups:read — private channels
  • im:read / im:write / im:history — DM support
  • mpim:read / mpim:history — multi-party DMs

OAuth V2 flow:

1. Operator visits /onboarding → clicks "Add to Slack"
2. Portal redirects to:
   https://slack.com/oauth/v2/authorize
     ?client_id=<SLACK_CLIENT_ID>
     &scope=app_mentions:read,channels:read,channels:history,chat:write,groups:read,im:read,im:write,im:history,mpim:read,mpim:history
     &redirect_uri=https://app.konstruct.ai/api/slack/callback
     &state=<csrf_token:tenant_id>

3. User approves → Slack redirects to /api/slack/callback?code=xxx&state=yyy

4. FastAPI backend exchanges code:
   POST https://slack.com/api/oauth.v2.access
     client_id, client_secret, code, redirect_uri

5. Response contains:
   {
     "ok": true,
     "access_token": "xoxb-...",   ← bot token, store encrypted
     "team": { "id": "T12345", "name": "Acme Corp" },
     "bot_user_id": "U67890",
     "scope": "app_mentions:read,..."
   }

6. Store in channel_connections:
   - channel_type: "slack"
   - workspace_id: team.id
   - config: { "bot_token": encrypt(access_token), "bot_user_id": ..., "team_name": ... }
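Steps 5 and 6 can be sketched as a pure mapping from the oauth.v2.access response to the stored record. This is a minimal illustration, not the project's implementation: `build_slack_connection` is a hypothetical helper, the row is shown as a plain dict rather than an ORM model, and `encrypt` stands in for whatever encryption callable the platform uses.

```python
from typing import Any, Callable


def build_slack_connection(
    tenant_id: str,
    oauth_response: dict[str, Any],
    encrypt: Callable[[str], str],
) -> dict[str, Any]:
    """Map an oauth.v2.access response to a channel_connections row (as a dict)."""
    if not oauth_response.get("ok"):
        # Slack reports failures as {"ok": false, "error": "..."}
        error = oauth_response.get("error", "unknown")
        raise ValueError(f"Slack OAuth failed: {error}")
    team = oauth_response["team"]
    return {
        "tenant_id": tenant_id,
        "channel_type": "slack",
        "workspace_id": team["id"],
        "config": {
            # The bot token is encrypted before it is stored
            "bot_token": encrypt(oauth_response["access_token"]),
            "bot_user_id": oauth_response["bot_user_id"],
            "team_name": team["name"],
        },
    }
```

Keeping the mapping free of I/O makes it easy to unit test separately from the HTTP exchange.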

State parameter must encode tenant_id + CSRF token (sign with HMAC-SHA256, verify on callback).

# Source: https://docs.slack.dev/authentication/installing-with-oauth/

# Generate state
import hmac, hashlib, secrets, json, base64

def generate_oauth_state(tenant_id: str, secret: str) -> str:
    nonce = secrets.token_urlsafe(16)
    payload = json.dumps({"tenant_id": tenant_id, "nonce": nonce})
    sig = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(f"{payload}:{sig}".encode()).decode()

def verify_oauth_state(state: str, secret: str) -> str:
    """Returns tenant_id or raises ValueError."""
    decoded = base64.urlsafe_b64decode(state.encode()).decode()
    payload_str, sig = decoded.rsplit(":", 1)
    expected = hmac.new(secret.encode(), payload_str.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("Invalid OAuth state")
    return json.loads(payload_str)["tenant_id"]
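Step 2 of the flow (the redirect to Slack) can be built with the standard library so that the scopes and state are URL-encoded correctly. A minimal sketch: `build_authorize_url` and `BOT_SCOPES` are hypothetical names, and the scope list simply mirrors the bot scopes listed in this pattern.

```python
from urllib.parse import urlencode

SLACK_AUTHORIZE_URL = "https://slack.com/oauth/v2/authorize"

# Bot scopes as listed under "Scopes required (bot)"
BOT_SCOPES = [
    "app_mentions:read", "channels:read", "channels:history", "chat:write",
    "groups:read", "im:read", "im:write", "im:history",
    "mpim:read", "mpim:history",
]


def build_authorize_url(client_id: str, redirect_uri: str, state: str) -> str:
    """Return the URL the portal redirects the operator to."""
    query = urlencode({
        "client_id": client_id,
        "scope": ",".join(BOT_SCOPES),  # Slack expects a comma-separated list
        "redirect_uri": redirect_uri,
        "state": state,  # signed state, e.g. from generate_oauth_state()
    })
    return f"{SLACK_AUTHORIZE_URL}?{query}"
```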

Pattern 2: Stripe Per-Agent Subscription

What: Operator subscribes → Checkout Session created with quantity=agent_count → redirected to Stripe → on success webhook, provision access.

Key objects to persist on Tenant:

  • stripe_customer_id (String) — created once per tenant on first subscription
  • stripe_subscription_id (String | None)
  • stripe_subscription_item_id (String | None) — needed for quantity updates
  • subscription_status (Enum: trialing, active, past_due, canceled, unpaid)
  • trial_ends_at (DateTime | None)
  • agent_quota (Integer) — number of paid seats

Checkout Session creation (Python):

# Source: https://docs.stripe.com/payments/checkout/build-subscriptions

import stripe

client = stripe.StripeClient(api_key=settings.stripe_secret_key)

session = client.v1.checkout.sessions.create({
    "mode": "subscription",
    "customer": tenant.stripe_customer_id,  # or create new
    "line_items": [{
        "price": settings.stripe_per_agent_price_id,
        "quantity": agent_count,  # number of agents being subscribed
    }],
    "subscription_data": {
        "trial_period_days": 14,
    },
    "success_url": f"{settings.portal_url}/billing?session_id={{CHECKOUT_SESSION_ID}}",
    "cancel_url": f"{settings.portal_url}/billing",
})
# Return session.url to frontend for redirect

Quantity update when agents are added/removed:

# Source: https://docs.stripe.com/api/subscription_items/update?lang=python

client.v1.subscription_items.update(
    tenant.stripe_subscription_item_id,
    {"quantity": new_agent_count},
)

Billing Portal session:

# Source: https://docs.stripe.com/customer-management/integrate-customer-portal

portal_session = client.v1.billing_portal.sessions.create({
    "customer": tenant.stripe_customer_id,
    "return_url": f"{settings.portal_url}/billing",
})
# Return portal_session.url to frontend

Pattern 3: Stripe Webhook Handler

Critical webhook events to handle:

| Event | Action |
|---|---|
| checkout.session.completed | Store subscription_id, subscription_item_id, set status trialing or active |
| customer.subscription.created | Same as above if not using Checkout |
| customer.subscription.updated | Update subscription_status, agent_quota, trial_ends_at |
| customer.subscription.deleted | Set status canceled, deactivate all agents |
| customer.subscription.trial_will_end | Send alert email (3 days before trial ends) |
| invoice.paid | Set status active, re-enable agents if they were suspended |
| invoice.payment_failed | Set status past_due, send payment failure notification |
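The status-related rows of the event map can be expressed as a small pure function over the event payload. This is a sketch under assumptions: `next_subscription_status` is a hypothetical helper, events are treated as plain dicts with the usual `type` and `data.object` fields, and side effects (emails, agent deactivation, storing IDs from checkout.session.completed) are deliberately left out.

```python
from typing import Any, Optional


def next_subscription_status(event: dict[str, Any]) -> Optional[str]:
    """Return the subscription_status to persist, or None if the event sets none."""
    event_type = event["type"]
    obj = event["data"]["object"]
    if event_type in ("customer.subscription.created", "customer.subscription.updated"):
        # Trust the status Stripe reports on the subscription object
        return obj["status"]
    if event_type == "customer.subscription.deleted":
        return "canceled"
    if event_type == "invoice.paid":
        return "active"
    if event_type == "invoice.payment_failed":
        return "past_due"
    # e.g. customer.subscription.trial_will_end only triggers a notification
    return None
```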

FastAPI webhook endpoint:

# Source: https://docs.stripe.com/webhooks

from fastapi import APIRouter, Depends, HTTPException, Request
from sqlalchemy.ext.asyncio import AsyncSession
import stripe

webhook_router = APIRouter()

@webhook_router.post("/webhooks/stripe")
async def stripe_webhook(
    request: Request,
    session: AsyncSession = Depends(get_session),
) -> dict[str, str]:
    # Read the raw bytes: the signature is computed over the unparsed body
    payload = await request.body()
    sig_header = request.headers.get("stripe-signature", "")

    try:
        # construct_event verifies the signature and returns a stripe.Event
        event = stripe.Webhook.construct_event(
            payload, sig_header, settings.stripe_webhook_secret
        )
    except ValueError:
        # Body was not valid JSON
        raise HTTPException(status_code=400, detail="Invalid payload")
    except stripe.SignatureVerificationError:
        raise HTTPException(status_code=400, detail="Invalid signature")

    # Idempotency: check if event already processed
    already_processed = await _check_event_processed(session, event["id"])
    if already_processed:
        return {"status": "already_processed"}

    await _record_event_processed(session, event["id"])
    await _dispatch_event(session, event)
    return {"status": "ok"}

Idempotency table: Add a stripe_events table with (event_id PRIMARY KEY, processed_at) — INSERT with ON CONFLICT DO NOTHING; if 0 rows affected, skip processing.
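The insert-and-check-rows pattern can be illustrated with an in-memory SQLite database, which accepts the same ON CONFLICT DO NOTHING clause. This is only a stand-in for the PostgreSQL table: `claim_event` is a hypothetical helper and the column types are simplified.

```python
import sqlite3


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True if this call recorded the event, False if it was already seen."""
    cur = conn.execute(
        "INSERT INTO stripe_events (event_id, processed_at) "
        "VALUES (?, CURRENT_TIMESTAMP) "
        "ON CONFLICT (event_id) DO NOTHING",
        (event_id,),
    )
    conn.commit()
    # 0 rows affected means the primary key already existed: skip processing
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stripe_events (event_id TEXT PRIMARY KEY, processed_at TEXT NOT NULL)"
)
```

Because the insert itself is the check, two concurrent deliveries of the same event cannot both pass, which a separate SELECT followed by an INSERT would not guarantee.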

Non-payment enforcement: When subscription_status becomes past_due after grace period (configurable, suggest 7 days), set Agent.is_active = False for all tenant agents. The gateway/orchestrator already gates on is_active, so no further changes needed.
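The grace-period rule can be sketched as a pure decision function. Assumptions: `should_suspend_agents` is a hypothetical helper, and the tenant record tracks when it entered past_due (a `past_due_since` timestamp that is not among the fields listed in Pattern 2).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE_PERIOD = timedelta(days=7)  # suggested default, meant to be configurable


def should_suspend_agents(
    status: str,
    past_due_since: Optional[datetime],
    now: datetime,
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Decide whether Agent.is_active should be set to False for the tenant."""
    if status == "canceled":
        return True
    if status == "past_due" and past_due_since is not None:
        # Suspend only once the grace period has fully elapsed
        return now - past_due_since >= grace
    return False
```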

Pattern 4: BYO API Key Encryption (Envelope Encryption)

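What: Tenant provides their OpenAI/Anthropic API key. We encrypt it before storing. The platform-level master encryption key is in environment variables (or a secrets manager). As sketched here, each tenant key is encrypted directly under that platform key; a full envelope scheme would add a per-tenant data key wrapped by the master key.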

Important: Fernet uses AES-128-CBC + HMAC-SHA256, NOT AES-256. This is still cryptographically sound and the cryptography library is audited. CLAUDE.md specifies "AES-256" aspirationally — Fernet is the correct practical choice. Document this tradeoff in ADR-005.

Schema — new table tenant_llm_keys:

CREATE TABLE tenant_llm_keys (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    provider TEXT NOT NULL,  -- 'openai' | 'anthropic' | 'custom'
    label TEXT NOT NULL,     -- human-readable name
    encrypted_key TEXT NOT NULL,
    key_version INT NOT NULL DEFAULT 1,  -- for rotation tracking
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(tenant_id, provider)  -- one key per provider per tenant
);
-- RLS enabled: same pattern as agents table

Encryption service:

# Source: https://cryptography.io/en/latest/fernet/

from cryptography.fernet import Fernet, MultiFernet
import os

class KeyEncryptionService:
    """
    Encrypts/decrypts tenant BYO API keys.

    PLATFORM_ENCRYPTION_KEY env var must be a URL-safe base64 Fernet key.
    For rotation: PLATFORM_ENCRYPTION_KEY_PREVIOUS holds the prior key.
    """

    def __init__(self) -> None:
        primary = Fernet(os.environ["PLATFORM_ENCRYPTION_KEY"])
        keys = [primary]
        if prev := os.environ.get("PLATFORM_ENCRYPTION_KEY_PREVIOUS"):
            keys.append(Fernet(prev))
        self._fernet = MultiFernet(keys)

    def encrypt(self, plaintext: str) -> str:
        return self._fernet.encrypt(plaintext.encode()).decode()

    def decrypt(self, ciphertext: str) -> str:
        return self._fernet.decrypt(ciphertext.encode()).decode()

    def rotate(self, ciphertext: str) -> str:
        """Re-encrypt under the current primary key."""
        return self._fernet.rotate(ciphertext.encode()).decode()

Key generation for setup:

python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

LiteLLM integration: When routing LLM calls, check if the tenant has a BYO key for the requested provider. If yes, decrypt and inject into the LiteLLM call. Never log the decrypted key.
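The lookup described above can be sketched as a function that returns extra keyword arguments for the LLM call. Assumptions: `completion_kwargs` is a hypothetical helper, `tenant_keys` maps provider name to the stored ciphertext, and `decrypt` is the platform's decryption callable. LiteLLM's `completion()` accepts an `api_key` argument, so the result can be spread into that call.

```python
from typing import Callable


def completion_kwargs(
    provider: str,
    tenant_keys: dict[str, str],
    decrypt: Callable[[str], str],
) -> dict[str, str]:
    """Extra kwargs for the LLM call; an empty dict means use the platform key."""
    encrypted = tenant_keys.get(provider)
    if encrypted is None:
        return {}
    # Decrypt as late as possible and never log the returned value
    return {"api_key": decrypt(encrypted)}
```

Returning an empty dict for the fallback case keeps the call site identical for BYO and platform-key tenants.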

Pattern 5: Cost Dashboard — Audit Event Aggregation

CRITICAL PREREQUISITE: The current audit logger stores model in metadata but NOT token counts. The runner.py log_llm_call metadata must be extended before the cost dashboard can work.

Required metadata fields to add to log_llm_call:

# In orchestrator/agents/runner.py — extend existing metadata dict:
metadata={
    "model": data.get("model", agent.model_preference),
    "provider": _extract_provider(data.get("model", "")),  # "openai" | "anthropic" | "ollama"
    "prompt_tokens": usage.get("prompt_tokens", 0),
    "completion_tokens": usage.get("completion_tokens", 0),
    "total_tokens": usage.get("total_tokens", 0),
    "cost_usd": _calculate_cost(model, usage),  # pre-calculated, stored as float
    "iteration": iteration,
    "tool_calls_count": len(response_tool_calls),
}
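The metadata above relies on a provider-extraction helper that is not shown. One possible shape is sketched below; `extract_provider` is hypothetical, and the prefix heuristics for bare model names are assumptions rather than a LiteLLM API.

```python
def extract_provider(model: str) -> str:
    """Best-effort provider name from a model string."""
    if "/" in model:
        # "provider/model" style names, e.g. "ollama/llama3"
        return model.split("/", 1)[0]
    if model.startswith(("gpt-", "o1", "o3")):
        return "openai"
    if model.startswith("claude"):
        return "anthropic"
    return "unknown"
```

Storing "unknown" rather than an empty string keeps the GROUP BY on provider readable when a new model naming scheme appears.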

Dashboard aggregation query:

-- Token usage per agent for a time range
SELECT
    agent_id,
    SUM((metadata->>'prompt_tokens')::int)       AS prompt_tokens,
    SUM((metadata->>'completion_tokens')::int)   AS completion_tokens,
    SUM((metadata->>'total_tokens')::int)        AS total_tokens,
    SUM((metadata->>'cost_usd')::float)          AS cost_usd,
    COUNT(*)                                      AS llm_call_count
FROM audit_events
WHERE
    tenant_id = :tenant_id
    AND action_type = 'llm_call'
    AND created_at >= :start_date
    AND created_at < :end_date
GROUP BY agent_id;

-- Cost by provider
SELECT
    metadata->>'provider'                        AS provider,
    SUM((metadata->>'cost_usd')::float)          AS cost_usd,
    COUNT(*)                                      AS call_count
FROM audit_events
WHERE
    tenant_id = :tenant_id
    AND action_type = 'llm_call'
    AND created_at >= :start_date
    AND created_at < :end_date
GROUP BY metadata->>'provider';

-- Message volume by channel
-- NOTE: as written this counts LLM calls, not messages, and it assumes a
-- 'channel' key in metadata, which the log_llm_call metadata above does not include yet
SELECT
    metadata->>'channel'                         AS channel,
    COUNT(*)                                      AS message_count
FROM audit_events
WHERE
    tenant_id = :tenant_id
    AND action_type = 'llm_call'
    AND created_at >= :start_date
    AND created_at < :end_date
GROUP BY metadata->>'channel';

Index required:

CREATE INDEX CONCURRENTLY idx_audit_events_tenant_type_created
    ON audit_events (tenant_id, action_type, created_at DESC);

-- Optional GIN index: speeds up containment filters such as
-- metadata @> '{"provider": "openai"}', not the ->> extraction and SUM used above
CREATE INDEX CONCURRENTLY idx_audit_events_metadata
    ON audit_events USING GIN (metadata);

Time range options (Claude's discretion): Offer Last 7 days / Last 30 days / This month / Custom range. Default to Last 30 days. Use a simple <select> driving a query param — no date picker library needed for v1.
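The preset options can be resolved to a half-open date range that matches the `created_at >= :start_date AND created_at < :end_date` filters. A minimal sketch: `resolve_range` and the option keys ("7d", "30d", "month") are hypothetical, and custom ranges are assumed to arrive as explicit dates instead.

```python
from datetime import date, timedelta


def resolve_range(option: str, today: date) -> tuple[date, date]:
    """Return (start, end) where end is exclusive and includes today."""
    end = today + timedelta(days=1)
    if option == "7d":
        return end - timedelta(days=7), end
    if option == "30d":
        return end - timedelta(days=30), end
    if option == "month":
        # "This month": from the first day of the current month
        return today.replace(day=1), end
    raise ValueError(f"Unknown time range option: {option}")
```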

Budget alert logic: Compare SUM(cost_usd) against Agent.budget_limit_usd (new field). Visual indicator: amber at 80%, red at 100%. Render as colored badge on the cost dashboard row, not a modal.
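The threshold rule can be sketched as a small function that the dashboard row calls to pick a badge color. `budget_alert_level` and the "ok"/"none" labels are hypothetical; the 80% and 100% thresholds come from the text above, and a NULL limit is treated as "no limit".

```python
from typing import Optional


def budget_alert_level(spent_usd: float, budget_limit_usd: Optional[float]) -> str:
    """Return "none", "ok", "amber", or "red" for a single agent."""
    if budget_limit_usd is None or budget_limit_usd <= 0:
        return "none"  # no limit configured, so no badge
    ratio = spent_usd / budget_limit_usd
    if ratio >= 1.0:
        return "red"
    if ratio >= 0.8:
        return "amber"
    return "ok"
```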

Pattern 6: WhatsApp Manual Setup (Claude's Discretion Recommendation)

Embedded Signup requires a registered BSP/Meta Tech Provider account — multi-week verification process. For v1, use guided manual setup:

  1. Operator creates a Meta/Facebook developer account and a WhatsApp Business App
  2. Portal shows step-by-step instructions with screenshots
  3. Operator pastes: Phone Number ID, WhatsApp Business Account ID, permanent System User Token
  4. Portal validates by calling GET https://graph.facebook.com/v22.0/{phone_number_id} with the token
  5. If valid, store in channel_connections.config (token encrypted with Fernet)
  6. Test message: send "Konstruct connected successfully" to operator's own WhatsApp number
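Step 4 (credential validation) can be sketched by building the Graph API request with the standard library. Assumptions: `build_validation_request` is a hypothetical helper, the token is sent as a Bearer header, and Phone Number IDs are treated as purely numeric. The request is only constructed here; the caller would send it and treat a 200 response as valid.

```python
from urllib.request import Request

GRAPH_API_BASE = "https://graph.facebook.com/v22.0"


def build_validation_request(phone_number_id: str, token: str) -> Request:
    """Build the GET request used to check the pasted credentials."""
    if not phone_number_id.isdigit():
        # Reject obviously malformed IDs before calling the API (assumed format)
        raise ValueError("Phone Number ID must be numeric")
    return Request(
        f"{GRAPH_API_BASE}/{phone_number_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```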

Anti-Patterns to Avoid

  • Do not verify Stripe webhooks manually — always use stripe.Webhook.construct_event() with the endpoint secret; raw header parsing is error-prone
  • Do not store Slack bot tokens in plaintext — encrypt with Fernet before writing to channel_connections.config
  • Do not update subscription quantity synchronously — Stripe rate-limits if updated many times per hour; queue via Celery if high-frequency
  • Do not query audit_events without the composite index: a full table scan on audit_events will be slow, so the index on (tenant_id, action_type, created_at) is mandatory
  • Do not use func.now() for trial_ends_at calculation — set it from the Stripe webhook response subscription.trial_end (Unix timestamp), not from local time
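The last point can be sketched as a conversion from Stripe's Unix timestamp to a timezone-aware datetime. `trial_ends_at_from_stripe` is a hypothetical helper; `trial_end` is the integer field on the subscription object and may be absent when there is no trial.

```python
from datetime import datetime, timezone
from typing import Optional


def trial_ends_at_from_stripe(trial_end: Optional[int]) -> Optional[datetime]:
    """Convert subscription.trial_end (Unix seconds) to an aware UTC datetime."""
    if trial_end is None:
        return None
    # Always pass tz explicitly so the result does not depend on server locale
    return datetime.fromtimestamp(trial_end, tz=timezone.utc)
```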

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| OAuth state CSRF protection | Custom state encoding | HMAC-signed state (shown above) | Replay attacks, timing attacks |
| Stripe signature verification | Manual HMAC check on raw bytes | stripe.Webhook.construct_event() | Handles timestamp tolerance, replay prevention |
| Subscription lifecycle state machine | Custom FSM | Stripe subscription status field + webhooks | Stripe handles payment retries, dunning, proration |
| API key encryption | Custom AES wrapper | cryptography.fernet.Fernet + MultiFernet | Audited, handles IV generation, MAC, key rotation |
| Billing UI (card updates, invoices) | Custom payment form | Stripe Billing Portal | PCI scope, card updating, invoice history: all free |
| Token cost calculation | Per-request cost estimation | Pre-calculate at log time using LiteLLM's completion_cost() | LiteLLM already tracks pricing per model; reuse it |

Key insight: Stripe and Slack provide hosted/SDK flows for the most security-sensitive parts. Never replicate what they already do correctly.


Common Pitfalls

Pitfall 1: Slack OAuth state Not Validated

What goes wrong: Attacker crafts a Slack OAuth callback with a valid code but forged state, linking their Slack workspace to a victim's tenant.
Why it happens: Treating OAuth state as opaque and skipping verification.
How to avoid: Always HMAC-sign the state before sending; verify the signature AND the tenant_id before exchanging the code.
Warning signs: No state validation in the callback handler.

Pitfall 2: Stripe Webhook Raw Body Mangling

What goes wrong: Signature verification fails in production because a middleware (e.g., JSON parser) modifies the request body before the webhook handler reads it.
Why it happens: FastAPI's Request.json() parses the body; Stripe signatures are computed over the raw bytes.
How to avoid: Always read with await request.body() (raw bytes), not await request.json().
Warning signs: stripe.SignatureVerificationError in production but not in local testing.

Pitfall 3: Duplicate Webhook Processing

What goes wrong: Stripe delivers the same event twice (network retry); agent gets provisioned twice, subscription updated twice.
Why it happens: No idempotency guard on the webhook handler.
How to avoid: Store processed event.id in a stripe_events table; use INSERT ... ON CONFLICT DO NOTHING and check rows affected.
Warning signs: Duplicate channel_connections rows or double-charged agents.

Pitfall 4: Fernet Key Not in Environment

What goes wrong: Application starts without PLATFORM_ENCRYPTION_KEY set; first BYO key encryption call throws KeyError.
Why it happens: Key not added to .env.example / Docker Compose environment.
How to avoid: Validate key presence at startup in shared/config.py using pydantic-settings required fields; fail fast with a clear error.
Warning signs: KeyError: PLATFORM_ENCRYPTION_KEY in logs.
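A fail-fast check can also verify the key's shape, not just its presence. The sketch below uses only the standard library as a stand-in for a pydantic-settings validator; `validate_fernet_key` is a hypothetical helper, and it relies on the documented Fernet key format (32 bytes, URL-safe base64 encoded).

```python
import base64
import binascii
from typing import Optional


def validate_fernet_key(value: Optional[str]) -> bytes:
    """Return the key as bytes, or raise RuntimeError with a clear message."""
    if not value:
        raise RuntimeError("PLATFORM_ENCRYPTION_KEY is not set")
    try:
        raw = base64.urlsafe_b64decode(value)
    except (binascii.Error, ValueError) as exc:
        raise RuntimeError(
            "PLATFORM_ENCRYPTION_KEY is not valid URL-safe base64"
        ) from exc
    if len(raw) != 32:
        raise RuntimeError("PLATFORM_ENCRYPTION_KEY must decode to 32 bytes")
    return value.encode()
```

Calling this once at startup, for example with `os.environ.get("PLATFORM_ENCRYPTION_KEY")`, surfaces a misconfigured key before any tenant tries to save one.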

Pitfall 5: Audit Events Missing Token Metadata

What goes wrong: Cost dashboard shows zeros because prompt_tokens / cost_usd were never written to audit_events.metadata.
Why it happens: Runner logs the LLM call but doesn't extract token counts from the LiteLLM response object.
How to avoid: Extend runner.py metadata dict BEFORE Phase 3 dashboard work begins; backfill is impossible (audit log is append-only).
Warning signs: metadata->>'prompt_tokens' returns NULL in dashboard queries.

Pitfall 6: Subscription Quantity Mismatch

What goes wrong: Operator creates 3 agents in the portal but Stripe still charges for 1.
Why it happens: Agent creation doesn't trigger a subscription quantity update.
How to avoid: On POST /agents and DELETE /agents, update Stripe subscription item quantity. Use a Celery task to avoid blocking the API response.
Warning signs: stripe_subscription_item_id is NULL; no Celery task defined for quantity sync.


Code Examples

Verified patterns from official sources:

Stripe Subscription Creation with Trial

# Source: https://docs.stripe.com/api/subscriptions/create?lang=python

client = stripe.StripeClient(api_key=settings.stripe_secret_key)

subscription = client.v1.subscriptions.create({
    "customer": customer_id,
    "trial_period_days": 14,
    "items": [{"price": settings.stripe_per_agent_price_id, "quantity": agent_count}],
})
# subscription.id → store as stripe_subscription_id
# subscription["items"].data[0].id → store as stripe_subscription_item_id
# (use key access: subscription.items resolves to the dict .items() method)

Stripe Subscription Quantity Update

# Source: https://docs.stripe.com/api/subscription_items/update?lang=python

client.v1.subscription_items.update(
    tenant.stripe_subscription_item_id,
    {"quantity": new_agent_count},
)

Fernet Encrypt/Decrypt API Key

# Source: https://cryptography.io/en/latest/fernet/

from cryptography.fernet import Fernet
key = Fernet.generate_key()  # run once, store in env
f = Fernet(key)
ciphertext = f.encrypt(b"sk-openai-key...").decode()
plaintext = f.decrypt(ciphertext.encode()).decode()

MultiFernet Key Rotation

# Source: https://cryptography.io/en/latest/fernet/

from cryptography.fernet import Fernet, MultiFernet

# Step 1: Add new key to front, keep old key
new_fernet = MultiFernet([Fernet(new_key), Fernet(old_key)])

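# Step 2: Rotate all existing ciphertexts in DB (run inside an async function)
rows = (await session.scalars(select(TenantLlmKey))).all()
for row in rows:
    # rotate() decrypts with any listed key and re-encrypts under the first one
    row.encrypted_key = new_fernet.rotate(row.encrypted_key.encode()).decode()
    row.key_version += 1
await session.commit()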

# Step 3: Remove old key from env, restart

Recharts Bar Chart for Token Usage

// Source: recharts.org/api-reference
// Install: npm install recharts

import { BarChart, Bar, XAxis, YAxis, CartesianGrid, Tooltip, ResponsiveContainer } from "recharts";

<ResponsiveContainer width="100%" height={300}>
  <BarChart data={agentUsageData}>
    <CartesianGrid strokeDasharray="3 3" />
    <XAxis dataKey="agentName" />
    <YAxis />
    <Tooltip formatter={(value) => [`${value} tokens`, "Usage"]} />
    <Bar dataKey="total_tokens" fill="#6366f1" radius={[4, 4, 0, 0]} />
  </BarChart>
</ResponsiveContainer>

SQLAlchemy Async Aggregate Query for Cost

# Source: SQLAlchemy 2.0 docs + project pattern

from sqlalchemy import Float, Integer, cast, func, select
from shared.models.audit import AuditEvent

# NOTE: "metadata" is a reserved attribute name on SQLAlchemy declarative models;
# if the JSONB column is mapped under another attribute (e.g. metadata_), use that here.
result = await session.execute(
    select(
        AuditEvent.agent_id,
        func.sum(
            cast(AuditEvent.metadata["total_tokens"].astext, Integer)
        ).label("total_tokens"),
        func.sum(
            cast(AuditEvent.metadata["cost_usd"].astext, Float)
        ).label("cost_usd"),
        func.count().label("call_count"),
    )
    .where(
        AuditEvent.tenant_id == tenant_id,
        AuditEvent.action_type == "llm_call",
        AuditEvent.created_at >= start_date,
        AuditEvent.created_at < end_date,
    )
    .group_by(AuditEvent.agent_id)
)

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| middleware.ts for auth guards | proxy.ts (named export proxy) | Next.js 16 | Already implemented in project; do not create middleware.ts |
| stripe.Webhook.constructEvent() (old SDK) | client.parse_event_notification() or stripe.Webhook.construct_event() | stripe-python v12+ | Both work; use construct_event for consistency with FastAPI raw body pattern |
| On-Premises WhatsApp API | Cloud API only | Oct 2025 (deprecated) | Must use Cloud API; On-Premises is unsupported |
| Fernet single key | MultiFernet for key rotation | Always available | Use MultiFernet from day one; single key is not rotatable without downtime |

Deprecated/outdated:

  • middleware.ts in Next.js 16: renamed to proxy.ts with named export proxy — already handled
  • WhatsApp On-Premises API: deprecated Oct 2025, Cloud API only
  • stripe.Customer.create() old-style: use StripeClient.v1.customers.create() with new client pattern

Open Questions

  1. LiteLLM completion_cost() availability

    • What we know: LiteLLM has a completion_cost() utility that calculates cost from model name + token counts
    • What's unclear: Whether llm-pool service currently surfaces this in its response to the orchestrator, or whether the orchestrator calls LiteLLM directly
    • Recommendation: Inspect packages/llm-pool to confirm response schema includes usage object with token counts; if not, add it as Wave 0
  2. Stripe Customer creation timing

    • What we know: stripe_customer_id needs to exist before Checkout Session creation
    • What's unclear: Whether to create the Stripe Customer at tenant creation (Phase 1 migration needed) or lazily on first billing action
    • Recommendation: Create lazily on first billing action — avoids creating Stripe customers for test/internal tenants
  3. Slack OAuth callback — Next.js Route Handler vs FastAPI

    • What we know: The OAuth redirect_uri must be a URL Konstruct controls; both portal (Next.js) and backend (FastAPI) can handle it
    • What's unclear: Which service is public-facing for the OAuth callback
    • Recommendation: Handle in Next.js Route Handler (app/api/slack/callback/route.ts) which calls the FastAPI backend to store the token; cleaner separation and avoids CORS complications
  4. Budget limit storage location

    • What we know: AGNT-07 requires per-agent budget limits; CONTEXT.md shows budget as visual alert
    • What's unclear: Whether budget limit is per-agent on the agents table or per-tenant on the tenants table
    • Recommendation: Add budget_limit_usd to the Agent model (per-agent is more flexible); default NULL means no limit

Validation Architecture

Test Framework

| Property | Value |
|---|---|
| Framework | pytest 8.3.0 + pytest-asyncio 0.25.0 |
| Config file | pyproject.toml [tool.pytest.ini_options] |
| Quick run command | pytest tests/unit -x |
| Full suite command | pytest tests/unit tests/integration -x |

Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| AGNT-07 | Token usage aggregation query returns correct totals | unit | pytest tests/unit/test_usage_aggregation.py -x | Wave 0 |
| AGNT-07 | Budget alert threshold triggers at 80% and 100% | unit | pytest tests/unit/test_budget_alerts.py -x | Wave 0 |
| LLM-03 | Fernet encrypt/decrypt round-trip preserves plaintext | unit | pytest tests/unit/test_key_encryption.py -x | Wave 0 |
| LLM-03 | MultiFernet key rotation re-encrypts without data loss | unit | pytest tests/unit/test_key_encryption.py::test_rotation -x | Wave 0 |
| PRTA-03 | Slack OAuth state HMAC generation + verification | unit | pytest tests/unit/test_slack_oauth.py -x | Wave 0 |
| PRTA-03 | Slack OAuth callback stores channel_connection correctly | integration | pytest tests/integration/test_slack_oauth.py -x | Wave 0 |
| PRTA-04 | Onboarding stepper transitions through all 3 steps | manual | manual-only (UI flow) | |
| PRTA-04 | Test message send endpoint returns 200 with valid token | integration | pytest tests/integration/test_channel_test_message.py -x | Wave 0 |
| PRTA-05 | Stripe webhook handler ignores duplicate event IDs | unit | pytest tests/unit/test_stripe_webhooks.py::test_idempotency -x | Wave 0 |
| PRTA-05 | Subscription status updates on customer.subscription.updated | unit | pytest tests/unit/test_stripe_webhooks.py::test_subscription_updated -x | Wave 0 |
| PRTA-05 | Agent deactivation on subscription cancellation | unit | pytest tests/unit/test_stripe_webhooks.py::test_cancellation -x | Wave 0 |
| PRTA-06 | Cost aggregation query groups tokens by agent_id | unit | pytest tests/unit/test_usage_aggregation.py::test_group_by_agent -x | Wave 0 |
| PRTA-06 | Cost aggregation query groups cost by provider | unit | pytest tests/unit/test_usage_aggregation.py::test_group_by_provider -x | Wave 0 |

Sampling Rate

  • Per task commit: pytest tests/unit -x
  • Per wave merge: pytest tests/unit tests/integration -x
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

  • tests/unit/test_key_encryption.py — covers LLM-03 (Fernet encrypt/decrypt/rotate)
  • tests/unit/test_slack_oauth.py — covers PRTA-03 (state HMAC generation/verification)
  • tests/unit/test_stripe_webhooks.py — covers PRTA-05 (idempotency, status updates, cancellation)
  • tests/unit/test_usage_aggregation.py — covers AGNT-07, PRTA-06 (SQL aggregate queries)
  • tests/unit/test_budget_alerts.py — covers AGNT-07 (threshold logic)
  • tests/integration/test_slack_oauth.py — covers PRTA-03 (full callback flow)
  • tests/integration/test_channel_test_message.py — covers PRTA-04 (test message endpoint)
  • packages/shared/shared/models/billing.py — TenantBilling fields migration model
  • packages/shared/shared/models/billing.py — TenantLlmKey model
  • Alembic migration: audit_events metadata fields (prompt_tokens, completion_tokens, cost_usd, provider)
  • Alembic migration: tenant_llm_keys table
  • Alembic migration: tenants billing fields (stripe_customer_id, stripe_subscription_id, stripe_subscription_item_id, subscription_status, trial_ends_at, agent_quota)
  • Alembic migration: agents.budget_limit_usd field
  • Alembic migration: stripe_events idempotency table
  • npm install: recharts @stripe/stripe-js stripe in packages/portal
  • uv add: stripe cryptography in packages/shared

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

  • WhatsApp Embedded Signup standard status (2026) — multiple secondary sources agree; not directly verified against Meta developer docs within this research session

Metadata

Confidence breakdown:

  • Standard stack: HIGH — all libraries verified against official docs or project package.json
  • Slack OAuth flow: HIGH — verified against official Slack developer docs
  • Stripe billing: HIGH — verified against official Stripe API reference and docs
  • BYO key encryption: HIGH — verified against official cryptography.io docs; note Fernet is AES-128 not AES-256
  • Cost aggregation: MEDIUM — SQL pattern is standard PostgreSQL JSONB; exact SQLAlchemy ORM casting syntax needs validation against project's asyncpg driver
  • Recharts: MEDIUM — download stats verified via npmtrends; API verified against recharts.org
  • WhatsApp manual setup: MEDIUM — On-Premises deprecation confirmed; manual setup steps derived from Meta developer docs indirectly

Research date: 2026-03-23
Valid until: 2026-04-23 (Stripe and Slack APIs are stable; Next.js 16 is current)