# Phase 3: Operator Experience - Research

**Researched:** 2026-03-23
**Domain:** Slack OAuth V2, Stripe Subscriptions, BYO API Key Encryption, Cost Dashboard
**Confidence:** HIGH (core stack verified against official docs)

---

<user_constraints>
## User Constraints (from CONTEXT.md)

### Locked Decisions
- Slack connection via standard OAuth2 "Add to Slack" flow — operator clicks button, authorizes, tokens stored automatically
- WhatsApp connection: guided manual setup (Claude's discretion confirmed)
- After connecting a channel, the wizard MUST include a "send test message" step — required, not optional
- Test message verifies end-to-end connectivity before the agent goes live
- Onboarding sequence: Connect Channel → Configure Agent → Send Test Message
- Agent goes live automatically after the test message succeeds — no separate "Go Live" button
- Pricing model: per-agent monthly (e.g., $49/agent/month)
- 14-day free trial with full access, credit card required upfront
- Subscription management via Stripe: subscribe, upgrade (add agents), downgrade (remove agents), cancel
- LLM-03 resolved: BYO API keys IS in v1 scope (Phase 3)
- Cost metrics: token usage per agent, cost breakdown by LLM provider, message volume per agent/channel, budget alerts
- Budget alerts: visual indicator when approaching or exceeding per-agent budget limits (from AGNT-07)

### Claude's Discretion
- WhatsApp connection method (guided manual vs embedded signup)
- Stepper UI for onboarding (yes/no, visual style)
- Non-payment enforcement behavior
- BYO key scope (tenant-level settings page vs per-agent)
- Cost dashboard time range options
- Dashboard chart library (recharts, nivo, etc.)
- Stripe webhook event handling strategy (idempotency, retry)

### Deferred Ideas (OUT OF SCOPE)
None — discussion stayed within phase scope
</user_constraints>

---

<phase_requirements>
## Phase Requirements

| ID | Description | Research Support |
|----|-------------|-----------------|
| AGNT-07 | Agent token usage tracked per-agent per-tenant with configurable budget limits | Audit event JSONB metadata must store `prompt_tokens`, `completion_tokens`, `provider`; per-agent budget limit (`budget_limit_usd`, proposed on the Agent model; see Open Questions); alert threshold query pattern documented |
| LLM-03 | Tenant can provide their own API keys for supported LLM providers (BYO keys, encrypted at rest) | Fernet AES-128-CBC with HMAC-SHA256; envelope encryption pattern; new `tenant_llm_keys` table; LiteLLM routing integration |
| PRTA-03 | Operator can connect messaging channels (Slack, WhatsApp) via guided wizard | Slack OAuth V2 flow; required scopes; token storage in `channel_connections.config`; WhatsApp manual setup steps |
| PRTA-04 | New tenants are guided through structured onboarding (connect channel, configure agent, test message) | Stepper UI pattern; Next.js App Router multi-step page; test message endpoint |
| PRTA-05 | Operator can manage subscription plans and billing via Stripe integration | Stripe Checkout with per-seat quantity; Billing Portal for self-service; webhook event map; idempotency pattern |
| PRTA-06 | Portal displays agent cost tracking and usage metrics per tenant | SQL aggregate query on audit_events; JSONB path extraction; Recharts for visualization; time-range filtering |
</phase_requirements>

---

## Summary

Phase 3 adds the commercial and operational layer to the Konstruct portal: Slack OAuth, subscription billing, BYO key encryption, and a cost dashboard. All four areas are well-trodden territory with mature libraries — the risks are in integration details, not algorithmic complexity.

The largest architectural gap is in the audit trail: the existing `audit_events.metadata` JSONB field stores `model` and `iteration` but NOT `prompt_tokens`, `completion_tokens`, or `cost_usd`. These fields must be added to the audit logger before the cost dashboard can function. This is a prerequisite for PRTA-06 and AGNT-07 and needs to be Wave 0 work.

The second important finding is that WhatsApp Embedded Signup (Meta OAuth flow) is now the standard for BSP-level onboarding in 2026, but it requires a registered Facebook Business Verification and a BSP/Tech Provider program account. For v1, "guided manual setup" is the correct choice: operators manually create a WhatsApp Business App, obtain their phone number token, and paste the credentials into the portal. This lets v1 ship without waiting on the multi-week Meta verification process.

**Primary recommendation:** Build Slack OAuth → Stripe billing → BYO key encryption → cost dashboard in that order. Each is independently deployable. Start with the audit trail metadata migration as Wave 0.

---

## Standard Stack

### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `stripe` (Python) | `>=12.0.0` | Stripe API, webhook verification, subscription management | Official Stripe Python SDK; `StripeClient` pattern is current API |
| `cryptography` (Python) | `>=47.0.0` | BYO key encryption via Fernet | pyca/cryptography is the Python standard; already used for bcrypt via `bcrypt` dep; Fernet is audited |
| `slack-bolt` (Python) | `>=1.22.0` | Slack OAuth installer, Events API | Already in CLAUDE.md tech stack; `OAuthFlow` handles token exchange |
| `stripe` (npm) | `>=17.0.0` | Server-side Stripe API calls from Node (for example, Next.js Route Handlers) | Official Node SDK; the browser-side Stripe.js loader is `@stripe/stripe-js` (see Supporting) |
| `recharts` | `>=2.15.0` | Cost dashboard charts | 17M weekly downloads vs Nivo's 2M; simpler JSX API; strong shadcn/ui alignment |

### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| `@stripe/stripe-js` | `>=5.0.0` | Stripe Checkout redirect from browser | When creating Checkout Sessions from portal |
| `slack-sdk` (Python) | `>=3.35.0` | Lower-level Slack Web API calls (post test message) | For the "send test message" verification step |

### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| Fernet (AES-128-CBC + HMAC) | AES-256-GCM via `cryptography.hazmat` | AES-256-GCM uses a larger key and has built-in authentication, but it requires manual nonce management (a reused nonce breaks GCM); Fernet is audited, has MultiFernet key rotation, and AES-128-CBC + HMAC-SHA256 is sufficient for API key protection |
| Recharts | Nivo | Nivo has more chart types but 8x fewer downloads, worse documentation, and verbose API; Recharts is recommended for SaaS admin dashboards |
| Stripe Billing Portal (hosted) | Custom billing UI | Custom UI requires full payment method management; Billing Portal handles card updates, invoice history, cancellation in a Stripe-hosted page — use it |

**Installation:**
```bash
# Python (add to packages/shared/pyproject.toml)
uv add stripe cryptography

# Node (in packages/portal)
npm install recharts @stripe/stripe-js stripe
```

---

## Architecture Patterns

### Recommended Project Structure (new files only)

```
packages/
├── shared/
│   └── shared/
│       ├── models/
│       │   └── billing.py          # TenantBilling, TenantLlmKey models
│       └── api/
│           ├── billing.py          # Stripe webhooks + subscription endpoints
│           └── channels.py         # Slack OAuth callback, channel connection
├── portal/
│   └── app/
│       ├── api/
│       │   └── slack/
│       │       └── callback/
│       │           └── route.ts    # Slack OAuth redirect handler
│       └── (dashboard)/
│           ├── onboarding/
│           │   └── page.tsx        # Connect Channel → Configure Agent → Test
│           ├── billing/
│           │   └── page.tsx        # Subscription status + Billing Portal redirect
│           ├── usage/
│           │   └── [tenantId]/
│           │       └── page.tsx    # Cost dashboard per tenant
│           └── settings/
│               └── api-keys/
│                   └── page.tsx    # BYO key management
migrations/
└── versions/
    ├── xxxx_add_billing_fields.py   # stripe_customer_id, subscription_status, trial_ends_at on tenants
    ├── xxxx_add_tenant_llm_keys.py  # tenant_llm_keys table
    └── xxxx_add_token_fields.py     # prompt_tokens, completion_tokens, cost_usd, provider on audit_events
```

### Pattern 1: Slack OAuth V2 Flow

**What:** Operator clicks "Add to Slack" → Slack authorization page → redirect back to portal callback → exchange code for bot token → store in `channel_connections`

**Scopes required (bot):**
- `app_mentions:read` — receive @mention events
- `channels:read` — list public channels
- `channels:history` — read channel message history
- `chat:write` — post messages (required for test message + agent replies)
- `groups:read` — private channels
- `im:read` / `im:write` / `im:history` — DM support
- `mpim:read` / `mpim:history` — multi-party DMs

**OAuth V2 flow:**

```
1. Operator visits /onboarding → clicks "Add to Slack"
2. Portal redirects to:
   https://slack.com/oauth/v2/authorize
     ?client_id=<SLACK_CLIENT_ID>
     &scope=app_mentions:read,channels:read,channels:history,chat:write,groups:read,im:read,im:write,im:history,mpim:read,mpim:history
     &redirect_uri=https://app.konstruct.ai/api/slack/callback
     &state=<csrf_token:tenant_id>

3. User approves → Slack redirects to /api/slack/callback?code=xxx&state=yyy

4. FastAPI backend exchanges code:
   POST https://slack.com/api/oauth.v2.access
   client_id, client_secret, code, redirect_uri

5. Response contains:
   {
     "ok": true,
     "access_token": "xoxb-...",   ← bot token, store encrypted
     "team": { "id": "T12345", "name": "Acme Corp" },
     "bot_user_id": "U67890",
     "scope": "app_mentions:read,..."
   }

6. Store in channel_connections:
   - channel_type: "slack"
   - workspace_id: team.id
   - config: { "bot_token": encrypt(access_token), "bot_user_id": ..., "team_name": ... }
```
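
Step 2 of the flow can be sketched as a small URL builder. This is a minimal sketch, not the project's implementation: `build_authorize_url` and `BOT_SCOPES` are hypothetical names, and the scope list simply mirrors the bot scopes listed above.

```python
from urllib.parse import urlencode

SLACK_AUTHORIZE_URL = "https://slack.com/oauth/v2/authorize"

# Bot scopes copied from the "Scopes required (bot)" list.
BOT_SCOPES = [
    "app_mentions:read", "channels:read", "channels:history", "chat:write",
    "groups:read", "im:read", "im:write", "im:history",
    "mpim:read", "mpim:history",
]


def build_authorize_url(client_id: str, redirect_uri: str, state: str) -> str:
    """Return the URL the portal redirects the operator to (hypothetical helper)."""
    query = urlencode({
        "client_id": client_id,
        "scope": ",".join(BOT_SCOPES),  # Slack expects a comma-separated scope list
        "redirect_uri": redirect_uri,
        "state": state,                 # signed state carrying the tenant_id
    })
    return f"{SLACK_AUTHORIZE_URL}?{query}"
```

Using `urlencode` rather than string concatenation keeps the `redirect_uri` and `state` values correctly percent-encoded.
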

**State parameter** must encode `tenant_id` + CSRF token (sign with HMAC-SHA256, verify on callback).

```python
# Source: https://docs.slack.dev/authentication/installing-with-oauth/

# Generate state
import base64
import hashlib
import hmac
import json
import secrets


def generate_oauth_state(tenant_id: str, secret: str) -> str:
    """Return a URL-safe state string: base64("<json payload>:<hex signature>")."""
    nonce = secrets.token_urlsafe(16)
    payload = json.dumps({"tenant_id": tenant_id, "nonce": nonce})
    sig = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(f"{payload}:{sig}".encode()).decode()


def verify_oauth_state(state: str, secret: str) -> str:
    """Returns tenant_id or raises ValueError."""
    decoded = base64.urlsafe_b64decode(state.encode()).decode()
    # The hex signature contains no ":", so splitting on the last ":" is safe
    # even though the JSON payload itself contains colons.
    payload_str, sig = decoded.rsplit(":", 1)
    expected = hmac.new(secret.encode(), payload_str.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("Invalid OAuth state")
    return json.loads(payload_str)["tenant_id"]
```

### Pattern 2: Stripe Per-Agent Subscription

**What:** Operator subscribes → Checkout Session created with quantity=agent_count → redirected to Stripe → on success webhook, provision access.

**Key objects to persist on Tenant:**
- `stripe_customer_id` (String) — created once per tenant on first subscription
- `stripe_subscription_id` (String | None)
- `stripe_subscription_item_id` (String | None) — needed for quantity updates
- `subscription_status` (Enum: `trialing`, `active`, `past_due`, `canceled`, `unpaid`)
- `trial_ends_at` (DateTime | None)
- `agent_quota` (Integer) — number of paid seats

**Checkout Session creation (Python):**
```python
# Source: https://docs.stripe.com/payments/checkout/build-subscriptions

import stripe

client = stripe.StripeClient(api_key=settings.stripe_secret_key)

session = client.v1.checkout.sessions.create({
    "mode": "subscription",
    "customer": tenant.stripe_customer_id,  # or create new
    "line_items": [{
        "price": settings.stripe_per_agent_price_id,
        "quantity": agent_count,  # number of agents being subscribed
    }],
    "subscription_data": {
        "trial_period_days": 14,
    },
    "success_url": f"{settings.portal_url}/billing?session_id={{CHECKOUT_SESSION_ID}}",
    "cancel_url": f"{settings.portal_url}/billing",
})
# Return session.url to frontend for redirect
```

**Quantity update when agents are added/removed:**
```python
# Source: https://docs.stripe.com/api/subscription_items/update?lang=python

client.v1.subscription_items.update(
    tenant.stripe_subscription_item_id,
    {"quantity": new_agent_count},
)
```

**Billing Portal session:**
```python
# Source: https://docs.stripe.com/customer-management/integrate-customer-portal

portal_session = client.v1.billing_portal.sessions.create({
    "customer": tenant.stripe_customer_id,
    "return_url": f"{settings.portal_url}/billing",
})
# Return portal_session.url to frontend
```

### Pattern 3: Stripe Webhook Handler

**Critical webhook events to handle:**

| Event | Action |
|-------|--------|
| `checkout.session.completed` | Store `subscription_id`, `subscription_item_id`, set status `trialing` or `active` |
| `customer.subscription.created` | Same as above if not using Checkout |
| `customer.subscription.updated` | Update `subscription_status`, `agent_quota`, `trial_ends_at` |
| `customer.subscription.deleted` | Set status `canceled`, deactivate all agents |
| `customer.subscription.trial_will_end` | Send alert email (3 days before trial ends) |
| `invoice.paid` | Set status `active`, re-enable agents if they were suspended |
| `invoice.payment_failed` | Set status `past_due`, send payment failure notification |
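
One way to turn the event table into code is a lookup from event type to handler. The sketch below is an assumption about shape only: the handler names are hypothetical, each handler just returns the status it would set, and the database session used by the real handlers is left out to keep the example self-contained.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any

Handler = Callable[[dict[str, Any]], Awaitable[str]]


async def on_subscription_deleted(event: dict[str, Any]) -> str:
    return "canceled"   # the real handler would also deactivate the tenant's agents


async def on_invoice_paid(event: dict[str, Any]) -> str:
    return "active"     # the real handler would also re-enable suspended agents


async def on_payment_failed(event: dict[str, Any]) -> str:
    return "past_due"   # the real handler would also send a notification


# Hypothetical registry: one entry per row of the event table.
HANDLERS: dict[str, Handler] = {
    "customer.subscription.deleted": on_subscription_deleted,
    "invoice.paid": on_invoice_paid,
    "invoice.payment_failed": on_payment_failed,
}


async def dispatch_event(event: dict[str, Any]) -> str:
    """Route an event to its handler; unknown event types are ignored."""
    handler = HANDLERS.get(event["type"])
    if handler is None:
        return "ignored"
    return await handler(event)
```

Returning `"ignored"` for unlisted types means the endpoint can still answer 200 for events the portal does not care about.
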

**FastAPI webhook endpoint:**
```python
# Source: https://docs.stripe.com/webhooks

import stripe
from fastapi import APIRouter, Depends, HTTPException, Request
from sqlalchemy.ext.asyncio import AsyncSession

# `settings`, `get_session`, `_check_event_processed`, `_record_event_processed`,
# and `_dispatch_event` are project code defined elsewhere.

webhook_router = APIRouter()


@webhook_router.post("/webhooks/stripe")
async def stripe_webhook(
    request: Request,
    session: AsyncSession = Depends(get_session),
) -> dict[str, str]:
    # Read the raw bytes: the signature is computed over the unparsed body.
    payload = await request.body()
    sig_header = request.headers.get("stripe-signature", "")

    try:
        # construct_event verifies the signature and returns a stripe.Event.
        event = stripe.Webhook.construct_event(
            payload, sig_header, settings.stripe_webhook_secret
        )
    except ValueError:
        raise HTTPException(status_code=400, detail="Invalid payload")
    except stripe.SignatureVerificationError:
        raise HTTPException(status_code=400, detail="Invalid signature")

    # Idempotency: check if event already processed
    already_processed = await _check_event_processed(session, event["id"])
    if already_processed:
        return {"status": "already_processed"}

    # Record and dispatch in the same transaction, so a failed dispatch rolls
    # back the record and Stripe's retry is not skipped.
    await _record_event_processed(session, event["id"])
    await _dispatch_event(session, event)
    return {"status": "ok"}
```

**Idempotency table:** Add a `stripe_events` table with `(event_id PRIMARY KEY, processed_at)` — INSERT with ON CONFLICT DO NOTHING; if 0 rows affected, skip processing.
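
The rows-affected check can be sketched as follows. To stay runnable without a database server, this uses the standard-library `sqlite3` module as a stand-in for PostgreSQL (both accept `ON CONFLICT ... DO NOTHING`); `claim_event` is a hypothetical helper name.

```python
import sqlite3


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True if this call recorded the event, False if it was already recorded."""
    cur = conn.execute(
        "INSERT INTO stripe_events (event_id) VALUES (?) "
        "ON CONFLICT (event_id) DO NOTHING",
        (event_id,),
    )
    # rowcount is 1 when the row was inserted and 0 when the conflict was skipped.
    return cur.rowcount == 1


# In-memory stand-in for the stripe_events table described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stripe_events ("
    " event_id TEXT PRIMARY KEY,"
    " processed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)
```

Because the insert itself is the check, two concurrent deliveries of the same event cannot both see "not processed yet", which a separate SELECT followed by INSERT would allow.
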

**Non-payment enforcement:** When `subscription_status` has remained `past_due` beyond the grace period (configurable, suggest 7 days), set `Agent.is_active = False` for all tenant agents. The gateway/orchestrator already gates on `is_active`, so no further changes are needed.

### Pattern 4: BYO API Key Encryption (Envelope Encryption)

**What:** Tenant provides their OpenAI/Anthropic API key. We encrypt it before storing. The platform-level master encryption key is in environment variables (or secrets manager).

**Important:** Fernet uses AES-128-CBC + HMAC-SHA256, NOT AES-256. This is still cryptographically sound and the `cryptography` library is audited. CLAUDE.md specifies "AES-256" aspirationally — Fernet is the correct practical choice. Document this tradeoff in ADR-005.

**Schema — new table `tenant_llm_keys`:**
```sql
CREATE TABLE tenant_llm_keys (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id     UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    provider      TEXT NOT NULL,              -- 'openai' | 'anthropic' | 'custom'
    label         TEXT NOT NULL,              -- human-readable name
    encrypted_key TEXT NOT NULL,
    key_version   INT NOT NULL DEFAULT 1,     -- for rotation tracking
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(tenant_id, provider)               -- one key per provider per tenant
);
-- RLS enabled: same pattern as agents table
```

**Encryption service:**
```python
# Source: https://cryptography.io/en/latest/fernet/

from cryptography.fernet import Fernet, MultiFernet
import os


class KeyEncryptionService:
    """
    Encrypts/decrypts tenant BYO API keys.

    PLATFORM_ENCRYPTION_KEY env var must be a URL-safe base64 Fernet key.
    For rotation: PLATFORM_ENCRYPTION_KEY_PREVIOUS holds the prior key.
    """

    def __init__(self) -> None:
        primary = Fernet(os.environ["PLATFORM_ENCRYPTION_KEY"])
        keys = [primary]
        if prev := os.environ.get("PLATFORM_ENCRYPTION_KEY_PREVIOUS"):
            keys.append(Fernet(prev))
        self._fernet = MultiFernet(keys)

    def encrypt(self, plaintext: str) -> str:
        return self._fernet.encrypt(plaintext.encode()).decode()

    def decrypt(self, ciphertext: str) -> str:
        return self._fernet.decrypt(ciphertext.encode()).decode()

    def rotate(self, ciphertext: str) -> str:
        """Re-encrypt under the current primary key."""
        return self._fernet.rotate(ciphertext.encode()).decode()
```

**Key generation for setup:**
```bash
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```

**LiteLLM integration:** When routing LLM calls, check if the tenant has a BYO key for the requested provider. If yes, decrypt and inject into the LiteLLM call. Never log the decrypted key.
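
The lookup-and-inject step might look like the sketch below. It only builds the keyword arguments and does not import LiteLLM; `build_llm_call_kwargs` and its parameters are hypothetical, and the one assumption about LiteLLM is that its completion call accepts a per-request `api_key` argument.

```python
from collections.abc import Callable


def build_llm_call_kwargs(
    provider: str,
    model: str,
    tenant_encrypted_keys: dict[str, str],
    decrypt: Callable[[str], str],
) -> dict[str, str]:
    """Build completion kwargs, adding the tenant's own key when one exists.

    `tenant_encrypted_keys` maps provider name to the stored ciphertext
    (one row per provider in tenant_llm_keys); `decrypt` would be
    KeyEncryptionService.decrypt in the service shown above.
    """
    kwargs = {"model": model}
    encrypted = tenant_encrypted_keys.get(provider)
    if encrypted is not None:
        # Decrypt as late as possible and never write this value to logs.
        kwargs["api_key"] = decrypt(encrypted)
    return kwargs
```

When no BYO key exists the function leaves `api_key` out entirely, so the call falls back to whatever platform-level credentials are already configured.
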

### Pattern 5: Cost Dashboard — Audit Event Aggregation

**CRITICAL PREREQUISITE:** The current audit logger stores `model` in metadata but NOT token counts. The runner.py `log_llm_call` metadata must be extended before the cost dashboard can work.

**Required metadata fields to add to `log_llm_call`:**
```python
# In orchestrator/agents/runner.py — extend existing metadata dict:
metadata={
    "model": data.get("model", agent.model_preference),
    "provider": _extract_provider(data.get("model", "")),  # "openai" | "anthropic" | "ollama"
    "prompt_tokens": usage.get("prompt_tokens", 0),
    "completion_tokens": usage.get("completion_tokens", 0),
    "total_tokens": usage.get("total_tokens", 0),
    "cost_usd": _calculate_cost(model, usage),  # pre-calculated, stored as float
    "iteration": iteration,
    "tool_calls_count": len(response_tool_calls),
}
```

**Dashboard aggregation query:**
```sql
-- Token usage per agent for a time range
SELECT
    agent_id,
    SUM((metadata->>'prompt_tokens')::int)      AS prompt_tokens,
    SUM((metadata->>'completion_tokens')::int)  AS completion_tokens,
    SUM((metadata->>'total_tokens')::int)       AS total_tokens,
    SUM((metadata->>'cost_usd')::float)         AS cost_usd,
    COUNT(*)                                    AS llm_call_count
FROM audit_events
WHERE
    tenant_id = :tenant_id
    AND action_type = 'llm_call'
    AND created_at >= :start_date
    AND created_at < :end_date
GROUP BY agent_id;

-- Cost by provider
SELECT
    metadata->>'provider'                AS provider,
    SUM((metadata->>'cost_usd')::float)  AS cost_usd,
    COUNT(*)                             AS call_count
FROM audit_events
WHERE
    tenant_id = :tenant_id
    AND action_type = 'llm_call'
    AND created_at >= :start_date
    AND created_at < :end_date
GROUP BY metadata->>'provider';

-- Message volume by channel (count message events)
-- NOTE: this filters on 'llm_call', so it counts LLM calls per channel as a
-- proxy for message volume; if the audit log records a dedicated message
-- event type, filter on that instead.
SELECT
    metadata->>'channel'  AS channel,
    COUNT(*)              AS message_count
FROM audit_events
WHERE
    tenant_id = :tenant_id
    AND action_type = 'llm_call'
    AND created_at >= :start_date
    AND created_at < :end_date
GROUP BY metadata->>'channel';
```

**Index required:**
```sql
CREATE INDEX CONCURRENTLY idx_audit_events_tenant_type_created
    ON audit_events (tenant_id, action_type, created_at DESC);

-- Optional GIN index: speeds up JSONB containment and key-existence filters
-- (@>, ?); it does not accelerate the ->> aggregates above.
CREATE INDEX CONCURRENTLY idx_audit_events_metadata
    ON audit_events USING GIN (metadata);
```

**Time range options (Claude's discretion):** Offer Last 7 days / Last 30 days / This month / Custom range. Default to Last 30 days. Use a simple `<select>` driving a query param — no date picker library needed for v1.
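
On the backend, the preset ranges could be resolved into the `start_date` / `end_date` bounds used by the aggregation queries roughly as follows. The query-param values (`7d`, `30d`, `month`) are hypothetical, and a custom range is assumed to arrive as explicit dates handled elsewhere.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RANGE = "30d"  # "Last 30 days" is the suggested default


def resolve_range(key: str, now: datetime) -> tuple[datetime, datetime]:
    """Map a preset range key to a (start, end) pair, end-exclusive."""
    if key == "7d":
        return now - timedelta(days=7), now
    if key == "30d":
        return now - timedelta(days=30), now
    if key == "month":
        # "This month": from midnight on the 1st up to now.
        start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        return start, now
    raise ValueError(f"Unknown range key: {key!r}")
```

Passing `now` in as an argument, instead of calling `datetime.now()` inside, keeps the function deterministic in tests.
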

**Budget alert logic:** Compare `SUM(cost_usd)` against `Agent.budget_limit_usd` (new field). Visual indicator: amber at 80%, red at 100%. Render as colored badge on the cost dashboard row, not a modal.

### Pattern 6: WhatsApp Manual Setup (Claude's Discretion Recommendation)

Embedded Signup requires a registered BSP/Meta Tech Provider account — multi-week verification process. For v1, use **guided manual setup**:

1. Operator creates a Meta/Facebook developer account and a WhatsApp Business App
2. Portal shows step-by-step instructions with screenshots
3. Operator pastes: Phone Number ID, WhatsApp Business Account ID, permanent System User Token
4. Portal validates by calling `GET https://graph.facebook.com/v22.0/{phone_number_id}` with the token
5. If valid, store in `channel_connections.config` (token encrypted with Fernet)
6. Test message: send "Konstruct connected successfully" to operator's own WhatsApp number
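
The validation call in step 4 can be sketched with the standard library. This only constructs the request; `build_validation_request` is a hypothetical helper, and sending the token as a Bearer header is an assumption about how the portal passes it.

```python
import urllib.request

GRAPH_API_BASE = "https://graph.facebook.com/v22.0"


def build_validation_request(phone_number_id: str, token: str) -> urllib.request.Request:
    """Build the GET request used to check the pasted credentials."""
    return urllib.request.Request(
        f"{GRAPH_API_BASE}/{phone_number_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Sending it with `urllib.request.urlopen(req)` and treating an HTTP error as "invalid credentials" would complete the step; the production code would more likely use the project's async HTTP client.
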

### Anti-Patterns to Avoid

- **Do not verify Stripe webhooks manually** — always use `stripe.Webhook.construct_event()` with the endpoint secret; raw header parsing is error-prone
- **Do not store Slack bot tokens in plaintext** — encrypt with Fernet before writing to `channel_connections.config`
- **Do not update subscription quantity synchronously** — Stripe rate-limits if updated many times per hour; queue via Celery if high-frequency
- **Do not query audit_events without the composite index** — a full table scan on audit_events will be slow; the composite index on `(tenant_id, action_type, created_at)` is mandatory
- **Do not use `func.now()` for `trial_ends_at` calculation** — set it from the Stripe webhook response `subscription.trial_end` (Unix timestamp), not from local time

---

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| OAuth state CSRF protection | Custom state encoding | HMAC-signed state (shown above) | Replay attacks, timing attacks |
| Stripe signature verification | Manual HMAC check on raw bytes | `stripe.Webhook.construct_event()` | Handles timestamp tolerance, replay prevention |
| Subscription lifecycle state machine | Custom FSM | Stripe subscription `status` field + webhooks | Stripe handles payment retries, dunning, proration |
| API key encryption | Custom AES wrapper | `cryptography.fernet.Fernet` + `MultiFernet` | Audited, handles IV generation, MAC, key rotation |
| Billing UI (card updates, invoices) | Custom payment form | Stripe Billing Portal | PCI scope, card updating, invoice history — all free |
| Token cost calculation | Per-request cost estimation | Pre-calculate at log time using LiteLLM's `completion_cost()` | LiteLLM already tracks pricing per model; reuse it |

**Key insight:** Stripe and Slack provide hosted/SDK flows for the most security-sensitive parts. Never replicate what they already do correctly.

---

## Common Pitfalls

### Pitfall 1: Slack OAuth `state` Not Validated
**What goes wrong:** Attacker crafts a Slack OAuth callback with a valid `code` but forged `state`, linking their Slack workspace to a victim's tenant.
**Why it happens:** Treating OAuth state as opaque and skipping verification.
**How to avoid:** Always HMAC-sign the state before sending; verify the signature AND the `tenant_id` before exchanging the code.
**Warning signs:** No state validation in the callback handler.

### Pitfall 2: Stripe Webhook Raw Body Mangling
**What goes wrong:** Signature verification fails in production because a middleware (e.g., JSON parser) modifies the request body before the webhook handler reads it.
**Why it happens:** FastAPI's `Request.json()` parses the body; Stripe signatures are computed over the raw bytes.
**How to avoid:** Always read with `await request.body()` (raw bytes), not `await request.json()`.
**Warning signs:** `stripe.SignatureVerificationError` in production but not in local testing.

### Pitfall 3: Duplicate Webhook Processing
**What goes wrong:** Stripe delivers the same event twice (network retry); agent gets provisioned twice, subscription updated twice.
**Why it happens:** No idempotency guard on the webhook handler.
**How to avoid:** Store processed `event.id` in a `stripe_events` table; use `INSERT ... ON CONFLICT DO NOTHING` and check rows affected.
**Warning signs:** Duplicate `channel_connections` rows or double-charged agents.

### Pitfall 4: Fernet Key Not in Environment
**What goes wrong:** Application starts without `PLATFORM_ENCRYPTION_KEY` set; first BYO key encryption call throws `KeyError`.
**Why it happens:** Key not added to `.env.example` / Docker Compose environment.
**How to avoid:** Validate key presence at startup in `shared/config.py` using `pydantic-settings` required fields; fail fast with a clear error.
**Warning signs:** `KeyError: PLATFORM_ENCRYPTION_KEY` in logs.

### Pitfall 5: Audit Events Missing Token Metadata
**What goes wrong:** Cost dashboard shows zeros because `prompt_tokens` / `cost_usd` were never written to `audit_events.metadata`.
**Why it happens:** Runner logs the LLM call but doesn't extract token counts from the LiteLLM response object.
**How to avoid:** Extend `runner.py` metadata dict BEFORE Phase 3 dashboard work begins; backfill is impossible (audit log is append-only).
**Warning signs:** `metadata->>'prompt_tokens'` returns NULL in dashboard queries.

### Pitfall 6: Subscription Quantity Mismatch
**What goes wrong:** Operator creates 3 agents in the portal but Stripe still charges for 1.
**Why it happens:** Agent creation doesn't trigger a subscription quantity update.
**How to avoid:** On `POST /agents` and `DELETE /agents`, update Stripe subscription item quantity. Use a Celery task to avoid blocking the API response.
**Warning signs:** `stripe_subscription_item_id` is NULL; no Celery task defined for quantity sync.

---

## Code Examples

Verified patterns from official sources:

### Stripe Subscription Creation with Trial
```python
# Source: https://docs.stripe.com/api/subscriptions/create?lang=python

import stripe

client = stripe.StripeClient(api_key=settings.stripe_secret_key)

subscription = client.v1.subscriptions.create({
    "customer": customer_id,
    "trial_period_days": 14,
    "items": [{"price": settings.stripe_per_agent_price_id, "quantity": agent_count}],
})
# subscription.id → store as stripe_subscription_id
# subscription["items"].data[0].id → store as stripe_subscription_item_id
# (use bracket access for "items": attribute access can resolve to the
#  dict-style .items() method on SDK objects instead of the field)
```

### Stripe Subscription Quantity Update
```python
# Source: https://docs.stripe.com/api/subscription_items/update?lang=python

client.v1.subscription_items.update(
    tenant.stripe_subscription_item_id,
    {"quantity": new_agent_count},
)
```

### Fernet Encrypt/Decrypt API Key
```python
# Source: https://cryptography.io/en/latest/fernet/

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # run once, store in env
f = Fernet(key)
ciphertext = f.encrypt(b"sk-openai-key...").decode()
plaintext = f.decrypt(ciphertext.encode()).decode()
```

### MultiFernet Key Rotation
```python
# Source: https://cryptography.io/en/latest/fernet/

from cryptography.fernet import Fernet, MultiFernet
from sqlalchemy import select

# Step 1: Add new key to front, keep old key
new_fernet = MultiFernet([Fernet(new_key), Fernet(old_key)])

# Step 2: Rotate all existing ciphertexts in DB
# (runs inside an async function; `session` is an AsyncSession)
rows = await session.stream_scalars(select(TenantLlmKey))
async for row in rows:
    row.encrypted_key = new_fernet.rotate(row.encrypted_key.encode()).decode()
    row.key_version += 1
await session.commit()

# Step 3: Remove old key from env, restart
```

### Recharts Bar Chart for Token Usage
```tsx
// Source: recharts.org/api-reference
// Install: npm install recharts

import { BarChart, Bar, XAxis, YAxis, CartesianGrid, Tooltip, ResponsiveContainer } from "recharts";

<ResponsiveContainer width="100%" height={300}>
  <BarChart data={agentUsageData}>
    <CartesianGrid strokeDasharray="3 3" />
    <XAxis dataKey="agentName" />
    <YAxis />
    <Tooltip formatter={(value) => [`${value} tokens`, "Usage"]} />
    <Bar dataKey="total_tokens" fill="#6366f1" radius={[4, 4, 0, 0]} />
  </BarChart>
</ResponsiveContainer>
```

### SQLAlchemy Async Aggregate Query for Cost
```python
# Source: SQLAlchemy 2.0 docs + project pattern

from sqlalchemy import Float, Integer, cast, func, select
from shared.models.audit import AuditEvent

# NOTE: `metadata` is a reserved attribute name on SQLAlchemy declarative
# models; replace `AuditEvent.metadata` below with whichever attribute the
# model maps to the `metadata` JSONB column.
result = await session.execute(
    select(
        AuditEvent.agent_id,
        func.sum(
            cast(AuditEvent.metadata["total_tokens"].astext, Integer)
        ).label("total_tokens"),
        func.sum(
            cast(AuditEvent.metadata["cost_usd"].astext, Float)
        ).label("cost_usd"),
        func.count().label("call_count"),
    )
    .where(
        AuditEvent.tenant_id == tenant_id,
        AuditEvent.action_type == "llm_call",
        AuditEvent.created_at >= start_date,
    )
    .group_by(AuditEvent.agent_id)
)
```

---

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `middleware.ts` for auth guards | `proxy.ts` (named export `proxy`) | Next.js 16 | Already implemented in project; do not create middleware.ts |
| `stripe.Webhook.constructEvent()` (old SDK) | `client.parse_event_notification()` or `stripe.Webhook.construct_event()` | stripe-python v12+ | Both work; use `construct_event` for consistency with FastAPI raw body pattern |
| On-Premises WhatsApp API | Cloud API only | Oct 2025 (deprecated) | Must use Cloud API; On-Premises is unsupported |
| Fernet single key | MultiFernet for key rotation | Always available | Use MultiFernet from day one; single key is not rotatable without downtime |

**Deprecated/outdated:**
- `middleware.ts` in Next.js 16: renamed to `proxy.ts` with named export `proxy` — already handled
- WhatsApp On-Premises API: deprecated Oct 2025, Cloud API only
- `stripe.Customer.create()` old-style: use `StripeClient.v1.customers.create()` with new client pattern

---

## Open Questions
|
|
|
|
1. **LiteLLM `completion_cost()` availability**
|
|
- What we know: LiteLLM has a `completion_cost()` utility that calculates cost from model name + token counts
|
|
- What's unclear: Whether `llm-pool` service currently surfaces this in its response to the orchestrator, or whether the orchestrator calls LiteLLM directly
|
|
- Recommendation: Inspect `packages/llm-pool` to confirm response schema includes `usage` object with token counts; if not, add it as Wave 0
|
|
|
|
2. **Stripe Customer creation timing**
|
|
- What we know: `stripe_customer_id` needs to exist before Checkout Session creation
|
|
- What's unclear: Whether to create the Stripe Customer at tenant creation (Phase 1 migration needed) or lazily on first billing action
|
|
- Recommendation: Create lazily on first billing action — avoids creating Stripe customers for test/internal tenants
|
|
|
|
3. **Slack OAuth callback — Next.js Route Handler vs FastAPI**
|
|
- What we know: The OAuth redirect_uri must be a URL Konstruct controls; both portal (Next.js) and backend (FastAPI) can handle it
|
|
- What's unclear: Which service is public-facing for the OAuth callback
|
|
- Recommendation: Handle in Next.js Route Handler (`app/api/slack/callback/route.ts`) which calls the FastAPI backend to store the token; cleaner separation and avoids CORS complications
|
|
|
|
4. **Budget limit storage location**
|
|
- What we know: AGNT-07 requires per-agent budget limits; CONTEXT.md shows budget as visual alert
|
|
- What's unclear: Whether budget limit is per-agent on the `agents` table or per-tenant on the `tenants` table
|
|
- Recommendation: Add `budget_limit_usd` to the `Agent` model (per-agent is more flexible); default NULL means no limit
|
|
|
|
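The per-agent budget recommendation in question 4 can be sketched as a small pure function. This is a minimal illustration, not the project's implementation: `BudgetStatus`, `budget_status`, and `WARNING_RATIO` are hypothetical names, and the 80% and 100% thresholds are taken from the AGNT-07 rows of this document's test map. A `None` limit mirrors the proposed `budget_limit_usd` NULL default (no limit).

```python
from decimal import Decimal
from enum import Enum
from typing import Optional


class BudgetStatus(str, Enum):
    """Hypothetical alert levels for the cost dashboard."""
    NO_LIMIT = "no_limit"   # budget_limit_usd is NULL
    OK = "ok"               # below the warning threshold
    WARNING = "warning"     # at or above 80% of the limit
    EXCEEDED = "exceeded"   # at or above 100% of the limit


# Assumed warning threshold (80% of the limit).
WARNING_RATIO = Decimal("0.80")


def budget_status(
    spent_usd: Decimal, budget_limit_usd: Optional[Decimal]
) -> BudgetStatus:
    """Classify an agent's spend against its optional budget limit."""
    if budget_limit_usd is None:
        return BudgetStatus.NO_LIMIT
    # Note: with a limit of 0, any spend (including 0) counts as exceeded.
    if spent_usd >= budget_limit_usd:
        return BudgetStatus.EXCEEDED
    if spent_usd >= budget_limit_usd * WARNING_RATIO:
        return BudgetStatus.WARNING
    return BudgetStatus.OK
```

Using `Decimal` rather than `float` keeps the threshold comparison exact for currency amounts; whether the project stores `cost_usd` as a numeric or float column is not established here.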
---

## Validation Architecture

### Test Framework
| Property | Value |
|----------|-------|
| Framework | pytest 8.3.0 + pytest-asyncio 0.25.0 |
| Config file | `pyproject.toml` `[tool.pytest.ini_options]` |
| Quick run command | `pytest tests/unit -x` |
| Full suite command | `pytest tests/unit tests/integration -x` |

### Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| AGNT-07 | Token usage aggregation query returns correct totals | unit | `pytest tests/unit/test_usage_aggregation.py -x` | Wave 0 |
| AGNT-07 | Budget alert threshold triggers at 80% and 100% | unit | `pytest tests/unit/test_budget_alerts.py -x` | Wave 0 |
| LLM-03 | Fernet encrypt/decrypt round-trip preserves plaintext | unit | `pytest tests/unit/test_key_encryption.py -x` | Wave 0 |
| LLM-03 | MultiFernet key rotation re-encrypts without data loss | unit | `pytest tests/unit/test_key_encryption.py::test_rotation -x` | Wave 0 |
| PRTA-03 | Slack OAuth state HMAC generation + verification | unit | `pytest tests/unit/test_slack_oauth.py -x` | Wave 0 |
| PRTA-03 | Slack OAuth callback stores channel_connection correctly | integration | `pytest tests/integration/test_slack_oauth.py -x` | Wave 0 |
| PRTA-04 | Onboarding stepper transitions through all 3 steps | manual | — | manual-only (UI flow) |
| PRTA-04 | Test message send endpoint returns 200 with valid token | integration | `pytest tests/integration/test_channel_test_message.py -x` | Wave 0 |
| PRTA-05 | Stripe webhook handler ignores duplicate event IDs | unit | `pytest tests/unit/test_stripe_webhooks.py::test_idempotency -x` | Wave 0 |
| PRTA-05 | Subscription status updates on `customer.subscription.updated` | unit | `pytest tests/unit/test_stripe_webhooks.py::test_subscription_updated -x` | Wave 0 |
| PRTA-05 | Agent deactivation on subscription cancellation | unit | `pytest tests/unit/test_stripe_webhooks.py::test_cancellation -x` | Wave 0 |
| PRTA-06 | Cost aggregation query groups tokens by agent_id | unit | `pytest tests/unit/test_usage_aggregation.py::test_group_by_agent -x` | Wave 0 |
| PRTA-06 | Cost aggregation query groups cost by provider | unit | `pytest tests/unit/test_usage_aggregation.py::test_group_by_provider -x` | Wave 0 |

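The PRTA-03 row on OAuth state HMAC generation and verification could be exercised against something like the following. This is a minimal sketch using only the standard library; `generate_state`, `verify_state`, the payload fields, and the 10-minute expiry are assumptions for illustration, not the project's actual state format.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, safe to place in a query string."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    """Inverse of _b64: restore padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def generate_state(tenant_id: str, secret: bytes) -> str:
    """Build a signed `state` value binding the OAuth flow to a tenant."""
    payload = json.dumps(
        {"tenant_id": tenant_id, "nonce": secrets.token_urlsafe(16),
         "iat": int(time.time())},
        separators=(",", ":"),
    ).encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"


def verify_state(
    state: str, secret: bytes, max_age_seconds: int = 600
) -> Optional[str]:
    """Return the tenant_id if the state is authentic and fresh, else None."""
    try:
        payload_b64, signature_b64 = state.split(".", 1)
        payload = _unb64(payload_b64)
        signature = _unb64(signature_b64)
    except ValueError:  # malformed state (binascii.Error is a ValueError)
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(signature, expected):
        return None
    data = json.loads(payload)
    if time.time() - data["iat"] > max_age_seconds:
        return None
    return data["tenant_id"]
```

Because the state is self-verifying, the callback handler would not need a server-side session to recover the tenant; the nonce only makes each state value unique and does not by itself prevent replay within the expiry window.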
### Sampling Rate
- **Per task commit:** `pytest tests/unit -x`
- **Per wave merge:** `pytest tests/unit tests/integration -x`
- **Phase gate:** Full suite green before `/gsd:verify-work`

### Wave 0 Gaps
- [ ] `tests/unit/test_key_encryption.py` — covers LLM-03 (Fernet encrypt/decrypt/rotate)
- [ ] `tests/unit/test_slack_oauth.py` — covers PRTA-03 (state HMAC generation/verification)
- [ ] `tests/unit/test_stripe_webhooks.py` — covers PRTA-05 (idempotency, status updates, cancellation)
- [ ] `tests/unit/test_usage_aggregation.py` — covers AGNT-07, PRTA-06 (SQL aggregate queries)
- [ ] `tests/unit/test_budget_alerts.py` — covers AGNT-07 (threshold logic)
- [ ] `tests/integration/test_slack_oauth.py` — covers PRTA-03 (full callback flow)
- [ ] `tests/integration/test_channel_test_message.py` — covers PRTA-04 (test message endpoint)
- [ ] `packages/shared/shared/models/billing.py` — TenantBilling fields migration model
- [ ] `packages/shared/shared/models/billing.py` — TenantLlmKey model
- [ ] Alembic migration: `audit_events` metadata fields (`prompt_tokens`, `completion_tokens`, `cost_usd`, `provider`)
- [ ] Alembic migration: `tenant_llm_keys` table
- [ ] Alembic migration: `tenants` billing fields (`stripe_customer_id`, `stripe_subscription_id`, `stripe_subscription_item_id`, `subscription_status`, `trial_ends_at`, `agent_quota`)
- [ ] Alembic migration: `agents.budget_limit_usd` field
- [ ] Alembic migration: `stripe_events` idempotency table
- [ ] npm install: `recharts @stripe/stripe-js stripe` in `packages/portal`
- [ ] uv add: `stripe cryptography` in `packages/shared`

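The `stripe_events` idempotency table listed above could back a handler guard along these lines. This is a rough sketch that uses an in-memory SQLite database so it runs standalone; the single-column schema, `init_db`, and `handle_once` are hypothetical, and in the project's PostgreSQL setup the equivalent statement would be `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    """Create the idempotency table (assumed schema: one row per event ID)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stripe_events (event_id TEXT PRIMARY KEY)"
    )


def handle_once(
    conn: sqlite3.Connection,
    event: dict,
    handler: Callable[[dict], None],
) -> bool:
    """Run handler(event) only the first time this event ID is seen.

    Returns True if the handler ran, False for a duplicate delivery.
    """
    # `with conn` commits on success and rolls back if the handler raises,
    # so a failed event is not recorded and a later retry is processed.
    with conn:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO stripe_events (event_id) VALUES (?)",
            (event["id"],),
        )
        if cursor.rowcount == 0:
            return False  # primary key already present: duplicate event
        handler(event)
    return True


# Usage: the second delivery of the same event is ignored.
conn = sqlite3.connect(":memory:")
init_db(conn)
processed: list = []
first = handle_once(conn, {"id": "evt_1"}, processed.append)
second = handle_once(conn, {"id": "evt_1"}, processed.append)
```

Recording the event ID and running the handler in the same transaction is the design choice that matters here: a duplicate is skipped, while a handler failure leaves no row behind so the retried delivery is handled normally.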
---

## Sources

### Primary (HIGH confidence)
- [https://cryptography.io/en/latest/fernet/](https://cryptography.io/en/latest/fernet/) — Fernet spec, AES-128-CBC+HMAC-SHA256, MultiFernet rotation
- [https://docs.stripe.com/billing/subscriptions/webhooks](https://docs.stripe.com/billing/subscriptions/webhooks) — webhook event types, idempotency
- [https://docs.stripe.com/api/subscriptions/create?lang=python](https://docs.stripe.com/api/subscriptions/create?lang=python) — subscription creation with trial and quantity
- [https://docs.stripe.com/api/subscription_items/update?lang=python](https://docs.stripe.com/api/subscription_items/update?lang=python) — quantity update API
- [https://docs.stripe.com/customer-management/integrate-customer-portal](https://docs.stripe.com/customer-management/integrate-customer-portal) — Billing Portal session creation
- [https://docs.slack.dev/authentication/installing-with-oauth/](https://docs.slack.dev/authentication/installing-with-oauth/) — OAuth V2 flow
- [https://docs.slack.dev/apis/events-api/](https://docs.slack.dev/apis/events-api/) — scopes for Events API
- [https://docs.slack.dev/authentication/tokens/](https://docs.slack.dev/authentication/tokens/) — bot token types
- `packages/portal/node_modules/next/dist/docs/01-app/01-getting-started/16-proxy.md` — Next.js 16 proxy.ts (not middleware.ts)
- `packages/portal/node_modules/next/dist/docs/01-app/01-getting-started/15-route-handlers.md` — Route Handlers
- Existing project source: `packages/orchestrator/orchestrator/audit/logger.py`, `packages/shared/shared/models/audit.py`, `packages/shared/shared/models/tenant.py`

### Secondary (MEDIUM confidence)
- [https://docs.stripe.com/webhooks](https://docs.stripe.com/webhooks) — webhook signature verification in Python
- [https://docs.stripe.com/subscriptions/pricing-models/per-seat-pricing](https://docs.stripe.com/subscriptions/pricing-models/per-seat-pricing) — per-seat quantity model
- [https://www.speakeasy.com/blog/nivo-vs-recharts](https://www.speakeasy.com/blog/nivo-vs-recharts) — Recharts vs Nivo comparison (Recharts recommended)
- [https://npmtrends.com/chart.js-vs-highcharts-vs-nivo-vs-recharts](https://npmtrends.com/chart.js-vs-highcharts-vs-nivo-vs-recharts) — download statistics confirming Recharts dominance

### Tertiary (LOW confidence)
- WhatsApp Embedded Signup standard status (2026) — multiple secondary sources agree; not directly verified against Meta developer docs within this research session

---

## Metadata

**Confidence breakdown:**
- Standard stack: HIGH — all libraries verified against official docs or project package.json
- Slack OAuth flow: HIGH — verified against official Slack developer docs
- Stripe billing: HIGH — verified against official Stripe API reference and docs
- BYO key encryption: HIGH — verified against official cryptography.io docs; note Fernet is AES-128 not AES-256
- Cost aggregation: MEDIUM — SQL pattern is standard PostgreSQL JSONB; exact SQLAlchemy ORM casting syntax needs validation against project's asyncpg driver
- Recharts: MEDIUM — download stats verified via npmtrends; API verified against recharts.org
- WhatsApp manual setup: MEDIUM — On-Premises deprecation confirmed; manual setup steps derived from Meta developer docs indirectly

**Research date:** 2026-03-23
**Valid until:** 2026-04-23 (Stripe and Slack APIs are stable; Next.js 16 is current)