# Architecture Research

**Domain:** Channel-native AI workforce platform (multi-tenant, messaging-channel-first)
**Researched:** 2026-03-22
**Confidence:** HIGH (core patterns verified against official Slack docs, LiteLLM docs, pgvector community resources, and multiple production-pattern sources)

---

## Standard Architecture

### System Overview

```
External Channels (Slack, WhatsApp)
              │ HTTPS webhooks / Events API
              ▼
┌─────────────────────────────────────────────────────────────┐
│                        INGRESS LAYER                        │
│  ┌─────────────────────┐  ┌─────────────────────────────┐   │
│  │ Channel Gateway     │  │ Stripe Webhook Endpoint     │   │
│  │ (FastAPI service)   │  │ (billing events)            │   │
│  └──────────┬──────────┘  └─────────────────────────────┘   │
└─────────────│───────────────────────────────────────────────┘
              │ Normalized KonstructMessage
              ▼
┌─────────────────────────────────────────────────────────────┐
│                    MESSAGE ROUTING LAYER                    │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Message Router                                        │  │
│  │ - Tenant resolution (channel org → tenant_id)         │  │
│  │ - Per-tenant rate limiting (Redis token bucket)       │  │
│  │ - Context loading (tenant config, agent config)       │  │
│  │ - Idempotency check (Redis dedup key)                 │  │
│  └──────────────────────────┬────────────────────────────┘  │
└─────────────────────────────│───────────────────────────────┘
                              │ Enqueued task (Celery)
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  AGENT ORCHESTRATION LAYER                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Agent Orchestrator (per-tenant Celery worker)         │  │
│  │ - Agent context assembly (persona, tools, memory)     │  │
│  │ - Conversation history retrieval (Redis + pgvector)   │  │
│  │ - LLM call dispatch → LLM Backend Pool                │  │
│  │ - Tool execution (registry lookup + run)              │  │
│  │ - Response routing back to originating channel        │  │
│  └───────────────────────────────────────────────────────┘  │
└───────────────────┬────────────────────────┬────────────────┘
                    │                        │
                    ▼                        ▼
┌────────────────────────┐   ┌──────────────────────────────┐
│    LLM BACKEND POOL    │   │ TOOL EXECUTOR                │
│                        │   │ - Registry (tool → handler)  │
│  LiteLLM Router        │   │ - Execution (async/sync)     │
│  ├── Ollama (local)    │   │ - Result capture + logging   │
│  ├── Anthropic API     │   └──────────────────────────────┘
│  └── OpenAI API        │
└────────────┬───────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────┐
│                         DATA LAYER                          │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────┐  │
│  │ PostgreSQL   │  │ Redis        │  │ MinIO / S3        │  │
│  │ (+ pgvector  │  │ (sessions,   │  │ (file attach.,    │  │
│  │  + RLS)      │  │  rate limit, │  │  agent artifacts) │  │
│  │              │  │  task queue, │  │                   │  │
│  │ - tenants    │  │  pub/sub)    │  │                   │  │
│  │ - agents     │  └──────────────┘  └───────────────────┘  │
│  │ - messages   │                                           │
│  │ - tools      │                                           │
│  │ - billing    │                                           │
│  └──────────────┘                                           │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                        ADMIN PORTAL                         │
│  Next.js 14 App Router (separate deployment)                │
│  - Tenant management, agent config, billing, monitoring     │
│  - Reads/writes to FastAPI REST API (auth via JWT)          │
└─────────────────────────────────────────────────────────────┘
```

### Component Responsibilities

| Component | Responsibility | Communicates With |
|-----------|----------------|-------------------|
| **Channel Gateway** | Receive and verify inbound webhooks from Slack/WhatsApp; normalize to KonstructMessage; acknowledge within 3s | Message Router (HTTP or enqueue), Redis (idempotency) |
| **Message Router** | Resolve tenant from channel metadata; rate-limit per tenant; load tenant/agent context; enqueue to Celery | PostgreSQL (tenant lookup), Redis (rate limit + dedup), Celery (enqueue) |
| **Agent Orchestrator** | Assemble agent prompt from persona + memory + conversation history; call LLM; execute tools; emit response back to channel | LLM Backend Pool, Tool Executor, Memory Layer (Redis + pgvector), Channel Gateway (outbound) |
| **LLM Backend Pool** | Route LLM calls across Ollama/Anthropic/OpenAI with fallback, retry, and cost tracking | Ollama (local HTTP), Anthropic API, OpenAI API |
| **Tool Executor** | Maintain tool registry; execute tool calls from agent; return results; log every invocation for audit | External APIs (per-tool), PostgreSQL (audit log) |
| **Memory Layer** | Short-term: Redis sliding window for recent messages; Long-term: pgvector for semantic retrieval of past conversations | Redis, PostgreSQL (pgvector extension) |
| **Admin Portal** | UI for tenant CRUD, agent configuration, channel setup, billing, and usage monitoring | FastAPI REST API (authenticated) |
| **Billing Service** | Handle Stripe webhooks; update tenant subscription state; enforce feature limits based on plan | Stripe, PostgreSQL (subscription state) |

---

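
The KonstructMessage envelope that the gateway emits and the rest of the pipeline consumes can be sketched as follows. The project defines this as a Pydantic model in `shared/models/message.py`; the sketch below uses stdlib dataclasses so it runs without dependencies, and any field not named in the diagrams (`text`, `thread_id`) is an illustrative assumption.

```python
# Illustrative sketch of the KonstructMessage envelope (shared/models/message.py).
# The project uses Pydantic; stdlib dataclasses keep this sketch dependency-free.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KonstructMessage:
    id: str                          # channel-native event id (also the dedup key)
    channel: str                     # "slack" | "whatsapp"
    text: str                        # normalized message body (assumed field name)
    thread_id: str                   # conversation identifier within the channel
    channel_metadata: dict = field(default_factory=dict)  # e.g. {"workspace_id": "T123ABC"}
    tenant_id: Optional[str] = None  # resolved later by the Message Router

    def model_dump(self) -> dict:
        """Mirror of Pydantic's model_dump() so call sites read the same."""
        return {
            "id": self.id,
            "channel": self.channel,
            "text": self.text,
            "thread_id": self.thread_id,
            "channel_metadata": self.channel_metadata,
            "tenant_id": self.tenant_id,
        }
```

Note that `tenant_id` defaults to `None`: the gateway deliberately does not resolve tenancy, matching the thin-gateway boundary described above.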
## Recommended Project Structure

```
konstruct/
├── packages/
│   ├── gateway/                 # Channel Gateway service (FastAPI)
│   │   ├── channels/
│   │   │   ├── slack.py         # Slack Events API handler (HTTP mode)
│   │   │   └── whatsapp.py      # WhatsApp Cloud API webhook handler
│   │   ├── normalize.py         # → KonstructMessage
│   │   ├── verify.py            # Signature verification per channel
│   │   └── main.py              # FastAPI app, routes
│   │
│   ├── router/                  # Message Router service (FastAPI)
│   │   ├── tenant.py            # Channel org ID → tenant_id lookup
│   │   ├── ratelimit.py         # Redis token bucket per tenant
│   │   ├── idempotency.py       # Redis dedup (message_id key, TTL)
│   │   ├── context.py           # Load agent config from DB
│   │   └── main.py
│   │
│   ├── orchestrator/            # Agent Orchestrator (Celery workers)
│   │   ├── tasks.py             # Celery task: handle_message
│   │   ├── agents/
│   │   │   ├── builder.py       # Assemble agent (persona + tools + memory)
│   │   │   └── runner.py        # LLM call loop (reason → tool → observe)
│   │   ├── memory/
│   │   │   ├── short_term.py    # Redis sliding window (last N messages)
│   │   │   └── long_term.py     # pgvector semantic search
│   │   ├── tools/
│   │   │   ├── registry.py      # Tool name → handler function mapping
│   │   │   ├── executor.py      # Async execution + audit logging
│   │   │   └── builtins/        # Built-in tools (web search, calendar, etc.)
│   │   └── main.py              # Worker entry point
│   │
│   ├── llm-pool/                # LLM Backend Pool (LiteLLM wrapper)
│   │   ├── router.py            # LiteLLM Router config (model groups)
│   │   ├── providers/
│   │   │   ├── ollama.py
│   │   │   ├── anthropic.py
│   │   │   └── openai.py
│   │   └── main.py              # FastAPI app exposing /complete endpoint
│   │
│   ├── portal/                  # Next.js 14 Admin Dashboard
│   │   ├── app/
│   │   │   ├── (auth)/          # Login, signup routes
│   │   │   ├── dashboard/       # Post-auth layout
│   │   │   ├── tenants/         # Tenant management
│   │   │   ├── agents/          # Agent config
│   │   │   ├── billing/         # Stripe customer portal
│   │   │   └── api/             # Next.js API routes (thin proxy or auth only)
│   │   ├── components/          # shadcn/ui components
│   │   └── lib/
│   │       ├── api.ts           # TanStack Query hooks + API client
│   │       └── auth.ts          # NextAuth.js config
│   │
│   └── shared/                  # Shared Python library (no service)
│       ├── models/
│       │   ├── message.py       # KonstructMessage Pydantic model
│       │   ├── tenant.py        # Tenant, Agent SQLAlchemy models
│       │   └── billing.py       # Subscription, Plan models
│       ├── db.py                # SQLAlchemy async engine + session factory
│       ├── rls.py               # SET app.current_tenant helper
│       └── config.py            # Pydantic Settings (env vars)
│
├── migrations/                  # Alembic (single migration history)
├── tests/
│   ├── unit/
│   ├── integration/
│   └── e2e/
├── docker-compose.yml           # All services + infra (Redis, PG, MinIO, Ollama)
└── pyproject.toml               # uv workspace config, shared deps
```

### Structure Rationale

- **packages/ per service:** Each directory is a standalone FastAPI app or Celery worker with its own `main.py`. The boundary maps to a Docker container. Services communicate over HTTP or Celery/Redis, not in-process imports.
- **shared/:** Common Pydantic models and SQLAlchemy models live here to prevent duplication and drift. No business logic — only types, the DB session factory, and config.
- **gateway/ channels/:** Each channel adapter is a separate file, so adding a new channel (e.g., Telegram in v2) is an isolated change with no blast radius.
- **orchestrator/ memory/:** Short-term and long-term memory are separate modules because they have different backends, eviction policies, and query semantics.
- **portal/ app/:** Next.js App Router route grouping, with `(auth)` for pre-auth pages and `dashboard/` for post-auth, so layout boundaries are explicit.

---

## Architectural Patterns

### Pattern 1: Immediate-Acknowledge, Async-Process

**What:** The Channel Gateway returns HTTP 200 to Slack/WhatsApp within 3 seconds, without performing any LLM work. The actual processing is dispatched to Celery.

**When to use:** Always. Slack will retry the event and flag your app as unhealthy if it doesn't receive a 2xx within 3 seconds. The WhatsApp Cloud API requires sub-20s acknowledgment.

**Trade-offs:** Adds a Celery + Redis infrastructure requirement. The response to the user is sent as a follow-up message, not as the HTTP response — this is intentional and matches how Slack/WhatsApp users expect bots to behave anyway (typing indicator → message appears).

**Example:**

```python
# gateway/channels/slack.py
from slack_bolt.async_app import AsyncApp

# normalize_slack, is_duplicate, and handle_message_task come from the
# gateway/ and orchestrator/ packages described in the project structure.
app = AsyncApp()  # reads SLACK_BOT_TOKEN / SLACK_SIGNING_SECRET from the environment

@app.event("message")
async def handle_message(event, say, client):
    # 1. Normalize immediately
    msg = normalize_slack(event)
    # 2. Verify idempotency (skip duplicate events)
    if await is_duplicate(msg.id):
        return
    # 3. Enqueue for async processing — DO NOT call LLM here
    handle_message_task.delay(msg.model_dump())
    # Gateway returns 200 implicitly — Slack is satisfied
```

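
The `is_duplicate` check in step 2 is, in production, a single atomic Redis `SET key 1 NX EX 86400`. A dependency-free sketch of the same keep-for-TTL logic (shown synchronously, with an in-memory dict standing in for Redis) makes the semantics concrete:

```python
# Sketch of the idempotency check. In production: Redis SET <id> 1 NX EX 86400,
# which records the id and reports whether it already existed, atomically.
import time

_seen: dict[str, float] = {}  # message_id -> expiry timestamp (stands in for Redis)

def is_duplicate(message_id: str, ttl: float = 86_400.0) -> bool:
    """Return True if message_id was seen within the TTL; otherwise record it."""
    now = time.monotonic()
    expires_at = _seen.get(message_id)
    if expires_at is not None and expires_at > now:
        return True
    _seen[message_id] = now + ttl  # equivalent to SET ... NX EX ttl
    return False
```

The first call for a given id records it and returns `False`; every retry of the same Slack event within the TTL returns `True` and is dropped.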
### Pattern 2: Tenant-Scoped RLS via SQLAlchemy Event Hook

**What:** Set `app.current_tenant` on the PostgreSQL connection immediately after acquiring it from the pool. RLS policies use this setting to filter every query automatically, so application code never manually adds `WHERE tenant_id = ...`.

**When to use:** Every DB interaction in the Message Router and Agent Orchestrator.

**Trade-offs:** Requires careful pool management — connections must be reset before returning to the pool. The `sqlalchemy-tenants` library or a custom `before_cursor_execute` event listener handles this.

**Example:**

```python
# shared/rls.py
from sqlalchemy import event

from shared.db import engine  # async engine; listeners attach to its sync facade

@event.listens_for(engine.sync_engine, "before_cursor_execute")
def set_tenant_context(conn, cursor, statement, parameters, context, executemany):
    tenant_id = get_current_tenant_id()  # from contextvars
    if tenant_id:
        # Parameterized set_config() instead of a string-formatted SET:
        # avoids SQL injection through tenant_id
        cursor.execute(
            "SELECT set_config('app.current_tenant', %s, false)", (str(tenant_id),)
        )
```

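
The `get_current_tenant_id()` helper the listener relies on is a thin wrapper over a `contextvars.ContextVar`, so each async task or thread sees only the tenant the Message Router set for it. A minimal sketch (function names are the ones assumed above; the setter name is an assumption):

```python
# Tenant-context helper assumed by the RLS listener: the Message Router calls
# set_current_tenant() once per task; get_current_tenant_id() reads it back,
# isolated per async task / thread via contextvars.
from contextvars import ContextVar
from typing import Optional

_current_tenant: ContextVar[Optional[str]] = ContextVar("current_tenant", default=None)

def set_current_tenant(tenant_id: str) -> None:
    _current_tenant.set(tenant_id)

def get_current_tenant_id() -> Optional[str]:
    return _current_tenant.get()
```

Because contextvars propagate into `await`-ed coroutines but not across unrelated tasks, a tenant set in one Celery task's context can never leak into another's DB session.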
### Pattern 3: Two-Tier Agent Memory

**What:** Combine Redis (fast, ephemeral) for short-term context and pgvector (persistent, semantic) for long-term recall. The agent always has the last N messages in context (Redis sliding window). For deeper history, the orchestrator optionally queries pgvector for semantically similar past exchanges.

**When to use:** Every agent invocation. Short-term is mandatory; long-term retrieval is triggered when the conversation references past events or when context window pressure requires compressing history.

**Trade-offs:** Two backends to operate and keep in sync. A background Celery task flushes Redis conversation state to PostgreSQL/pgvector asynchronously — if it fails, recent messages may not be permanently indexed, but conversation continuity is preserved by Redis until the flush succeeds.

**Example flow:**

```
User message arrives
  → Load last 20 messages from Redis (short-term)
  → Optionally: similarity search pgvector for relevant past conversations
  → Build context window: [system prompt] + [retrieved history] + [recent messages]
  → LLM call
  → Append response to Redis sliding window
  → Background task: embed + store to pgvector
```

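
The short-term tier's keep-last-N behavior (in production a Redis `LPUSH` + `LTRIM` on a tenant-namespaced key, as in `orchestrator/memory/short_term.py`) can be sketched with a bounded deque; the class name is illustrative:

```python
# Sketch of the short-term sliding window. A deque with maxlen reproduces the
# Redis LPUSH + LTRIM keep-last-N semantics: appending past capacity silently
# evicts the oldest message.
from collections import deque

class SlidingWindowMemory:
    def __init__(self, max_messages: int = 20):
        self._window: deque[dict] = deque(maxlen=max_messages)

    def append(self, message: dict) -> None:
        self._window.append(message)   # LPUSH + LTRIM equivalent

    def recent(self) -> list[dict]:
        return list(self._window)      # oldest → newest, for prompt assembly
```

Eviction here is purely positional; anything older than the last N messages is only recoverable through the long-term pgvector tier.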
### Pattern 4: LiteLLM Router as Internal Singleton

**What:** The LLM Backend Pool exposes a single internal HTTP endpoint (`/complete`). All orchestrator workers call this endpoint. The LiteLLM Router behind it handles provider selection, fallback chains, and cost tracking without the orchestrator needing to know which model is used.

**When to use:** All LLM calls. Never call the Anthropic/OpenAI SDKs directly from the orchestrator — always go through the pool.

**Trade-offs:** Adds one network hop per LLM call. This is acceptable — LiteLLM's own benchmarks show 8ms P95 overhead at 1k RPS.

**Configuration example:**

```python
# llm-pool/router.py
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "fast", "litellm_params": {"model": "ollama/qwen3:8b", "api_base": "http://ollama:11434"}},
        {"model_name": "quality", "litellm_params": {"model": "anthropic/claude-sonnet-4-20250514"}},
        # Same model_name → second deployment in the "quality" group (load-balanced)
        {"model_name": "quality", "litellm_params": {"model": "openai/gpt-4o"}},
    ],
    fallbacks=[{"quality": ["fast"]}],  # if the quality group fails, fall back to fast
    routing_strategy="latency-based-routing",
)
```

---

## Data Flow

### Inbound Message Flow (Happy Path)

```
User sends message in Slack
  │
  ▼ HTTPS POST (Events API, HTTP mode)
Channel Gateway
  │ verify Slack signature (X-Slack-Signature)
  │ normalize → KonstructMessage(id, tenant_id=None, channel=slack, ...)
  │ check Redis idempotency key (message_id) → not seen
  │ set Redis idempotency key (TTL 24h)
  │ enqueue Celery task: handle_message(message)
  │ return HTTP 200 immediately
  ▼
Celery Broker (Redis)
  │
  ▼
Agent Orchestrator (Celery worker)
  │ resolve tenant_id from channel_metadata.workspace_id → PostgreSQL
  │ load agent config (persona, model preference, tools) → PostgreSQL (RLS-scoped)
  │ load short-term memory → Redis (last 20 messages for this thread_id)
  │ optionally query pgvector for relevant past context
  │ assemble prompt: system_prompt + memory + current message
  │ POST /complete → LLM Backend Pool
  │ LiteLLM Router selects provider, executes, returns response
  │ parse response: text reply OR tool_call
  │ if tool_call: Tool Executor.run(tool_name, args) → external API → result
  │ append result to prompt, re-call LLM if needed
  │ write KonstructMessage + response to PostgreSQL (audit)
  │ update Redis sliding window with new messages
  │ background: embed messages → pgvector
  ▼
Channel Gateway (outbound)
  │ POST message back to Slack via slack-sdk client.chat_postMessage()
  ▼
User sees response in Slack
```

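
The "verify Slack signature" step above follows Slack's documented v0 scheme: HMAC-SHA256 over `v0:{timestamp}:{raw body}` with the app's signing secret, compared against the `X-Slack-Signature` header. A stdlib sketch of what `gateway/verify.py` would do (function name is illustrative):

```python
# Slack request signature verification (X-Slack-Signature, v0 scheme):
# HMAC-SHA256 over "v0:<X-Slack-Request-Timestamp>:<raw body>" with the
# signing secret, hex-encoded and prefixed with "v0=".
import hashlib
import hmac
import time

def verify_slack_signature(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, max_age: int = 300) -> bool:
    # Reject stale timestamps to block replay attacks
    if abs(time.time() - int(timestamp)) > max_age:
        return False
    basestring = b"v0:" + timestamp.encode() + b":" + body
    expected = "v0=" + hmac.new(signing_secret.encode(), basestring,
                                hashlib.sha256).hexdigest()
    # Constant-time comparison prevents timing attacks
    return hmac.compare_digest(expected, signature)
```

The raw request body must be used before any JSON parsing; re-serializing the parsed payload will not round-trip byte-for-byte and the digest will not match.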
### Tenant Resolution Flow

```
KonstructMessage.channel_metadata = {"workspace_id": "T123ABC"}
  │
  ▼ Router: SELECT tenant_id FROM channel_connections
           WHERE channel_type = 'slack' AND external_org_id = 'T123ABC'
  │
  ▼ tenant_id resolved → stored in Python contextvar for RLS
  │
All subsequent DB queries automatically scoped by RLS policy:
  CREATE POLICY tenant_isolation ON agents
    USING (tenant_id = current_setting('app.current_tenant')::uuid);
```

### Admin Portal Data Flow

```
Browser → Next.js App Router (RSC or client component)
  │ TanStack Query useQuery / useMutation
  ▼
FastAPI REST API (authenticated endpoint)
  │ JWT verification (NextAuth.js token)
  │ Tenant scope enforced (user.tenant_id from token)
  ▼
PostgreSQL (RLS active: queries scoped to token's tenant)
```

### Billing Event Flow

```
Stripe subscription event (checkout.session.completed, etc.)
  │ HTTPS POST to /webhooks/stripe
  ▼
Billing endpoint (FastAPI)
  │ verify Stripe webhook signature (stripe-signature header)
  │ parse event type
  │ update tenants.subscription_status, plan_tier in PostgreSQL
  │ if downgrade: update agent count limit, feature flags
  ▼
Next message processed by Router picks up new plan limits
```

---

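
The Stripe signature check in the flow above uses Stripe's v1 scheme: the `stripe-signature` header carries `t=<timestamp>,v1=<hexdigest>`, where the digest is HMAC-SHA256 over `{t}.{raw payload}` with the endpoint secret. In production you would call `stripe.Webhook.construct_event` from the official SDK; the stdlib sketch below (which omits the SDK's timestamp-tolerance check) shows what it verifies:

```python
# Stripe webhook signature verification, v1 scheme: HMAC-SHA256 over
# "<t>.<raw payload>" with the endpoint secret, compared to the v1= value
# in the stripe-signature header. Timestamp-tolerance check omitted here.
import hashlib
import hmac

def verify_stripe_signature(payload: bytes, header: str, secret: str) -> bool:
    parts = dict(kv.split("=", 1) for kv in header.split(","))
    signed_payload = parts["t"].encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, parts["v1"])
```

As with Slack, the raw request body must be verified before JSON parsing, and handlers should stay idempotent since Stripe redelivers events.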
## Integration Points

### External Services

| Service | Integration Pattern | Key Requirements | Notes |
|---------|---------------------|------------------|-------|
| **Slack** | HTTP Events API (webhook) — NOT Socket Mode | Public HTTPS URL with valid TLS; respond 200 within 3s | Socket Mode only for local dev. HTTP required for production and any future Marketplace distribution. Slack explicitly recommends HTTP for production reliability. |
| **WhatsApp Cloud API** | Meta webhook (HTTPS POST) | TLS required (no self-signed); verify token for subscription; 200 response within 20s | Meta has fully deprecated the on-premise option. Cloud API is now the only supported path. |
| **LiteLLM** | In-process Python SDK OR sidecar HTTP proxy | Ollama running as Docker service; Anthropic/OpenAI API keys | Run as a separate service for isolation, or as an embedded router in the orchestrator. Separate service recommended for cost tracking and rate limiting. |
| **Stripe** | Webhook (HTTPS POST) | Signature verification via `stripe.WebhookSignature`; idempotent event handlers | Use Stripe's hosted billing portal for self-service plan changes — avoids building custom subscription UI. |
| **Ollama** | HTTP (Docker network) | GPU passthrough optional; accessible on internal Docker network | `http://ollama:11434` on the compose network. No auth required on the internal network. |

### Internal Service Boundaries

| Boundary | Communication | Protocol | Notes |
|----------|---------------|----------|-------|
| Gateway → Router | Direct HTTP POST (on same Docker network) or shared Celery queue | HTTP or Celery | For v1 simplicity, Gateway can enqueue directly to Celery, bypassing a separate Router HTTP call |
| Router → Orchestrator | Celery task via Redis broker | Celery/Redis | Decouples ingress from processing; enables retries, dead-letter queue, and horizontal scaling of workers |
| Orchestrator → LLM Pool | Internal HTTP POST | HTTP (FastAPI) | Keeps LLM routing concerns isolated; allows pool to be scaled independently |
| Orchestrator → Channel Gateway (outbound) | Direct Slack/WhatsApp SDK calls | HTTPS (external) | Orchestrator holds channel credentials and calls the appropriate SDK directly for responses |
| Portal → API | REST over HTTPS | HTTP (FastAPI) | Portal never accesses DB directly — all reads/writes through authenticated API |
| Any service → PostgreSQL | SQLAlchemy async (asyncpg driver) | TCP | RLS enforced; tenant context set before every query |
| Any service → Redis | aioredis / redis-py async | TCP | Namespaced by tenant_id to prevent accidental cross-tenant access |

---

## Build Order (Dependency Graph)

Building the wrong component first creates integration debt. The correct order:

```
Phase 1 — Foundation (build this first)
│
├── 1. Shared models + DB schema (Pydantic models, SQLAlchemy models, Alembic migrations)
│       └── Required by: every other service
│
├── 2. PostgreSQL + Redis + Docker Compose dev environment
│       └── Required by: everything
│
├── 3. Channel Gateway — Slack adapter only
│       └── Unblocks: end-to-end message flow testing
│
├── 4. Message Router — tenant resolution + rate limiting
│       └── Unblocks: scoped agent invocation
│
├── 5. LLM Backend Pool — LiteLLM with Ollama + Anthropic
│       └── Unblocks: agent can actually generate responses
│
└── 6. Agent Orchestrator — single agent, no tools, no memory
        └── First working end-to-end: Slack message → LLM response → Slack reply

Phase 2 — Feature Completeness
│
├── 7. Memory Layer (Redis short-term + pgvector long-term)
│       └── Depends on: working orchestrator
│
├── 8. Tool Framework (registry + executor + first built-in tools)
│       └── Depends on: working orchestrator
│
├── 9. WhatsApp channel adapter in Gateway
│       └── Mostly isolated: same normalize.py, new channel handler
│
├── 10. Admin Portal (Next.js) — tenant CRUD + agent config
│       └── Depends on: stable DB schema (stabilizes after step 8)
│
└── 11. Billing integration (Stripe webhooks + subscription enforcement)
        └── Depends on: tenant model, admin portal
```

**Key dependency insight:** Steps 1-6 must be strictly sequential. Steps 7-11 can overlap after step 6 is working, but the portal (10) and billing (11) should not be started until the DB schema is stable, which happens after memory and tools are defined (steps 7-8).

---

## Scaling Considerations

| Scale | Architecture Adjustments |
|-------|--------------------------|
| 0-100 tenants (beta) | Single Docker Compose host. One Celery worker process. All services on same machine. PostgreSQL RLS sufficient. |
| 100-1k tenants | Scale Celery workers horizontally (multiple replicas). Separate Redis for broker vs. cache. Add connection pooling (PgBouncer). Consider moving Ollama to a dedicated GPU host. |
| 1k-10k tenants | Kubernetes (k3s). Multiple Gateway replicas behind a load balancer. Celery worker auto-scaling. PostgreSQL read replica for analytics/portal queries. Qdrant for vector search at scale (pgvector starts to slow above ~1M embeddings). |
| 10k+ tenants | Schema-per-tenant for Enterprise tier. Dedicated inference cluster. Multi-region PostgreSQL (Citus or regional replicas). |

### Scaling Priorities

1. **First bottleneck:** Celery workers during LLM call bursts. LLM calls are slow (2-30s), so workers pile up. Fix: increase worker count, implement per-tenant concurrency limits, add request coalescing for burst traffic.
2. **Second bottleneck:** PostgreSQL connection exhaustion under concurrent tenant load. Fix: PgBouncer transaction-mode pooling. This is critical early because each Celery worker opens its own SQLAlchemy async session.
3. **Third bottleneck:** pgvector query latency as embedding count grows. Fix: HNSW index tuning, then migrate to Qdrant for the vector tier while keeping PostgreSQL for structured data.

---

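
The per-tenant token bucket referenced above (in `router/ratelimit.py` and in the first-bottleneck fix) is usually a small atomic script against Redis; a pure-Python sketch shows the refill-and-spend algorithm itself (class and method names are illustrative):

```python
# Token bucket rate limiting: tokens refill continuously at `rate` per second
# up to `capacity` (the burst size); each request spends one token. In
# production the state lives in Redis, one bucket per tenant_id.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # caller rejects or defers the tenant's request
```

With Redis, the read-refill-spend sequence must be atomic (typically a short Lua script) so that concurrent workers serving the same tenant cannot double-spend tokens.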
## Anti-Patterns

### Anti-Pattern 1: Doing LLM Work Inside the Webhook Handler

**What people do:** Call the LLM synchronously inside the Slack event handler or WhatsApp webhook endpoint and return the AI response as the HTTP reply.

**Why it's wrong:** Slack requires HTTP 200 within 3 seconds. OpenAI/Anthropic calls routinely take 5-30 seconds. The webhook times out, Slack retries the event (causing duplicate processing), and the app gets flagged as unreliable.

**Do this instead:** Acknowledge immediately (HTTP 200), enqueue to Celery, and send the AI response as a follow-up message via the channel API.

### Anti-Pattern 2: Shared Redis Namespace Across Tenants

**What people do:** Store conversation history as `redis.set("history:{thread_id}", ...)` without scoping by tenant.

**Why it's wrong:** `thread_id` values can collide between tenants (e.g., two tenants both have Slack thread `C123/T456`). Tenant A reads Tenant B's conversation history.

**Do this instead:** Always namespace Redis keys as `{tenant_id}:{key_type}:{resource_id}`. Example: `tenant_abc123:history:slack_C123_T456`.

### Anti-Pattern 3: Calling LLM Providers Directly from Orchestrator

**What people do:** Import the `anthropic` SDK directly in the orchestrator and call `anthropic.messages.create(...)`.

**Why it's wrong:** Bypasses the LiteLLM router, losing fallback behavior, cost tracking, rate limit enforcement, and the ability to switch providers without touching orchestrator code.

**Do this instead:** All LLM calls go through the LLM Backend Pool service (or an embedded LiteLLM Router). The orchestrator sends a generic `complete(messages, model_group="quality")` call and the pool handles provider selection.

### Anti-Pattern 4: Fat Channel Gateway

**What people do:** Add tenant resolution, rate limiting, and business logic to the gateway service to "simplify" the architecture.

**Why it's wrong:** The gateway must respond in under 3 seconds and must stay stateless to handle channel-specific webhook verification. Mixing in business logic couples the gateway to your domain model and makes it impossible to scale independently.

**Do this instead:** The gateway does exactly three things: verify signature, normalize message, enqueue. All business logic lives downstream.

### Anti-Pattern 5: Embedding Agent Memory in PostgreSQL Without an Index

**What people do:** Store conversation embeddings in a `vector` column in PostgreSQL and run similarity queries without an HNSW or IVFFlat index.

**Why it's wrong:** pgvector without an index performs sequential scans. With more than ~50k embeddings per tenant, queries slow to seconds.

**Do this instead:** Create HNSW indexes on vector columns from the start. `CREATE INDEX ON conversation_embeddings USING hnsw (embedding vector_cosine_ops);`

---

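
The namespacing rule from Anti-Pattern 2 is worth centralizing in one helper so no call site hand-builds keys; a minimal sketch (the function name is an assumption):

```python
# Single place that builds tenant-scoped Redis keys, so colliding thread ids
# across tenants can never alias each other's data.
def redis_key(tenant_id: str, key_type: str, resource_id: str) -> str:
    if not tenant_id:
        raise ValueError("refusing to build an un-scoped Redis key")
    return f"{tenant_id}:{key_type}:{resource_id}"
```

Routing every Redis access through a helper like this also gives one choke point for auditing key patterns during a tenant-isolation review.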
## Sources

- [Slack: Comparing HTTP and Socket Mode](https://docs.slack.dev/apis/events-api/comparing-http-socket-mode/) — MEDIUM confidence (official Slack docs, accessed 2026-03-22)
- [Slack: Using Socket Mode](https://docs.slack.dev/apis/events-api/using-socket-mode/) — HIGH confidence (official)
- [LiteLLM: Router Architecture](https://docs.litellm.ai/docs/router_architecture) — HIGH confidence (official LiteLLM docs)
- [LiteLLM: Routing and Load Balancing](https://docs.litellm.ai/docs/routing) — HIGH confidence (official)
- [Redis: AI Agent Memory Architecture](https://redis.io/blog/ai-agent-memory-stateful-systems/) — MEDIUM confidence (official Redis blog)
- [Redis: AI Agent Architecture 2026](https://redis.io/blog/ai-agent-architecture/) — MEDIUM confidence (official Redis blog)
- [Crunchy Data: Row Level Security for Tenants](https://www.crunchydata.com/blog/row-level-security-for-tenants-in-postgres) — HIGH confidence (authoritative PostgreSQL resource)
- [AWS: Multi-Tenant Data Isolation with PostgreSQL RLS](https://aws.amazon.com/blogs/database/multi-tenant-data-isolation-with-postgresql-row-level-security/) — MEDIUM confidence
- [DEV Community: Building WhatsApp Business Bots](https://dev.to/achiya-automation/building-whatsapp-business-bots-with-the-official-api-architecture-webhooks-and-automation-1ce4) — LOW confidence (community post)
- [ChatArchitect: Scalable Webhook Architecture for WhatsApp](https://www.chatarchitect.com/news/building-a-scalable-webhook-architecture-for-custom-whatsapp-solutions) — LOW confidence (community)
- [PyWa documentation](https://pywa.readthedocs.io/en/1.6.0/) — MEDIUM confidence (library docs)
- [fast.io: Multi-Tenant AI Agent Architecture](https://fast.io/resources/ai-agent-multi-tenant-architecture/) — LOW confidence (vendor blog)
- [Microsoft Learn: AI Agent Orchestration Patterns](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/ai-agent-design-patterns) — MEDIUM confidence
- [DEV Community: Webhooks at Scale — Idempotency](https://dev.to/art_light/webhooks-at-scale-designing-an-idempotent-replay-safe-and-observable-webhook-system-7lk) — LOW confidence (community post, pattern well-corroborated)

---

*Architecture research for: Konstruct — channel-native AI workforce platform*
*Researched: 2026-03-22*