docs(02-agent-features): research phase domain
`.planning/phases/02-agent-features/02-RESEARCH.md`

# Phase 2: Agent Features - Research

**Researched:** 2026-03-23
**Domain:** Conversational memory (Redis + pgvector), tool framework, WhatsApp Cloud API, human escalation, audit logging, multimodal media
**Confidence:** HIGH (core patterns verified against pgvector docs, LiteLLM docs, Meta developer docs, and official library sources)

---

<user_constraints>

## User Constraints (from CONTEXT.md)

### Locked Decisions

**Conversational Memory:**
- Full conversation history stored in pgvector — no messages are dropped
- Vector retrieval surfaces relevant past context when assembling the LLM prompt (not a full history dump)
- Cross-conversation memory — agent remembers user preferences and context across separate conversations
- Memory keyed per-user per-agent — memory follows the user across channels (the same agent remembers you in Slack and WhatsApp)
- Indefinite retention — memory never expires; operator can purge manually if needed
- Sliding window for immediate context (last N messages verbatim in prompt), vector search for older/cross-conversation context

**Tool Framework:**
- 4 built-in tools for v1: web search, knowledge base search, HTTP request, calendar lookup
- Knowledge base content populated via both file upload (PDFs, docs) and URL ingestion (crawl, chunk, embed)
- Seamless tool usage — agent incorporates tool results naturally, does NOT announce "let me look that up"
- Always confirm before consequential actions — agent asks before booking calendar slots, sending HTTP requests, or any side-effecting action; read-only tools (search, KB lookup) execute without confirmation
- Tool invocations are schema-validated before execution — prevents prompt injection into tool arguments
- Every tool invocation logged in the audit trail

**Human Escalation:**
- Handoff destination: DM to an assigned human (configured per tenant in Agent Designer)
- Full conversation transcript included in the DM — human sees complete context
- Agent stays in the thread as an assistant after escalation — defers to the human for end-user responses
- Natural-language escalation ("can I talk to a human?") is configurable per tenant
- Configurable rule-based escalation triggers (failed resolution attempts, billing disputes, etc.)

**WhatsApp Interaction Model:**
- Same persona across Slack and WhatsApp — consistent employee identity
- Business-function scoping: layered enforcement — explicit allowlist first (canned rejection for clearly off-topic, no LLM call), then role-based LLM handling for edge cases
- Operator defines allowed business functions in Agent Designer
- Off-topic messages get a polite redirect

**Media Support (All Channels):**
- Bidirectional media support across Slack and WhatsApp
- Agent can RECEIVE images and documents and interpret them via a multimodal LLM
- Agent can SEND images and documents back to users
- KonstructMessage format must be extended to handle media attachments (image URLs, file references)
- Media stored in MinIO (self-hosted) / S3 with per-tenant isolation

**Audit Logging:**
- Every LLM call, tool invocation, and handoff event recorded in an immutable audit trail
- Audit entries include: timestamp, tenant_id, agent_id, user_id, action_type, input/output summary, latency
- Queryable by tenant — operators can review agent actions
- Audit data feeds into the Phase 3 cost tracking dashboard

### Claude's Discretion

- Sliding window size (how many recent messages kept verbatim)
- Vector similarity threshold for memory retrieval
- KB chunking strategy and embedding model choice
- Calendar lookup integration approach (Google Calendar API vs generic iCal)
- Web search provider (Brave Search API, SerpAPI, etc.)
- HTTP request tool: timeout limits, allowed methods, response size caps
- WhatsApp message template format for outbound media
- Audit log storage strategy (PostgreSQL table vs append-only log)

### Deferred Ideas (OUT OF SCOPE)

None — discussion stayed within phase scope

</user_constraints>

---

<phase_requirements>

## Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| CHAN-03 | User can interact with AI employee via WhatsApp Business Cloud API | WhatsApp Cloud API webhook pattern; httpx direct integration; per-tenant phone_number_id isolation |
| CHAN-04 | WhatsApp adapter enforces business-function scoping per Meta 2026 policy | Meta Jan 2026 ban on general-purpose chatbots; two-tier allowlist + LLM gate pattern |
| AGNT-02 | Agent maintains conversational memory within sessions (sliding window) | Redis LRANGE sliding window; last-N-messages verbatim in prompt; existing `redis_keys.py` pattern |
| AGNT-03 | Agent retrieves relevant past context via vector search (pgvector long-term memory) | pgvector HNSW index; `sentence-transformers` for local embedding; tenant_id pre-filter mandatory |
| AGNT-04 | Agent can invoke registered tools to perform actions (tool registry + execution) | Tool registry dict pattern; Pydantic schema validation before execution; 4 built-in tools defined |
| AGNT-05 | Agent escalates to human when configured rules trigger, transferring full conversation context | Slack DM via httpx (no slack-bolt in orchestrator); full transcript packaging; agent stays as assistant |
| AGNT-06 | Every agent action (LLM call, tool invocation, handoff) is logged in an audit trail | PostgreSQL append-only audit table; immutable via REVOKE UPDATE/DELETE; queryable by tenant |

</phase_requirements>

---

## Summary

## Summary

Phase 2 transforms the basic LLM-call pipeline from Phase 1 into a capable AI employee. The four work streams are: (1) a conversational memory layer using a Redis sliding window plus pgvector long-term vector storage, (2) a tool framework with schema-validated execution and 4 built-in tools, (3) a WhatsApp channel adapter with Meta 2026 policy compliance, and (4) human escalation/handoff with full context transfer.

The memory layer is the most architecturally complex piece. The decision to use full-history pgvector storage (never dropping messages) means the embedding backfill pipeline must be designed carefully — a Celery task embeds and writes messages asynchronously after each conversation turn. The sliding window (Redis LRANGE) handles the last N turns verbatim, while pgvector similarity search supplies older or cross-conversation context. The HNSW index MUST include a `WHERE tenant_id = $1` pre-filter on every query — ANN indexes cannot prune by tenant, so unfiltered queries could return other tenants' embeddings.

WhatsApp integration builds directly on the existing Slack adapter pattern (`normalize.py`, `channels/`, signature verification) but requires careful handling of Meta's 2026 business-function scoping policy. The two-tier gate — allowlist check first (no LLM call), then role-based LLM for edge cases — satisfies Meta's "clear, predictable results tied to business messaging" requirement. Each tenant gets their own `phone_number_id` stored in `channel_connections.config`, providing quality-rating isolation.

**Primary recommendation:** Build memory first (Plan 02-01), then tools (Plan 02-02), then WhatsApp (Plan 02-03), then escalation (Plan 02-04). Memory and tools both extend the `handle_message` Celery task and the `run_agent` runner — get those stable before adding channel complexity.

---

## Standard Stack

### Core (Phase 1 — already installed)

| Library | Version | Purpose | Notes |
|---------|---------|---------|-------|
| pgvector (Python) | 0.4.2 | Vector operations in SQLAlchemy | Already in stack; HNSW index creation needed in Phase 2 migrations |
| SQLAlchemy | 2.0.48 | ORM for new tables | All new tables use `Mapped[]` / `mapped_column()` style; RLS required |
| Redis | 7.x | Sliding window storage | `RPUSH` / `LTRIM` / `LRANGE` pattern for conversation history |
| Celery | 5.6.2 | Async embedding backfill tasks | All tasks sync `def` with `asyncio.run()` |
| httpx | latest | WhatsApp API calls, tool HTTP requests | Already used in orchestrator for LLM pool calls |
| LiteLLM | 1.82.5 (pinned) | Multimodal LLM calls for media interpretation | `supports_vision()` check; `image_url` content blocks |

### New Phase 2 Dependencies

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| sentence-transformers | 3.x | Local text embedding for memory | Runs in-process as a lightweight Python dependency (or via an Ollama container); `all-MiniLM-L6-v2` (384-dim) for dev, `text-embedding-3-small` via OpenAI API for production quality; avoids a separate embedding service |
| boto3 | 1.x | MinIO/S3 media storage | S3-compatible; works with MinIO via `endpoint_url`; presigned URLs for agent media delivery |
| google-api-python-client | 2.x | Google Calendar lookup tool | `calendar.readonly` scope; service account or per-tenant OAuth; `events.list()` for availability |
| brave-search | latest | Web search tool | Brave Search API; `BRAVE_API_KEY` env var; async-compatible; returns structured results; SOC 2 Type II (Oct 2025) |

### Installation (Phase 2 additions)

```bash
# From repo root
uv add sentence-transformers boto3 google-api-python-client google-auth brave-search

# Dev/test
uv add --dev pytest-mock moto   # moto for S3/MinIO mocking in tests
```

### Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| sentence-transformers (local) | OpenAI text-embedding-3-small via LiteLLM | OpenAI embedding has higher quality but adds cost and an API dependency; start with local for dev, add OpenAI as an optional upgrade in tenant config |
| Brave Search API | SerpAPI / Serper.dev | Brave is SOC 2 Type II, independent index (not Google-dependent), flat pricing; SerpAPI is higher quality but more expensive |
| Google Calendar API | Generic iCal parsing | Google Calendar covers the highest-value use case; generic iCal adds complexity without v1 validation |
| boto3 (MinIO) | minio-py client | boto3 is more widely understood; MinIO is S3-compatible; avoids maintaining two object storage SDKs |

---

## Architecture Patterns

### Recommended Structure (Phase 2 additions)

```
packages/orchestrator/orchestrator/
├── tasks.py                  # Extended: memory + tools + escalation + audit in pipeline
├── agents/
│   ├── builder.py            # Extended: memory injection into prompt
│   └── runner.py             # Extended: tool-call loop (reason → tool → observe → respond)
├── memory/
│   ├── __init__.py
│   ├── short_term.py         # Redis sliding window (LRANGE/RPUSH/LTRIM)
│   └── long_term.py          # pgvector embedding store + HNSW similarity search
├── tools/
│   ├── __init__.py
│   ├── registry.py           # Tool name → ToolDefinition mapping
│   ├── executor.py           # Schema validation → execution → audit log
│   └── builtins/
│       ├── __init__.py
│       ├── web_search.py     # Brave Search API
│       ├── kb_search.py      # pgvector knowledge base search
│       ├── http_request.py   # Guarded outbound HTTP (allowlist, size cap, confirm)
│       └── calendar_lookup.py  # Google Calendar readonly
├── escalation/
│   ├── __init__.py
│   └── handler.py            # Rule evaluation + Slack DM + context packaging
└── audit/
    ├── __init__.py
    └── logger.py             # Write to audit_events table (tenant-scoped)

packages/gateway/gateway/channels/
├── slack.py                  # Existing — no changes needed for Phase 2
└── whatsapp.py               # New: webhook handler, signature verify, normalize

packages/shared/shared/
├── models/
│   ├── message.py            # Extended: MediaAttachment model added to MessageContent
│   ├── tenant.py             # Extended: add tool_schemas JSON field to Agent if needed
│   ├── memory.py             # New: ConversationMessage, ConversationEmbedding ORM models
│   ├── audit.py              # New: AuditEvent ORM model
│   └── kb.py                 # New: KnowledgeBaseDocument, KBChunk ORM models
└── redis_keys.py             # Extended: add memory-specific key constructors

migrations/                   # New Alembic migrations:
                              # - conversation_messages (full history)
                              # - conversation_embeddings (HNSW-indexed vectors)
                              # - audit_events (immutable log)
                              # - kb_documents, kb_chunks (knowledge base)
                              # - REVOKE UPDATE/DELETE on audit_events
```

### Pattern 1: Two-Layer Memory Assembly

**What:** For every agent invocation, load the last N messages from Redis (sliding window — verbatim, fast) plus up to M semantically relevant past exchanges from pgvector (long-term recall — slower, richer). Combine into the LLM context window: `[system_prompt] + [pgvector retrieved context] + [sliding window recent messages] + [current message]`.

**When to use:** Every `handle_message` invocation. Short-term is mandatory. Long-term retrieval can be triggered when the conversation references past events, when the Redis window overflows, or on every call (simpler — pick based on latency budget).

**Sliding window size recommendation (Claude's discretion):** 20 messages. This covers roughly 5-10 back-and-forth exchanges; anything older is retrieved via vector search. Twenty messages of ~200 tokens each is well within any LLM context window.

**Vector similarity threshold recommendation (Claude's discretion):** cosine similarity >= 0.75 for retrieval. Return the top 3 past exchanges above the threshold. Below 0.75, don't inject stale, irrelevant context.

**Memory key per user per agent (per CONTEXT.md decision):**
```
{tenant_id}:memory:short:{agent_id}:{user_id}   # Redis list (sliding window)
{tenant_id}:memory:long:{agent_id}:{user_id}    # pgvector rows (filtered by these columns)
```
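
The matching key constructors for `shared/redis_keys.py` could be as simple as the following sketch (the `memory_short_key` name matches the import used in the short-term example; `escalation_key` is an assumed name for the Pattern 4 flag):

```python
# packages/shared/shared/redis_keys.py (Phase 2 additions, names assumed)

def memory_short_key(tenant_id: str, agent_id: str, user_id: str) -> str:
    """Redis list holding the verbatim sliding window for one user + agent."""
    return f"{tenant_id}:memory:short:{agent_id}:{user_id}"


def escalation_key(tenant_id: str, thread_id: str) -> str:
    """Escalation status flag checked by handle_message (Pattern 4)."""
    return f"{tenant_id}:escalation:{thread_id}"
```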

**Example:**
```python
# packages/orchestrator/orchestrator/memory/short_term.py
import json

from shared.redis_keys import memory_short_key


async def get_recent_messages(redis, tenant_id: str, agent_id: str, user_id: str, n: int = 20) -> list[dict]:
    key = memory_short_key(tenant_id, agent_id, user_id)
    raw = await redis.lrange(key, -n, -1)  # Last N items
    return [json.loads(m) for m in raw]


async def append_message(redis, tenant_id: str, agent_id: str, user_id: str, role: str, content: str, window: int = 20) -> None:
    key = memory_short_key(tenant_id, agent_id, user_id)
    msg = json.dumps({"role": role, "content": content})
    await redis.rpush(key, msg)
    await redis.ltrim(key, -window, -1)  # Keep only last N
    # NOTE: no TTL — indefinite retention per user decision
```

**Long-term retrieval with mandatory tenant filter:**
```python
# packages/orchestrator/orchestrator/memory/long_term.py
# Source: pgvector-python GitHub + PITFALLS.md
from sqlalchemy import text


async def retrieve_relevant(session, tenant_id: str, agent_id: str, user_id: str,
                            query_embedding: list[float], top_k: int = 3,
                            threshold: float = 0.75) -> list[str]:
    # CRITICAL: tenant_id filter BEFORE ANN search — ANN cannot prune by tenant
    result = await session.execute(
        text("""
            SELECT content, 1 - (embedding <=> :embedding) AS similarity
            FROM conversation_embeddings
            WHERE tenant_id = :tenant_id
              AND agent_id = :agent_id
              AND user_id = :user_id
              AND 1 - (embedding <=> :embedding) >= :threshold
            ORDER BY embedding <=> :embedding
            LIMIT :top_k
        """),
        {"embedding": str(query_embedding), "tenant_id": tenant_id,
         "agent_id": agent_id, "user_id": user_id,
         "threshold": threshold, "top_k": top_k}
    )
    return [row.content for row in result]
```
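
The table and HNSW index assumed by this query would be created in a Phase 2 Alembic migration. A sketch (column names match the query above; `vector(384)` assumes the all-MiniLM-L6-v2 embedding size):

```sql
-- conversation_embeddings migration sketch (illustrative)
CREATE TABLE conversation_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    agent_id UUID NOT NULL,
    user_id TEXT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(384) NOT NULL,   -- all-MiniLM-L6-v2 dimension
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Cosine-distance HNSW index; tenant filtering happens in SQL, not here
CREATE INDEX ON conversation_embeddings
    USING hnsw (embedding vector_cosine_ops);
-- B-tree index supporting the mandatory tenant/agent/user pre-filter
CREATE INDEX ON conversation_embeddings (tenant_id, agent_id, user_id);
```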

**Embedding backfill via Celery:**
```python
# In tasks.py — after the LLM response, dispatch the embedding task
embed_and_store.delay(
    tenant_id=str(msg.tenant_id),
    agent_id=str(agent.id),
    user_id=msg.sender.user_id,
    messages=[
        {"role": "user", "content": msg.content.text},
        {"role": "assistant", "content": response_text},
    ],
)
```
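
Putting the two layers together, the prompt-assembly step described at the top of this pattern can be sketched as a pure function (message shapes follow the sliding-window format; the recall header wording is illustrative):

```python
def assemble_context(
    system_prompt: str,
    retrieved: list[str],   # pgvector hits from retrieve_relevant
    recent: list[dict],     # Redis sliding window, oldest first
    current_text: str,
) -> list[dict]:
    """Build the LLM messages array: system + long-term recall + window + current."""
    system = system_prompt
    if retrieved:
        # Long-term context rides inside the system prompt so the model
        # treats it as background, not as dialogue to respond to.
        recall = "\n".join(f"- {r}" for r in retrieved)
        system += f"\n\nRelevant context from past conversations:\n{recall}"
    return (
        [{"role": "system", "content": system}]
        + recent
        + [{"role": "user", "content": current_text}]
    )
```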
### Pattern 2: Tool-Call Loop (Reason → Tool → Observe → Respond)

**What:** After the initial LLM call, check whether the response contains a `tool_calls` array. If yes, dispatch each tool call through the executor (validate schema, check authorization, execute, log). Append each tool result as a `tool`-role message and re-call the LLM. Repeat until the LLM returns a plain text response with no tool calls.

**When to use:** Every agent invocation that has tools assigned. The loop is bounded — set a max iteration count (default: 5) to prevent runaway chains.

**Tool definition format (LiteLLM uses the OpenAI function-calling schema):**
```python
# Source: LiteLLM docs — tools parameter
# packages/orchestrator/orchestrator/tools/registry.py
from typing import Any

from pydantic import BaseModel, ConfigDict


class ToolDefinition(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    name: str
    description: str
    parameters: dict[str, Any]            # JSON Schema
    requires_confirmation: bool = False   # True for side-effecting tools
    handler: Any = None                   # Callable (excluded from serialization)


WEB_SEARCH_TOOL = ToolDefinition(
    name="web_search",
    description="Search the web for current information. Use for facts, news, or anything not in the knowledge base.",
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query"}
        },
        "required": ["query"],
    },
    requires_confirmation=False,
)
```

**Executor with schema validation (prevents prompt injection):**
```python
# packages/orchestrator/orchestrator/tools/executor.py
import json
import time

import jsonschema

from .registry import ToolDefinition


async def execute_tool(tool_call: dict, registry: dict[str, ToolDefinition],
                       tenant_id: str, agent_id: str, audit_logger) -> str:
    tool_name = tool_call["function"]["name"]
    raw_args = tool_call["function"]["arguments"]  # LLM-generated JSON string

    tool_def = registry.get(tool_name)
    if not tool_def:
        raise ValueError(f"Unknown tool: {tool_name}")

    # Schema validation — treat LLM output as untrusted (PITFALLS.md #7)
    try:
        args = json.loads(raw_args)
        jsonschema.validate(args, tool_def.parameters)
    except (json.JSONDecodeError, jsonschema.ValidationError) as e:
        await audit_logger.log_tool_call(tool_name, raw_args, error=str(e))
        return f"Tool call failed: invalid arguments — {e}"

    # Authorization check — does this agent have this tool?
    # (checked before calling the executor; registry is already filtered to the agent's tools)

    start = time.monotonic()
    result = await tool_def.handler(**args)
    latency_ms = int((time.monotonic() - start) * 1000)

    await audit_logger.log_tool_call(
        tool_name=tool_name, args=args, result=result[:500],  # truncate for audit
        tenant_id=tenant_id, agent_id=agent_id, latency_ms=latency_ms,
    )
    return result
```
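
The bounded loop itself can be sketched with the LLM call and the executor injected as callables, so the shape is clear independent of LiteLLM specifics (the `llm_call`/`execute` signatures and the fallback message are illustrative, not an existing API):

```python
# packages/orchestrator/orchestrator/agents/runner.py (sketch, signatures assumed)
MAX_TOOL_ITERATIONS = 5  # bound the reason -> tool -> observe loop


async def run_tool_loop(messages: list[dict], llm_call, execute) -> str:
    """Repeat LLM call + tool execution until a plain text answer comes back.

    llm_call(messages) -> {"content": str | None, "tool_calls": list | None}
    execute(tool_call) -> str  (the schema-validating executor above)
    """
    for _ in range(MAX_TOOL_ITERATIONS):
        response = await llm_call(messages)
        tool_calls = response.get("tool_calls")
        if not tool_calls:
            return response["content"]  # plain answer, loop done
        # Echo the assistant turn, then append one tool result per call
        messages.append({"role": "assistant", "tool_calls": tool_calls, "content": None})
        for tc in tool_calls:
            result = await execute(tc)
            messages.append({"role": "tool", "tool_call_id": tc["id"], "content": result})
    return "I wasn't able to complete that within the allowed number of tool steps."
```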
### Pattern 3: WhatsApp Webhook (Meta Cloud API)

**What:** Two FastAPI routes: GET for the verification handshake, POST for inbound events. Signatures are verified via HMAC-SHA256 over the raw body bytes before JSON parsing. Outbound messages go via httpx POST to `https://graph.facebook.com/v20.0/{phone_number_id}/messages`.

**Tenant resolution:** The `phone_number_id` in the webhook metadata maps to a `channel_connections` row with `workspace_id = phone_number_id`, following the exact same pattern as the Slack `workspace_id` lookup.

**Per-tenant isolation:** Each tenant's WhatsApp connection stores its `phone_number_id` and `access_token` in `channel_connections.config`. Never shared across tenants.

**Business-function scoping gate (two-tier, per CONTEXT.md):**
```python
# Tier 1: keyword/pattern allowlist check (fast, no LLM cost).
# Illustrative heuristic only; real matching is tenant-configurable.
def is_clearly_off_topic(text: str, allowed_functions: list[str]) -> bool:
    # Simple keyword-overlap check against the allowed_functions list.
    # If clearly unrelated, return True -> canned rejection, no LLM call.
    words = set(text.lower().split())
    return not any(
        any(keyword in words for keyword in func.lower().split())
        for func in allowed_functions
    )

# Tier 2: for borderline messages, let the LLM decide with a scoping system prompt.
# The system prompt includes: "You only handle: {allowed_functions}. If not
# applicable, respond with the redirect message."
```

**WhatsApp signature verification (raw body BEFORE JSON parsing):**
```python
# packages/gateway/gateway/channels/whatsapp.py
import hashlib
import hmac

from fastapi import HTTPException, Request


async def verify_whatsapp_signature(request: Request, app_secret: str) -> bytes:
    sig_header = request.headers.get("X-Hub-Signature-256", "")
    raw_body = await request.body()  # Must read BEFORE parsing
    expected = "sha256=" + hmac.new(
        app_secret.encode(), raw_body, hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(sig_header, expected):
        raise HTTPException(status_code=403, detail="Invalid signature")
    return raw_body
```

**Webhook verification handshake:**
```python
from fastapi import HTTPException, Query
from fastapi.responses import PlainTextResponse


@router.get("/whatsapp/webhook")
async def whatsapp_verify(
    hub_mode: str = Query(alias="hub.mode"),
    hub_verify_token: str = Query(alias="hub.verify_token"),
    hub_challenge: str = Query(alias="hub.challenge"),
):
    if hub_mode == "subscribe" and hub_verify_token == settings.whatsapp_verify_token:
        return PlainTextResponse(hub_challenge)
    raise HTTPException(status_code=403)
```

**Media download pattern (inbound images/docs):**
```python
# WhatsApp doesn't send file bytes — it sends a media_id.
# Call GET /{media_id} to obtain a temporary download URL,
# then download the file and store it to MinIO with a per-tenant prefix.
import httpx


async def download_whatsapp_media(media_id: str, access_token: str) -> bytes:
    async with httpx.AsyncClient() as client:
        # Step 1: Get the download URL
        resp = await client.get(
            f"https://graph.facebook.com/v20.0/{media_id}",
            headers={"Authorization": f"Bearer {access_token}"},
        )
        url = resp.json()["url"]
        # Step 2: Download the actual file
        file_resp = await client.get(url, headers={"Authorization": f"Bearer {access_token}"})
        return file_resp.content
```
### Pattern 4: Human Escalation/Handoff

**What:** When escalation triggers (rule-based or user-requested), the agent packages the full conversation transcript, sends it as a DM to the configured human assignee via Slack, and marks the conversation as escalated in the DB. The agent remains in the thread as an assistant (it can answer factual questions from the human) but stops directing responses to the end user.

**Escalation state in Redis:** A key `{tenant_id}:escalation:{thread_id}` stores escalation status. `handle_message` checks this key first — if the conversation is escalated, the agent goes into "assistant mode" (answering questions from the human only, not the end user) or routes end-user messages to a "a human is handling this" auto-reply.

**Context packaging for the DM:**
```python
# packages/orchestrator/orchestrator/escalation/handler.py
import httpx


async def escalate_to_human(
    tenant_id: str, agent: Agent, thread_id: str,
    trigger_reason: str, recent_messages: list[dict],
    assignee_slack_user_id: str, bot_token: str,
) -> None:
    # Build a formatted transcript
    transcript = "\n".join(
        f"*{m['role'].capitalize()}:* {m['content']}"
        for m in recent_messages
    )

    dm_text = (
        f":rotating_light: *Escalation: {agent.name} needs human assistance*\n"
        f"*Reason:* {trigger_reason}\n"
        f"*Tenant:* {tenant_id}\n\n"
        f"*Conversation transcript:*\n{transcript}\n\n"
        f"The agent will stay in the thread. You can reply directly to the user."
    )

    # Open a DM channel and post — using httpx (no slack-bolt in the orchestrator)
    async with httpx.AsyncClient() as client:
        open_resp = await client.post(
            "https://slack.com/api/conversations.open",
            headers={"Authorization": f"Bearer {bot_token}"},
            json={"users": assignee_slack_user_id},
        )
        dm_channel = open_resp.json()["channel"]["id"]
        await client.post(
            "https://slack.com/api/chat.postMessage",
            headers={"Authorization": f"Bearer {bot_token}"},
            json={"channel": dm_channel, "text": dm_text},
        )
```
### Pattern 5: Immutable Audit Log

**What:** Every LLM call, tool invocation, and handoff event is written to the `audit_events` table. The table is protected at the PostgreSQL level — `REVOKE UPDATE, DELETE ON audit_events FROM konstruct_app` ensures no row can be modified or deleted by the application role. Inserts only.

**Audit table design:**
```sql
CREATE TABLE audit_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,      -- RLS-scoped
    agent_id UUID,
    user_id TEXT,
    action_type TEXT NOT NULL,    -- 'llm_call' | 'tool_invocation' | 'escalation'
    input_summary TEXT,
    output_summary TEXT,
    latency_ms INTEGER,
    metadata JSONB NOT NULL DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON audit_events (tenant_id, created_at DESC);
ALTER TABLE audit_events ENABLE ROW LEVEL SECURITY;
ALTER TABLE audit_events FORCE ROW LEVEL SECURITY;
-- Immutability: the application role cannot update or delete
REVOKE UPDATE, DELETE ON audit_events FROM konstruct_app;
```
### Pattern 6: KonstructMessage Media Extension

**What:** `MessageContent` needs a typed `MediaAttachment` model so images and documents are first-class. The existing `attachments: list[dict]` field is too loose for typed dispatch.

```python
# packages/shared/shared/models/message.py (extension)
from enum import StrEnum

from pydantic import BaseModel


class MediaType(StrEnum):
    IMAGE = "image"
    DOCUMENT = "document"
    AUDIO = "audio"
    VIDEO = "video"


class MediaAttachment(BaseModel):
    media_type: MediaType
    url: str | None = None           # Presigned MinIO URL or external URL
    storage_key: str | None = None   # MinIO key: {tenant_id}/{message_id}/(unknown)
    mime_type: str | None = None     # e.g. "image/jpeg", "application/pdf"
    filename: str | None = None
    size_bytes: int | None = None


class MessageContent(BaseModel):
    text: str
    html: str | None = None
    attachments: list[dict] = []        # Legacy — keep for backward compat
    media: list[MediaAttachment] = []   # New typed media list
    mentions: list[str] = []
```
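
Inbound media reaches the LLM as OpenAI-style `image_url` content blocks, the format LiteLLM accepts for vision-capable models. A sketch converting a list of attachment dicts into one user message (assumes presigned URLs are already populated; folding documents into a text note is a simplification, not a Phase 2 decision):

```python
def build_multimodal_user_message(text: str, media: list[dict]) -> dict:
    """OpenAI/LiteLLM-style content blocks: text first, then each image URL.

    `media` items mirror MediaAttachment fields: media_type, url, filename.
    """
    blocks: list[dict] = [{"type": "text", "text": text}]
    for m in media:
        if m.get("media_type") == "image" and m.get("url"):
            blocks.append({"type": "image_url", "image_url": {"url": m["url"]}})
        elif m.get("filename"):
            # Non-image media: surface as a text note (simplification)
            blocks.append({"type": "text",
                           "text": f"[User attached a file: {m['filename']}]"})
    return {"role": "user", "content": blocks}
```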
### Anti-Patterns to Avoid
|
||||||
|
|
||||||
|
- **Unfiltered vector search:** Never call `SELECT ... ORDER BY embedding <=> :embedding LIMIT 5` without a `WHERE tenant_id = :tenant_id` clause first. The HNSW index does not know about tenants.
|
||||||
|
- **Embedding in Celery task at request time:** Embedding inference (even local) takes 20-200ms. Do it asynchronously via a separate Celery task after the response is sent. Never block the LLM response pipeline.
|
||||||
|
- **Dumping all pgvector history into every prompt:** The user decision is "full history stored" — not "full history injected into every prompt." Inject only the recent sliding window plus top-K retrieved vectors. Injecting all stored messages violates context window limits and causes context rot.
|
||||||
|
- **LLM-generated tool arguments passed directly to tools:** Always `jsonschema.validate()` args before calling any tool handler. Raw LLM output is untrusted input.
|
||||||
|
- **async def Celery tasks for memory embedding:** The embedding backfill task is still a Celery task — must be `def` with `asyncio.run()`. See STATE.md architectural constraint.
|
||||||
|
- **Shared WhatsApp phone number across tenants:** Each tenant gets their own `phone_number_id`. Quality rating degradation from one tenant's behavior cannot affect others.
|
||||||
|
- **Storing raw media bytes in PostgreSQL:** Use MinIO/S3 with per-tenant key prefix `{tenant_id}/{agent_id}/{message_id}/`. Store only the key/presigned URL in the DB.

---

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| JSON Schema validation of LLM tool args | Custom arg parser | `jsonschema.validate()` (the `jsonschema` library) | Edge cases in LLM-generated JSON: nested objects, type coercion, missing required fields — jsonschema handles all of these |
| Embedding model serving | Custom inference server | `sentence-transformers` in-process or via Ollama `nomic-embed-text` | Embedding models are CPU-friendly; no separate service needed for dev/early prod |
| WhatsApp HMAC verification | Custom hash comparison | `hmac.compare_digest()` (Python stdlib) | Timing-safe comparison; rolling your own is vulnerable to timing attacks |
| Presigned URL generation for MinIO | Custom URL signing | `boto3.generate_presigned_url()` | AWS SigV4 signing is complex; boto3 handles all edge cases |
| Calendar availability parsing | Custom iCal parser | `google-api-python-client` `events.list()` | The Google Calendar API returns structured events; iCal parsing is a rabbit hole |
| Knowledge base text chunking | Custom splitter | `langchain_text_splitters.RecursiveCharacterTextSplitter` (or `tiktoken`-based) | Correct chunking respects sentence boundaries, token limits, and overlap — non-trivial to get right |

**Key insight:** The tool framework itself is custom (registry, executor, confirmation flow) — but each tool's underlying implementation should use proven libraries, not raw HTTP calls or custom parsers.

---

## Common Pitfalls

### Pitfall 1: pgvector ANN Without Tenant Filter (Cross-Tenant Leakage)

**What goes wrong:** Vector similarity search returns another tenant's conversation history because the HNSW index has no concept of tenant isolation.

**Why it happens:** HNSW is a global index over all embeddings. Without an explicit `WHERE tenant_id = ?` pre-filter, the nearest neighbor might belong to any tenant.

**How to avoid:** All pgvector queries MUST have `WHERE tenant_id = :tenant_id` (plus `agent_id` and `user_id`) before the ANN operator. Write an integration test with two tenants that verifies cross-contamination is impossible.

**Warning signs:** Any vector query that uses only the embedding operator `<=>` without column equality filters.

### Pitfall 2: Context Rot After 30+ Turns

**What goes wrong:** Agent quality degrades after extended conversations — it contradicts earlier statements, ignores established preferences, hallucinates details.

**Why it happens:** Without a sliding window, the entire conversation history is injected into every prompt. Token counts grow unbounded. Even large-context models show recall degradation above ~50% context window utilization.

**How to avoid:** Redis sliding window (last 20 messages verbatim) + pgvector retrieval (top-3 relevant). Never dump all stored messages into the prompt.

**Warning signs:** LLM token usage per request growing linearly with conversation length. Phase success criteria require testing at turn 30+.
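
The hybrid assembly this pitfall calls for can be sketched as a pure function — recent turns verbatim, retrieved snippets injected as a context block. Function and field names below are illustrative, not the codebase's actual API:

```python
def assemble_prompt(
    system_prompt: str,
    window: list[dict],    # last N turns from the Redis sliding window, oldest first
    retrieved: list[str],  # top-K relevant snippets from pgvector
) -> list[dict]:
    """Combine sliding window + retrieved context — never the full stored history."""
    messages = [{"role": "system", "content": system_prompt}]
    if retrieved:
        context = "\n".join(f"- {snippet}" for snippet in retrieved)
        messages.append({
            "role": "system",
            "content": f"Relevant context from earlier conversations:\n{context}",
        })
    messages.extend(window)  # verbatim recent turns only
    return messages
```

Because the window is capped and K is fixed, the prompt size stays bounded regardless of how long the stored conversation grows.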
### Pitfall 3: WhatsApp Business Account Suspension

**What goes wrong:** Platform-wide outage if a phone number's quality rating drops to Red — or worse, Meta suspends the number.

**Why it happens:** High user block rates, messages to non-opted-in users, volume spikes. One bad tenant can affect the number if numbers are shared.

**How to avoid:** Per-tenant phone numbers (already a locked decision). Never initiate outbound messages outside of approved templates. Rate-limit outbound per tenant below 80% of tier cap.

**Warning signs:** Quality rating in Meta Business Manager moving from Green → Yellow.

### Pitfall 4: WhatsApp 2026 Policy Violation

**What goes wrong:** Meta flags the platform for running "general-purpose AI chatbots." This can result in a policy strike or WABA suspension.

**Why it happens:** The Meta Jan 2026 policy bans bots that "answer arbitrary questions" rather than serving specific business functions.

**How to avoid:** Two-tier business-function gate (allowlist check + scoped LLM). The agent's system prompt must include the allowed functions. An off-topic canned response must be used for clearly unrelated queries. Agents must "maintain clear handoff options to a human agent."

**Warning signs:** Agent responding to clearly personal/off-topic queries without redirecting; missing escalation path to human.
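
One way the first tier of the gate could look — a cheap keyword allowlist checked before any LLM call, with the scoped system prompt (tier 2) still applied to anything that passes. The function categories and keywords here are placeholders, not a proposed final list:

```python
# Tier 1 of the business-function gate: pre-LLM keyword allowlist.
# Categories and keywords are illustrative assumptions.
ALLOWED_FUNCTION_KEYWORDS = {
    "appointment_booking": ("book", "appointment", "reschedule", "availability"),
    "order_support": ("order", "refund", "delivery", "invoice"),
}


def passes_function_gate(text: str, enabled_functions: set[str]) -> bool:
    """True if the message plausibly relates to an enabled business function."""
    lowered = text.lower()
    return any(
        keyword in lowered
        for function in enabled_functions
        for keyword in ALLOWED_FUNCTION_KEYWORDS.get(function, ())
    )
```

Messages failing the gate get the canned off-topic response (with a human-handoff option) and never reach the LLM.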
### Pitfall 5: Raw Request Body Consumed Before Signature Verification

**What goes wrong:** FastAPI/Pydantic automatically parses the request JSON body before the route handler runs. If you read the parsed body and try to re-encode it to verify the HMAC signature, the result will not match (whitespace differences, key ordering).

**Why it happens:** Standard FastAPI route declaration with `body: SomeModel` causes body consumption. The raw bytes needed for HMAC are gone.

**How to avoid:** In the WhatsApp webhook route, read `await request.body()` explicitly BEFORE any JSON parsing. Pass raw bytes to the HMAC verification function. Then parse JSON separately.

**Warning signs:** HMAC verification always failing even with correct secrets.
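
The verification itself is a few lines of stdlib code, provided it operates on the raw bytes. Meta sends the signature in the `X-Hub-Signature-256` header as `sha256=<hexdigest>`:

```python
import hashlib
import hmac


def verify_whatsapp_signature(app_secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Verify X-Hub-Signature-256 against raw (unparsed) request bytes."""
    if not signature_header.startswith("sha256="):
        return False
    expected = hmac.new(app_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # Timing-safe comparison — never use == for signature checks.
    return hmac.compare_digest(expected, signature_header.removeprefix("sha256="))
```

In the route handler, call this with the result of `await request.body()` before handing the bytes to a JSON parser.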
### Pitfall 6: Embedding Backfill Blocking Agent Response

**What goes wrong:** Embedding inference takes 20-200ms for local models. If done synchronously in `handle_message`, it adds latency to every user interaction.

**Why it happens:** Developers add embedding inline since it needs the message content.

**How to avoid:** Embedding is a fire-and-forget Celery task dispatched AFTER the response is sent to the user. The Redis sliding window is updated synchronously (fast — just a list push). The pgvector store lags by one Celery task execution time — acceptable given the use case.

**Warning signs:** LLM response latency increasing; `embed_and_store` task appearing in the critical path.
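
The task shape this implies, combined with the sync-`def` constraint from the anti-patterns list, can be sketched as below. In the real codebase the outer function would carry the Celery decorator (e.g. `@app.task`); here it is a plain function so the pattern stands alone, and the async body is a placeholder:

```python
import asyncio


async def _embed_and_store(tenant_id: str, message_id: str, text: str) -> str:
    # Placeholder for: run the embedding model, INSERT into conversation_embeddings.
    return f"stored:{tenant_id}:{message_id}"


def embed_and_store(tenant_id: str, message_id: str, text: str) -> str:
    """Celery tasks must be sync `def`; bridge into async code explicitly."""
    return asyncio.run(_embed_and_store(tenant_id, message_id, text))
```

The handler dispatches this with `.delay(...)` after the response is sent, keeping embedding off the critical path.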
### Pitfall 7: Tool Calls With `requires_confirmation=True` Not Awaiting User Confirmation

**What goes wrong:** The agent books a calendar slot or fires an HTTP request without asking the user first, causing side effects the user didn't intend.

**Why it happens:** Confirmation logic is easy to skip when wiring up the tool loop.

**How to avoid:** In the tool executor, check `tool_def.requires_confirmation`. If True, stop the loop, send the confirmation message to the user ("I found a slot at 2pm Thursday — shall I book it?"), and store pending action state in Redis. Only execute on user affirmation.

**Warning signs:** Calendar bookings or HTTP POSTs appearing in audit log without a preceding user confirmation message.
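
A minimal sketch of that gate. The `store` dict stands in for Redis (in production this would be a `setex` on the `pending_tool_confirm` key with a TTL, per the open questions below); the tool-definition shape is illustrative:

```python
import json


def gate_tool_call(tool_def: dict, args: dict, store: dict, confirm_key: str):
    """Return ("execute", args) immediately for read-only tools,
    or ("ask_user", prompt) after parking the pending action."""
    if not tool_def.get("requires_confirmation", False):
        return ("execute", args)
    # Park the pending action so the next user "yes" can resume it.
    store[confirm_key] = json.dumps({"tool": tool_def["name"], "args": args})
    return ("ask_user", f"Shall I run {tool_def['name']} with {args}?")
```

On the next inbound message, the executor checks the pending key first: an affirmation executes the stored call, anything else discards it.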

---

## Code Examples

### Sliding Window Memory Keys

```python
# packages/shared/shared/redis_keys.py (extension)
# Source: Existing redis_keys.py pattern from Phase 1


def memory_short_key(tenant_id: str, agent_id: str, user_id: str) -> str:
    """Redis list key for conversation sliding window."""
    return f"{tenant_id}:memory:short:{agent_id}:{user_id}"


def escalation_status_key(tenant_id: str, thread_id: str) -> str:
    """Redis key indicating a conversation is escalated."""
    return f"{tenant_id}:escalation:{thread_id}"


def pending_tool_confirm_key(tenant_id: str, thread_id: str) -> str:
    """Redis key for pending tool confirmation (awaiting user yes/no)."""
    return f"{tenant_id}:tool_confirm:{thread_id}"
```
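
A sketch of how the window key is used. The fake client below implements just the redis-py list methods the pattern needs (`lpush`, `ltrim`, `lrange`), so the trimming logic is shown without a Redis server; in production a `redis.Redis` client drops in with the same method names:

```python
import json

WINDOW_SIZE = 20  # last N messages kept verbatim


class FakeRedisLists:
    """Minimal in-memory stand-in for redis-py list commands."""

    def __init__(self):
        self._data: dict[str, list[str]] = {}

    def lpush(self, key: str, value: str) -> None:
        self._data.setdefault(key, []).insert(0, value)

    def ltrim(self, key: str, start: int, stop: int) -> None:
        self._data[key] = self._data.get(key, [])[start : stop + 1]

    def lrange(self, key: str, start: int, stop: int) -> list[str]:
        return self._data.get(key, [])[start : stop + 1]


def push_turn(client, key: str, role: str, text: str) -> None:
    """LPUSH the new turn, then LTRIM to cap the window."""
    client.lpush(key, json.dumps({"role": role, "content": text}))
    client.ltrim(key, 0, WINDOW_SIZE - 1)


def read_window(client, key: str) -> list[dict]:
    """Oldest-first messages, ready for prompt assembly."""
    return [json.loads(raw) for raw in reversed(client.lrange(key, 0, WINDOW_SIZE - 1))]
```

The LPUSH + LTRIM pair keeps the key bounded at N entries no matter how long the conversation runs.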
### pgvector HNSW Index Migration

```sql
-- migrations/xxxx_phase2_memory.sql
-- Source: pgvector GitHub README + PITFALLS.md anti-pattern #5

CREATE TABLE conversation_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    agent_id UUID NOT NULL,
    user_id TEXT NOT NULL,            -- Channel-native user ID
    content TEXT NOT NULL,            -- The message text that was embedded
    role TEXT NOT NULL,               -- 'user' | 'assistant'
    embedding vector(384) NOT NULL,   -- all-MiniLM-L6-v2 dimension
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- MANDATORY: HNSW index with tenant pre-filter
-- Partial indexes per tenant are not feasible (dynamic tenants).
-- Instead: filter tenant_id first, then ANN — pgvector 0.8.0's iterative
-- index scans make this kind of filtered query efficient.
CREATE INDEX ON conversation_embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Covering index for tenant+agent+user pre-filter
CREATE INDEX ON conversation_embeddings (tenant_id, agent_id, user_id, created_at DESC);

ALTER TABLE conversation_embeddings ENABLE ROW LEVEL SECURITY;
ALTER TABLE conversation_embeddings FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON conversation_embeddings
    USING (tenant_id = current_setting('app.current_tenant')::uuid);
```
### LiteLLM Multimodal Call (for document/image interpretation)

```python
# packages/orchestrator/agents/runner.py (extension)
# Source: docs.litellm.ai/docs/completion/vision

import litellm


async def run_agent_with_media(msg: KonstructMessage, agent: Agent) -> str:
    """Handle messages with media attachments via multimodal LLM."""
    content_blocks = [{"type": "text", "text": msg.content.text or ""}]

    for attachment in msg.content.media:
        if attachment.media_type in ("image", "document") and attachment.url:
            content_blocks.append({
                "type": "image_url",
                "image_url": {"url": attachment.url},  # Presigned MinIO URL
            })

    messages = [
        {"role": "system", "content": build_system_prompt(agent)},
        {"role": "user", "content": content_blocks},
    ]

    # Check if model supports vision; fall back to text-only if not
    if not litellm.supports_vision(model=agent.model_preference):
        messages[-1]["content"] = msg.content.text  # Degrade gracefully

    # POST to llm-pool /complete with multimodal messages
    ...
```
### WhatsApp Outbound Message

```python
# packages/gateway/gateway/channels/whatsapp.py
# Source: Meta WhatsApp Cloud API docs


async def send_whatsapp_message(
    phone_number_id: str,
    access_token: str,
    to_phone: str,
    text: str,
) -> None:
    async with httpx.AsyncClient(timeout=10.0) as client:
        response = await client.post(
            f"https://graph.facebook.com/v20.0/{phone_number_id}/messages",
            headers={"Authorization": f"Bearer {access_token}"},
            json={
                "messaging_product": "whatsapp",
                "recipient_type": "individual",
                "to": to_phone,
                "type": "text",
                "text": {"body": text, "preview_url": False},
            },
        )
        response.raise_for_status()  # Surface Graph API errors (rate limits, bad token)
```

---

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Dump all conversation history in every prompt | Sliding window + vector retrieval | 2024-2025 | Prevents context rot; controls token costs |
| pgvector IVFFlat index | pgvector HNSW index | pgvector 0.7.0 (2023) | Better speed-recall tradeoff; no need to rebuild index as data grows |
| One WhatsApp number per platform | Per-tenant phone numbers | 2024+ BSP recommendation | Isolates quality ratings; one bad tenant can't affect others |
| Meta allowed general-purpose chatbots | Business-function scoping required | January 15, 2026 | Agents must serve defined business purposes; open-ended assistants banned |
| Tool argument passed as raw string | JSON Schema validation before execution | 2025 OWASP GenAI Top 10 | Prevents prompt injection via tool arguments (CVSS 9.6 class vulnerability) |
| WhatsApp Cloud API charged per 24h window | Meta charges per delivered template message | July 2025 | Only affects outbound-initiated conversations; inbound-reply sessions still per-24h-window for service messages |

**Deprecated/outdated:**

- **WhatsApp On-Premises API:** Fully deprecated. Cloud API is the only supported path.
- **Socket Mode (Slack) in production:** Still not appropriate for production — webhook Events API only. No change from Phase 1.
- **IVFFlat index for pgvector:** Functionally superseded by HNSW for most use cases. Use HNSW from the start.

---

## Open Questions

1. **Embedding model for production**
   - What we know: `all-MiniLM-L6-v2` (384-dim) runs locally, is fast, and has adequate quality; `text-embedding-3-small` (OpenAI) has higher MTEB scores
   - What's unclear: Whether the quality difference matters for conversation recall at the scale of individual SMB deployments
   - Recommendation: Use `all-MiniLM-L6-v2` via sentence-transformers for Phase 2. Add OpenAI embeddings as an optional per-tenant upgrade in Phase 3 if quality complaints emerge.

2. **Pending tool confirmation UX**
   - What we know: Read-only tools execute immediately; side-effecting tools require confirmation
   - What's unclear: The timeout for a pending confirmation. If the user doesn't respond within N minutes, the pending action should expire.
   - Recommendation: Redis TTL of 10 minutes on `pending_tool_confirm` keys. After expiry, the agent informs the user the action was cancelled.

3. **WhatsApp per-tenant WABA vs shared WABA**
   - What we know: Per-tenant phone numbers are required (locked decision). Each phone number is registered under a WABA. Meta limits accounts to 2 phone numbers initially, scaling to 20 after verification, with higher limits available.
   - What's unclear: Whether each tenant needs their own WABA, or whether all tenants' numbers can live under Konstruct's WABA.
   - Recommendation: For Phase 2, use Konstruct's WABA with one phone number per tenant. This is the standard BSP (Business Solution Provider) model. Document the Meta limit (2 initially, 20 after verification) — Phase 2 is limited to 20 WhatsApp-enabled tenants until a higher limit is granted.

4. **Knowledge base URL ingestion depth**
   - What we know: URL ingestion (crawl, chunk, embed) is a locked decision for Phase 2
   - What's unclear: Single-page ingestion or full-site crawl? Crawl depth, max pages?
   - Recommendation: Single-page ingestion for Phase 2 (fetch one URL, extract text, chunk, embed). A full-site crawl is a Phase 3 enhancement.

---

## Validation Architecture

### Test Framework

| Property | Value |
|----------|-------|
| Framework | pytest + pytest-asyncio (already installed from Phase 1) |
| Config file | `pyproject.toml` (existing `[tool.pytest.ini_options]` section) |
| Quick run command | `pytest tests/unit/ -x -q` |
| Full suite command | `pytest tests/ -x` |

### Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|--------------|
| AGNT-02 | Sliding window returns last N messages from Redis | unit | `pytest tests/unit/test_memory_short_term.py -x` | ❌ Wave 0 |
| AGNT-02 | Messages beyond window N are not included in prompt | unit | `pytest tests/unit/test_memory_short_term.py::test_window_truncation -x` | ❌ Wave 0 |
| AGNT-03 | pgvector query includes tenant_id filter (no cross-tenant leak) | integration | `pytest tests/integration/test_memory_long_term.py::test_tenant_isolation -x` | ❌ Wave 0 |
| AGNT-03 | Vector retrieval returns semantically relevant past context | integration | `pytest tests/integration/test_memory_long_term.py::test_similarity_retrieval -x` | ❌ Wave 0 |
| AGNT-04 | Tool registry resolves tool name to handler | unit | `pytest tests/unit/test_tool_registry.py -x` | ❌ Wave 0 |
| AGNT-04 | Tool executor rejects invalid arguments (schema validation) | unit | `pytest tests/unit/test_tool_executor.py::test_schema_validation -x` | ❌ Wave 0 |
| AGNT-04 | Side-effecting tools require confirmation before execution | unit | `pytest tests/unit/test_tool_executor.py::test_confirmation_required -x` | ❌ Wave 0 |
| AGNT-04 | Every tool call is written to audit_events | integration | `pytest tests/integration/test_audit.py::test_tool_call_logged -x` | ❌ Wave 0 |
| AGNT-05 | Escalation rule match sends DM to assigned human | integration | `pytest tests/integration/test_escalation.py::test_dm_sent -x` | ❌ Wave 0 |
| AGNT-05 | Escalation DM includes full conversation transcript | unit | `pytest tests/unit/test_escalation.py::test_transcript_included -x` | ❌ Wave 0 |
| AGNT-06 | LLM calls are written to audit_events | integration | `pytest tests/integration/test_audit.py::test_llm_call_logged -x` | ❌ Wave 0 |
| AGNT-06 | Audit events are immutable (no UPDATE/DELETE possible) | integration | `pytest tests/integration/test_audit.py::test_immutable -x` | ❌ Wave 0 |
| CHAN-03 | WhatsApp webhook signature verification rejects tampered payload | unit | `pytest tests/unit/test_whatsapp_verify.py -x` | ❌ Wave 0 |
| CHAN-03 | WhatsApp message normalizes to KonstructMessage | unit | `pytest tests/unit/test_whatsapp_normalize.py -x` | ❌ Wave 0 |
| CHAN-04 | Clearly off-topic messages are rejected without LLM call | unit | `pytest tests/unit/test_whatsapp_scoping.py::test_allowlist_gate -x` | ❌ Wave 0 |
| CHAN-04 | Allowed business-function messages proceed to agent | unit | `pytest tests/unit/test_whatsapp_scoping.py::test_allowed_passes -x` | ❌ Wave 0 |

### Sampling Rate

- **Per task commit:** `pytest tests/unit/ -x -q`
- **Per wave merge:** `pytest tests/ -x`
- **Phase gate:** Full suite green before `/gsd:verify-work`

### Wave 0 Gaps

- [ ] `tests/unit/test_memory_short_term.py` — covers AGNT-02 sliding window
- [ ] `tests/integration/test_memory_long_term.py` — covers AGNT-03 vector retrieval + tenant isolation
- [ ] `tests/unit/test_tool_registry.py` — covers AGNT-04 registry lookup
- [ ] `tests/unit/test_tool_executor.py` — covers AGNT-04 schema validation + confirmation
- [ ] `tests/integration/test_audit.py` — covers AGNT-06 audit immutability
- [ ] `tests/unit/test_escalation.py` — covers AGNT-05 context packaging
- [ ] `tests/integration/test_escalation.py` — covers AGNT-05 DM delivery
- [ ] `tests/unit/test_whatsapp_verify.py` — covers CHAN-03 signature verification
- [ ] `tests/unit/test_whatsapp_normalize.py` — covers CHAN-03 normalization
- [ ] `tests/unit/test_whatsapp_scoping.py` — covers CHAN-04 business-function gate
- [ ] `tests/conftest.py` — extend with fixtures for second tenant, mock Redis, mock MinIO (moto)
- [ ] Install: `uv add --dev moto` (S3/MinIO mocking)

---

## Sources

### Primary (HIGH confidence)

- [pgvector GitHub README](https://github.com/pgvector/pgvector) — HNSW index syntax, `vector_cosine_ops`, tenant filter pattern
- [pgvector-python GitHub](https://github.com/pgvector/pgvector-python) — SQLAlchemy integration, embedding query examples
- [LiteLLM Vision docs](https://docs.litellm.ai/docs/completion/vision) — multimodal message format, `supports_vision()` API
- Phase 1 established patterns — `redis_keys.py` namespacing, `rls.py` hook, `tasks.py` sync `def` constraint

### Secondary (MEDIUM confidence)

- [respond.io — WhatsApp 2026 AI Policy Explained](https://respond.io/blog/whatsapp-general-purpose-chatbots-ban) — Meta Jan 15 2026 policy verified against multiple sources; business function requirements
- [Meta developer docs — Business phone numbers](https://developers.facebook.com/documentation/business-messaging/whatsapp/business-phone-numbers/phone-numbers) — WABA structure, phone number limits (2 initial, 20 after verification)
- [Crunchy Data — HNSW Indexes with pgvector](https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector) — HNSW vs IVFFlat tradeoffs, m/ef_construction parameters
- [sentence-transformers PyPI](https://pypi.org/project/sentence-transformers/) — all-MiniLM-L6-v2 dimensions (384), usage pattern
- [Brave Search API](https://brave.com/search/api/) — SOC 2 Type II (Oct 2025), MCP integration, Python library availability
- [elephas.app — Best Embedding Models 2026](https://elephas.app/blog/best-embedding-models) — Microsoft E5, OpenAI text-embedding-3-small comparison
- [WhatsApp Cloud API webhook guide — WASenderApi](https://wasenderapi.com/blog/how-to-receive-whatsapp-messages-via-webhook-the-ultimate-2025-guide) — payload structure, media_id download pattern

### Tertiary (LOW confidence — cross-verified with other sources)

- [gmcsco.com — WhatsApp Business API Compliance 2026](https://gmcsco.com/your-simple-guide-to-whatsapp-api-compliance-2026/) — policy details (consistent with respond.io primary source)
- [DEV Community — 3 Patterns That Fix LLM API Calling 2026](https://dev.to/docat0209/3-patterns-that-fix-llm-api-calling-stop-getting-hallucinated-parameters-4n3b) — schema validation patterns for tool args

---

## Metadata

**Confidence breakdown:**

- Standard stack: HIGH — all Phase 1 libraries already verified; new additions (sentence-transformers, boto3, google-api-python-client, brave-search) are well-established libraries with PyPI presence
- Architecture (memory): HIGH — pgvector HNSW + Redis sliding window is a documented production pattern; tenant filter requirement verified against pgvector docs and PITFALLS.md
- Architecture (WhatsApp): HIGH — webhook pattern mirrors existing Slack adapter; Meta policy changes verified against multiple sources
- Architecture (tools): HIGH — OpenAI function-calling schema is standard; Pydantic/jsonschema validation is established practice
- Architecture (escalation): MEDIUM — Slack DM delivery via httpx (no SDK) is a new pattern; tested conceptually against Phase 1 httpx usage in orchestrator
- Pitfalls: HIGH — cross-tenant pgvector leak and WhatsApp policy violations are documented in PITFALLS.md with production evidence

**Research date:** 2026-03-23
**Valid until:** 2026-06-23 (90 days — WhatsApp policy is stable post-Jan 2026 rollout; pgvector/LiteLLM APIs are stable)