docs(10): research phase agent capabilities
.planning/phases/10-agent-capabilities/10-RESEARCH.md

# Phase 10: Agent Capabilities - Research

**Researched:** 2026-03-26
**Domain:** Document ingestion pipeline, Google Calendar OAuth, web search activation, KB portal UI
**Confidence:** HIGH

---

<user_constraints>

## User Constraints (from CONTEXT.md)

### Locked Decisions

- **KB format support:** PDF, DOCX/Word, TXT, Markdown, CSV/Excel, PPT/PowerPoint, URLs (via Firecrawl), YouTube (transcript API + Whisper fallback)
- **KB scope:** Per-tenant — all agents in a tenant share the same knowledge base
- **KB portal:** Dedicated KB management page (not inline in Agent Designer)
  - Upload files (drag-and-drop + file picker)
  - Add URLs for scraping
  - Add YouTube URLs for transcription
  - View ingested documents with status (processing, ready, error)
  - Delete documents (removes chunks from pgvector)
  - Re-index option
- **Document processing:** Async/background via Celery — upload returns immediately
- **Processing status:** Visible in portal (progress indicator per document)
- **Web search:** Brave Search API already implemented in `web_search.py` — just needs `BRAVE_API_KEY` added to `.env`
- **HTTP request tool:** Already implemented — no changes needed
- **Calendar:** Google Calendar OAuth per tenant — tenant admin authorizes in portal; full CRUD for v1 (check availability, list upcoming events, create events); OAuth callback in portal; credentials stored encrypted via Fernet

### Claude's Discretion

- Web search: platform-wide vs per-tenant API key (recommend platform-wide)
- Chunking strategy (chunk size, overlap)
- Embedding model for KB (reuse all-MiniLM-L6-v2 or upgrade)
- Firecrawl integration approach (self-hosted vs cloud API)
- YouTube transcription: when to use existing captions vs Whisper
- Document size limits
- KB chunk deduplication strategy

### Deferred Ideas (OUT OF SCOPE)

None — discussion stayed within phase scope.

</user_constraints>

---

<phase_requirements>

## Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| CAP-01 | Web search tool returns real results from Brave Search | Tool already calls the Brave API — just needs the `BRAVE_API_KEY` env var set; `web_search.py` is production-ready |
| CAP-02 | KB tool searches tenant-scoped documents that have been uploaded, chunked, and embedded in pgvector | `kb_search.py` + `kb_chunks` table + HNSW index all exist; needs real chunk data from the ingestion pipeline |
| CAP-03 | Operators can upload documents (PDF, DOCX, TXT + more formats) via portal | Needs: new FastAPI `/api/portal/kb/*` router, Celery ingestion task, portal `/knowledge-base` page, per-format text extraction libraries |
| CAP-04 | HTTP request tool can call operator-configured URLs with response parsing and timeout handling | `http_request.py` is fully implemented — no code changes needed, only documentation |
| CAP-05 | Calendar tool can check Google Calendar availability | Stub in `calendar_lookup.py` must be replaced with a per-tenant OAuth token read + Google Calendar API call |
| CAP-06 | Tool results incorporated naturally into agent responses — no raw JSON | Agent runner already formats tool results as text strings; this is an LLM prompt quality concern, not architecture |
| CAP-07 | All tool invocations logged in audit trail with input parameters and output summary | `execute_tool()` in executor.py already calls `audit_logger.log_tool_call()` on every invocation — already satisfied |

</phase_requirements>

---

## Summary

Phase 10 has two distinct effort levels. CAP-01, CAP-04, CAP-07, and partially CAP-06 are already architecturally complete — they need configuration, environment variables, or documentation rather than new code. The heavy lifting is CAP-03 (document ingestion pipeline) and CAP-05 (Google Calendar OAuth per tenant).

The document ingestion pipeline is the largest deliverable: a multipart file upload endpoint, text extraction for seven format families, a chunking + embedding Celery task, MinIO storage for original files, status tracking on `kb_documents`, and a new portal page with drag-and-drop upload and live status polling. The KB table schema and pgvector HNSW index already exist from Phase 2 migration 004.

The Google Calendar integration requires replacing the service-account stub in `calendar_lookup.py` with a per-tenant OAuth token lookup (decrypted from the DB), building a Google OAuth initiation + callback endpoint pair in the gateway, storing encrypted access + refresh tokens per tenant, and expanding the calendar tool to support event creation in addition to reads. This follows the same HMAC-signed state + encrypted token storage pattern already used for Slack OAuth.

**Primary recommendation:** Build the document ingestion pipeline first (CAP-02/CAP-03), then Google Calendar OAuth (CAP-05), then wire CAP-01 via `.env` configuration.

---

## Standard Stack

### Core (Python backend)

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `pypdf` | >=4.0 | PDF text extraction | Pure Python, no C deps; fast and reliable for standard PDFs |
| `python-docx` | >=1.1 | DOCX text extraction | De facto standard library; handles paragraphs + tables |
| `python-pptx` | >=1.0 | PPT/PPTX text extraction | Standard library for PowerPoint; iterates slides/shapes |
| `openpyxl` | >=3.1 | XLSX text extraction | Likely already installed; reads cell values with `data_only=True` |
| `pandas` | >=2.0 | CSV + Excel parsing | Handles encodings, type coercion, multi-sheet Excel |
| `firecrawl-py` | >=1.0 | URL scraping to markdown | Returns clean LLM-ready markdown; handles JS rendering |
| `youtube-transcript-api` | >=1.2 | YouTube caption extraction | No API key needed; works with auto-generated captions |
| `google-api-python-client` | >=2.0 | Google Calendar API calls | Official Google client |
| `google-auth-oauthlib` | >=1.0 | Google OAuth 2.0 web flow | Handles code exchange and token refresh |

### Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| `aiofiles` | >=23.0 | Async file I/O in FastAPI upload handler | Prevents blocking the event loop during file writes |
| `python-multipart` | already installed (FastAPI dep) | Multipart form parsing for `UploadFile` | Required by FastAPI for file upload endpoints |

### Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| `pypdf` | `pymupdf4llm` | pymupdf4llm is faster and higher quality but carries AGPL license restrictions |
| `pypdf` | `pdfplumber` | pdfplumber is better for tables but roughly 4x slower; pypdf is sufficient for KB ingestion |
| `firecrawl-py` (cloud API) | Self-hosted Firecrawl | Self-hosted has full feature parity via Docker but adds infrastructure overhead; the cloud API is simpler for v1 |
| `youtube-transcript-api` | `openai-whisper` | Whisper requires a model download + GPU; use youtube-transcript-api first and fall back to Whisper only when captions are unavailable |
| Simple text chunking | `langchain-text-splitters` | langchain-text-splitters adds a large dependency for what is ~20 lines of custom code; write a simple sliding-window chunker inline |

**Installation:**

```bash
# Orchestrator: document processing + Google Calendar
uv add --project packages/orchestrator \
  pypdf python-docx python-pptx openpyxl pandas \
  firecrawl-py youtube-transcript-api \
  google-api-python-client google-auth-oauthlib

# Gateway: file upload endpoint (python-multipart already installed via FastAPI)
# No additional deps needed for gateway.

# Status column on kb_documents is handled in a new Alembic migration, not here.
```

---

## Architecture Patterns

### Recommended Project Structure (new files this phase)

```
packages/
├── orchestrator/orchestrator/
│   ├── tasks.py                    # Add: ingest_document Celery task
│   └── tools/builtins/
│       └── calendar_lookup.py      # Replace stub with OAuth token lookup + full CRUD
├── shared/shared/
│   ├── api/
│   │   ├── kb.py                   # New: KB management router (upload, list, delete)
│   │   └── calendar_auth.py        # New: Google Calendar OAuth initiation + callback
│   └── models/
│       └── kb.py                   # Extend: add status + error_message columns
migrations/versions/
└── 013_kb_document_status.py       # New: add status + error_message to kb_documents
packages/portal/app/(dashboard)/
└── knowledge-base/
    └── page.tsx                    # New: KB management page
```

### Pattern 1: Document Ingestion Pipeline (CAP-02/CAP-03)

**What:** Upload returns immediately (201); a Celery task handles text extraction → chunking → embedding → pgvector insert asynchronously.

**When to use:** All document types (file, URL, YouTube).

```
POST /api/portal/kb/upload (multipart file)
  → Save file to MinIO (kb-documents bucket)
  → Insert KbDocument with status='processing'
  → Return 201 with document ID
  → [async] ingest_document.delay(document_id, tenant_id)
      → Extract text (format-specific extractor)
      → Chunk text (500 chars, 50 char overlap)
      → embed_texts(chunks) in batch
      → INSERT kb_chunks rows
      → UPDATE kb_documents SET status='ready'
      → On error: UPDATE kb_documents SET status='error', error_message=...

GET /api/portal/kb/{tenant_id}/documents
  → List KbDocument rows with status field for portal polling

DELETE /api/portal/kb/{document_id}
  → DELETE KbDocument (CASCADE deletes kb_chunks via FK)
  → DELETE file from MinIO
```

**Migration 013 needed — add to `kb_documents`:**

```sql
-- status: processing | ready | error
ALTER TABLE kb_documents ADD COLUMN status TEXT NOT NULL DEFAULT 'processing';
ALTER TABLE kb_documents ADD COLUMN error_message TEXT;
ALTER TABLE kb_documents ADD COLUMN chunk_count INTEGER;
```

Note: `kb_documents.agent_id` is `NOT NULL` in the existing schema, but KB is now tenant-scoped (all agents share it). Resolution: either use a sentinel UUID (e.g., the all-zeros UUID) or make `agent_id` nullable in migration 013. Making it nullable is cleaner.

### Pattern 2: Text Extraction by Format

```python
# Source: standard library usage — no external doc needed
import io


def extract_text(file_bytes: bytes, filename: str) -> str:
    ext = filename.lower().rsplit(".", 1)[-1]

    if ext == "pdf":
        from pypdf import PdfReader
        reader = PdfReader(io.BytesIO(file_bytes))
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    if ext == "docx":
        from docx import Document
        doc = Document(io.BytesIO(file_bytes))
        return "\n".join(p.text for p in doc.paragraphs)

    if ext == "pptx":
        from pptx import Presentation
        prs = Presentation(io.BytesIO(file_bytes))
        lines = [
            shape.text
            for slide in prs.slides
            for shape in slide.shapes
            if hasattr(shape, "text")
        ]
        return "\n".join(lines)

    if ext in ("xlsx", "xls"):
        import pandas as pd
        # Note: .xls requires the xlrd engine; .xlsx uses openpyxl
        df = pd.read_excel(io.BytesIO(file_bytes))
        return df.to_csv(index=False)

    if ext in ("csv", "txt", "md"):
        return file_bytes.decode("utf-8", errors="replace")

    raise ValueError(f"Unsupported file extension: {ext}")
```
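Document size limits are left to Claude's discretion above; a pre-dispatch guard in the upload endpoint might look like the following sketch (the 25 MB cap and the exact extension list are illustrative assumptions, not locked decisions):

```python
ALLOWED_EXTENSIONS = {"pdf", "docx", "pptx", "xlsx", "xls", "csv", "txt", "md"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # placeholder cap, tune per deployment


def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject unsupported formats and oversized files before dispatching to Celery."""
    ext = filename.lower().rsplit(".", 1)[-1] if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file extension: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"File exceeds {MAX_UPLOAD_BYTES // (1024 * 1024)} MB limit")
```

Failing fast here keeps unprocessable files out of MinIO and the Celery queue entirely.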

### Pattern 3: Chunking Strategy (Claude's Discretion)

**Recommendation:** A simple sliding-window chunker with `chunk_size=500, overlap=50` (characters, not tokens). This fits comfortably within the `all-MiniLM-L6-v2` model's effective input length (~256 tokens ≈ ~1000 chars).

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return [c.strip() for c in chunks if c.strip()]
```

No external library needed. `langchain-text-splitters` would add ~50MB of dependencies for this single use case.
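KB chunk deduplication is also a discretion item; one low-cost option is an exact-match content hash applied after chunking and before embedding. A sketch, using whitespace-normalized SHA-256 digests (the function name is illustrative):

```python
import hashlib


def dedup_chunks(chunks: list[str]) -> list[str]:
    """Drop exact-duplicate chunks (modulo whitespace) before embedding."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        # Normalize whitespace so reflowed copies of the same text collapse
        digest = hashlib.sha256(" ".join(chunk.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

This catches boilerplate repeated across pages (headers, footers) cheaply; near-duplicate detection via embedding similarity could come later if needed.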

### Pattern 4: Google Calendar OAuth per Tenant (CAP-05)

**What:** Each tenant authorizes Konstruct to access their Google Calendar. OAuth tokens (access + refresh) are stored encrypted per tenant — either in a new `calendar_tokens` table or as a `google_calendar` entry in `channel_connections`, reusing the existing pattern.

**Reuse the `channel_connections` table:** Add a `channel_type = 'google_calendar'` entry per tenant. Store the encrypted token JSON in the `config` JSONB column. This avoids a migration for a new table.

```
GET /api/portal/calendar/install?tenant_id={id}
  → Generate HMAC-signed OAuth state (same generate_oauth_state() as Slack)
  → Return Google OAuth URL with state param

GET /api/portal/calendar/callback?code={code}&state={state}
  → Verify HMAC state → extract tenant_id
  → Exchange code for {access_token, refresh_token, expiry}
  → Encrypt token JSON with Fernet
  → Upsert ChannelConnection(channel_type='google_calendar', config={...})
  → Redirect to portal /settings/calendar?connected=true
```

**Google OAuth scopes needed (full CRUD per locked decision):**

```python
_GOOGLE_CALENDAR_SCOPES = [
    "https://www.googleapis.com/auth/calendar",  # Full read+write
]
# NOT readonly — creating events requires the full calendar scope
```

**`calendar_lookup.py` replacement — per-tenant token lookup:**

```python
async def calendar_lookup(
    date: str,
    action: str = "list",  # list | create | check_availability
    event_summary: str | None = None,
    event_start: str | None = None,  # ISO 8601 with timezone
    event_end: str | None = None,
    calendar_id: str = "primary",
    tenant_id: str | None = None,  # Injected by executor
    **kwargs: object,
) -> str:
    # 1. Load encrypted token from channel_connections
    # 2. Decrypt with KeyEncryptionService
    # 3. Build google.oauth2.credentials.Credentials from the token dict
    # 4. Auto-refresh if expired (google-auth handles this)
    # 5. Call the Calendar API (list or insert)
    # 6. Format the result as natural language
    ...
```

**Token refresh:** `google.oauth2.credentials.Credentials` auto-refreshes using the stored `refresh_token` when the `access_token` is expired. After any refresh, write the updated token back to `channel_connections.config`.
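For illustration, the HMAC-signed state used in the flow above can be reduced to a standalone sketch. The project's `generate_oauth_state()` in `channels.py` is the source of truth; the signing-key handling and the `tenant_id.timestamp.signature` layout below are assumptions:

```python
import hashlib
import hmac
import time

SECRET = b"replace-with-settings-secret"  # assumption: app-level signing key from settings


def generate_oauth_state(tenant_id: str) -> str:
    """Return 'tenant_id.timestamp.signature' for the OAuth state param."""
    payload = f"{tenant_id}.{int(time.time())}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def verify_oauth_state(state: str, max_age_s: int = 600) -> str:
    """Verify signature + freshness; return tenant_id or raise ValueError."""
    tenant_id, ts, sig = state.rsplit(".", 2)
    expected = hmac.new(SECRET, f"{tenant_id}.{ts}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad state signature")
    if time.time() - int(ts) > max_age_s:
        raise ValueError("state expired")
    return tenant_id
```

The key property: the callback can recover `tenant_id` from the unauthenticated redirect without trusting the caller.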

### Pattern 5: URL Ingestion via Firecrawl (CAP-03)

```python
from firecrawl import FirecrawlApp


def scrape_url(url: str) -> str:
    # Sync SDK call — run inside the Celery ingestion task, not on the FastAPI event loop
    app = FirecrawlApp(api_key=settings.firecrawl_api_key)
    result = app.scrape_url(url, params={"formats": ["markdown"]})
    return result.get("markdown", "")
```

**Claude's Discretion recommendation:** Use the Firecrawl cloud API for v1. Add `FIRECRAWL_API_KEY` to `.env`. Self-host only when data sovereignty requires it.

### Pattern 6: YouTube Ingestion (CAP-03)

```python
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter


def get_youtube_transcript(video_url: str) -> str:
    # Extract the video ID from the URL
    video_id = _extract_video_id(video_url)

    # Try to fetch existing captions (no API key needed)
    ytt_api = YouTubeTranscriptApi()
    try:
        transcript = ytt_api.fetch(video_id)
        formatter = TextFormatter()
        return formatter.format_transcript(transcript)
    except Exception:
        # Captions unavailable — v1 ships without a Whisper fallback
        raise ValueError("No captions available and Whisper not configured")
```

**Claude's Discretion recommendation:** For v1, skip Whisper entirely — only ingest YouTube videos that already have captions (auto-generated counts). Add Whisper as a future enhancement and return a user-friendly error when captions are unavailable.

### Anti-Patterns to Avoid

- **Synchronous text extraction in the FastAPI endpoint:** Extracting PDF/DOCX text blocks the event loop. Always delegate to the Celery task.
- **Storing raw file bytes in PostgreSQL:** Use MinIO for file storage; store only the MinIO key in `kb_documents`.
- **Re-embedding documents at search time:** Chunks are embedded once at ingestion; at query time only the incoming search query is embedded (already done in `kb_search.py`).
- **Loading SentenceTransformer per Celery task invocation:** Already solved via the lazy singleton in `embedder.py`. Import `embed_texts` from the same module.
- **Using a service account for Google Calendar:** The stub uses `GOOGLE_SERVICE_ACCOUNT_KEY`, which is wrong for per-tenant user data. Replace with per-tenant OAuth tokens.
- **Storing Google refresh tokens in env vars:** They must be per-tenant in the DB, encrypted with Fernet.
- **Keeping `agent_id` NOT NULL on KB documents:** KB is now tenant-scoped (per locked decision). Migration 013 must make `agent_id` nullable. The `kb_search.py` tool already accepts `agent_id` but does not filter by it.

---

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| PDF text extraction | Custom PDF parser | `pypdf` | The PDF binary format is extremely complex; pypdf handles encryption, compressed streams, multi-page |
| DOCX parsing | XML unzipper | `python-docx` | DOCX is a zip of XML schemas; python-docx handles versioning, embedded tables, styles |
| YouTube caption fetching | YouTube Data API scraper | `youtube-transcript-api` | No API key needed; handles 10+ subtitle track formats, works with auto-generated captions |
| OAuth token refresh | Custom token refresh logic | `google.oauth2.credentials.Credentials` | google-auth handles expiry, refresh, and HTTP headers automatically |
| URL → clean text | httpx + BeautifulSoup | `firecrawl-py` | Firecrawl handles JS rendering and anti-bot bypass, and returns clean markdown |
| Text chunking | Custom sentence splitter | Simple sliding-window char splitter (20 lines) | No library needed; langchain-text-splitters adds bloat for a single use case |

**Key insight:** Document parsing libraries handle edge cases that take months to rediscover (corrupted headers, nested tables, character encoding, password-protected files). The only thing worth writing custom is the chunking algorithm, which is genuinely trivial.

---

## Common Pitfalls

### Pitfall 1: `kb_documents.agent_id` is NOT NULL in Migration 004

**What goes wrong:** Inserting a KB document without an `agent_id` fails with a DB constraint error. The locked decision says KB is per-tenant (not per-agent), so there is no `agent_id` context at upload time.

**Why it happens:** The original Phase 2 schema assumed per-agent knowledge bases. The locked decision changed this to per-tenant.

**How to avoid:** Migration 013 must `ALTER TABLE kb_documents ALTER COLUMN agent_id DROP NOT NULL`. Update the ORM model in `shared/models/kb.py` to match.

**Warning signs:** `IntegrityError: null value in column "agent_id"` when uploading a KB document.

### Pitfall 2: Celery Tasks Must Be `def` (sync) Wrapping `asyncio.run()`

**What goes wrong:** Writing `async def ingest_document(...)` as a Celery task causes `RuntimeError: no running event loop` or a silent task hang.

**Why it happens:** Celery workers are not async-native. This is a hard architectural constraint documented in `tasks.py`.

**How to avoid:** `ingest_document` must be `def ingest_document(...)` and use `asyncio.run()` for any async DB operations.

**Warning signs:** Task appears in the Celery queue but never completes; no exception in logs.

### Pitfall 3: Google OAuth Callback Must Not Require Auth

**What goes wrong:** If the `/api/portal/calendar/callback` endpoint has `Depends(require_tenant_admin)`, Google's redirect fails because the callback request carries no session cookie.

**Why it happens:** OAuth callbacks are external redirects — they arrive unauthenticated.

**How to avoid:** The callback endpoint must be unauthenticated (no RBAC dependency). Tenant identity is recovered from the HMAC-signed `state` parameter, same as the Slack callback pattern in `channels.py`.

**Warning signs:** HTTP 401 or a redirect loop on the callback URL.

### Pitfall 4: Google Access Token Expiry + Write-Back

**What goes wrong:** A calendar tool call fails with 401 after the access token (1-hour TTL) expires, even though the refresh token is stored.

**Why it happens:** `google.oauth2.credentials.Credentials` auto-refreshes in-memory but does not persist the new token to the database.

**How to avoid:** After every Google API call, check `credentials.token` — if it changed (i.e., a refresh occurred), write the updated token JSON back to `channel_connections.config`. Compare the token before and after the call.

**Warning signs:** Calendar tool works once, then fails an hour later.
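One way to make the write-back hard to forget is a small wrapper around every Google API call. The helper below and its names are hypothetical, not existing code; it assumes only that `credentials` exposes a `.token` attribute that google-auth mutates on refresh:

```python
def call_with_token_writeback(credentials, api_call, persist_token):
    """Run a Google API call; if the access token was refreshed in-memory, persist it."""
    token_before = credentials.token
    result = api_call()
    if credentials.token != token_before:
        # A refresh occurred — write the updated token back to channel_connections
        persist_token(credentials)
    return result
```

Every calendar action then routes through one function, so the persistence rule lives in exactly one place.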

### Pitfall 5: pypdf Returns an Empty String for Scanned PDFs

**What goes wrong:** `page.extract_text()` returns `""` for image-based scanned PDFs. The document is ingested with zero chunks and returns no results in KB search.

**Why it happens:** pypdf only reads embedded text — it cannot OCR images.

**How to avoid:** After extraction, check whether the text is shorter than 100 characters. If so, set `status='error'` with `error_message="This PDF contains images only. Text extraction requires OCR, which is not yet supported."`.

**Warning signs:** Document status shows "ready" but KB search returns nothing.
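The length check can live in a tiny guard called right after extraction; the 100-character threshold mirrors the pitfall above and is an assumption, not a benchmark:

```python
MIN_EXTRACTED_CHARS = 100  # assumption: below this, treat the PDF as image-only


def check_extracted_text(text: str) -> None:
    """Flag likely image-only PDFs whose embedded text layer is effectively empty."""
    if len(text.strip()) < MIN_EXTRACTED_CHARS:
        raise ValueError(
            "This PDF contains images only. Text extraction requires OCR, "
            "which is not yet supported."
        )
```

The ingestion task catches this `ValueError` and surfaces it via `status='error'` + `error_message`.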

### Pitfall 6: `ChannelTypeEnum` Does Not Include `google_calendar`

**What goes wrong:** Inserting a `ChannelConnection` with `channel_type='google_calendar'` fails if `ChannelTypeEnum` only includes messaging channels.

**Why it happens:** `ChannelTypeEnum` was defined in Phase 1 for messaging channels only.

**How to avoid:** Check `shared/models/tenant.py`: if `ChannelTypeEnum` were backed by `sa.Enum`, adding a value would require a DB migration. Per the Phase 1 ADR, however, `channel_type` is stored as `TEXT` with a `CHECK` constraint, so adding new values is trivial — at most the CHECK constraint needs updating.

**Warning signs:** `LookupError` or `IntegrityError` when inserting the Google Calendar connection.

---

## Code Examples

### Upload Endpoint Pattern (FastAPI multipart)

```python
# Source: FastAPI official docs — https://fastapi.tiangolo.com/tutorial/request-files/
import uuid

from fastapi import Depends, File, UploadFile


@kb_router.post("/{tenant_id}/documents", status_code=201)
async def upload_document(
    tenant_id: uuid.UUID,
    file: UploadFile = File(...),
    caller: PortalCaller = Depends(require_tenant_admin),
    session: AsyncSession = Depends(get_session),
) -> dict:
    file_bytes = await file.read()
    # 1. Upload to MinIO
    # 2. Insert KbDocument(status='processing')
    # 3. ingest_document.delay(str(doc.id), str(tenant_id))
    # 4. Return 201 with doc.id
    ...
```

### Google Calendar Token Storage Pattern

```python
# Reuse the existing ChannelConnection + HMAC OAuth state from channels.py
# After the OAuth callback:
import json

token_data = {
    "token": credentials.token,
    "refresh_token": credentials.refresh_token,
    "token_uri": credentials.token_uri,
    "client_id": settings.google_client_id,
    "client_secret": settings.google_client_secret,
    "scopes": list(credentials.scopes),
    "expiry": credentials.expiry.isoformat() if credentials.expiry else None,
}
enc_svc = _get_encryption_service()
encrypted_token = enc_svc.encrypt(json.dumps(token_data))

conn = ChannelConnection(
    tenant_id=tenant_id,
    channel_type="google_calendar",  # TEXT column — no enum migration needed
    workspace_id=str(tenant_id),     # Sentinel: tenant ID as workspace ID
    config={"token": encrypted_token},
)
```
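A round-trip of the Fernet step, with a locally generated key standing in for the project's `KeyEncryptionService` (an assumption for illustration; in Konstruct the key comes from configuration, never generated inline):

```python
import json

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # illustration only — real key comes from KeyEncryptionService
fernet = Fernet(key)

token_data = {"token": "ya29.example", "refresh_token": "1//example"}

# Encrypt before writing to channel_connections.config
encrypted = fernet.encrypt(json.dumps(token_data).encode()).decode()

# Later, in calendar_lookup: decrypt and rebuild the dict
decrypted = json.loads(fernet.decrypt(encrypted.encode()).decode())
```

Storing the ciphertext as a string in JSONB keeps the column schema unchanged while keeping refresh tokens unreadable at rest.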

### Celery Ingestion Task Structure

```python
# Source: tasks.py architectural pattern (always sync def + asyncio.run())
@celery_app.task(bind=True, max_retries=3)
def ingest_document(self, document_id: str, tenant_id: str) -> None:
    """Background document ingestion — extract, chunk, embed, store."""
    try:
        asyncio.run(_ingest_document_async(document_id, tenant_id))
    except Exception as exc:
        asyncio.run(_mark_document_error(document_id, str(exc)))
        raise self.retry(exc=exc, countdown=60)
```

### Google Calendar Event Creation

```python
# Source: https://developers.google.com/workspace/calendar/api/guides/create-events
event_body = {
    "summary": event_summary,
    "start": {"dateTime": event_start, "timeZone": "UTC"},
    "end": {"dateTime": event_end, "timeZone": "UTC"},
}
event = service.events().insert(calendarId="primary", body=event_body).execute()
return f"Event created: {event.get('summary')} at {event.get('start', {}).get('dateTime')}"
```

---

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `calendar_lookup.py` uses a service account (global) | Per-tenant OAuth tokens (per locked decision) | Phase 10 | Agents access each tenant's own calendar, not a shared service account |
| KB is per-agent (`agent_id NOT NULL`) | KB is per-tenant (`agent_id` nullable) | Phase 10 locked decision | All agents in a tenant share one knowledge base |
| `youtube-transcript-api` v0.x synchronous only | v1.2.4 (Jan 2026) uses a `YouTubeTranscriptApi()` instance | 2025 | Minor API change — instantiate the class, call `.fetch(video_id)` |

**Deprecated/outdated:**

- `calendar_lookup.py` service account path: to be replaced entirely. The `GOOGLE_SERVICE_ACCOUNT_KEY` env var check should be removed.
- `agent_id NOT NULL` on `kb_documents`: migration 013 removes this constraint.

---

## Open Questions

1. **Firecrawl API key management**
   - What we know: the `firecrawl-py` SDK connects to the cloud API by default; a self-hosted option is available
   - What's unclear: whether to add `FIRECRAWL_API_KEY` as a platform-wide setting in `shared/config.py` or as a tenant BYO credential
   - Recommendation: add as platform-wide `FIRECRAWL_API_KEY` in `settings` (same pattern as `BRAVE_API_KEY`); make it optional with graceful degradation

2. **`ChannelTypeEnum` compatibility for `google_calendar`**
   - What we know: the Phase 1 ADR chose `TEXT + CHECK` over `sa.Enum` to avoid migration DDL conflicts
   - What's unclear: whether there is a CHECK constraint that needs updating, or whether it is open TEXT
   - Recommendation: inspect the `channel_connections` table DDL in migration 001 before writing migration 013

3. **Document re-index flow**
   - What we know: CONTEXT.md mentions a re-index option in the KB portal
   - What's unclear: whether re-index deletes all existing chunks first or appends
   - Recommendation: delete all `kb_chunks` for the document, then re-run `ingest_document.delay()` — simplest and idempotent

4. **Whisper fallback for YouTube**
   - What we know: `openai-whisper` requires a model download (~140MB minimum) and a GPU for reasonable speed
   - What's unclear: whether v1 should include Whisper at all given the infrastructure cost
   - Recommendation: omit Whisper for v1; return an error when captions are unavailable; add to v2 requirements

---

## Validation Architecture

### Test Framework

| Property | Value |
|----------|-------|
| Framework | pytest + pytest-asyncio (existing) |
| Config file | `pytest.ini` or `pyproject.toml [tool.pytest]` at repo root |
| Quick run command | `pytest tests/unit -x -q` |
| Full suite command | `pytest tests/unit tests/integration -x -q` |

### Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|--------------|
| CAP-01 | `web_search()` returns Brave results when the key is set; gracefully degrades when the key is missing | unit | `pytest tests/unit/test_web_search.py -x` | ❌ Wave 0 |
| CAP-02 | `kb_search()` returns ranked chunks for a query after ingestion | integration | `pytest tests/integration/test_kb_search.py -x` | ❌ Wave 0 |
| CAP-03 | File upload endpoint accepts PDF/DOCX/TXT, creates a KbDocument with status=processing, triggers the Celery task | unit+integration | `pytest tests/unit/test_kb_upload.py tests/integration/test_kb_ingestion.py -x` | ❌ Wave 0 |
| CAP-04 | `http_request()` returns the correct response; rejects invalid methods; handles timeouts | unit | `pytest tests/unit/test_http_request.py -x` | ❌ Wave 0 |
| CAP-05 | Calendar tool reads the tenant token from the DB, calls the Google API, returns formatted events | unit (mock Google) | `pytest tests/unit/test_calendar_lookup.py -x` | ❌ Wave 0 |
| CAP-06 | Tool results in agent responses are natural language, not raw JSON | unit (prompt check) | `pytest tests/unit/test_tool_response_format.py -x` | ❌ Wave 0 |
| CAP-07 | Every tool invocation writes an audit_events row with tool name + args summary | integration | Covered by existing `tests/integration/test_audit.py` — extend with tool invocation cases | ✅ (extend) |

### Sampling Rate

- **Per task commit:** `pytest tests/unit -x -q`
- **Per wave merge:** `pytest tests/unit tests/integration -x -q`
- **Phase gate:** full suite green before `/gsd:verify-work`

### Wave 0 Gaps

- [ ] `tests/unit/test_web_search.py` — covers CAP-01 (mock httpx; test key-missing degradation + success path)
- [ ] `tests/unit/test_kb_upload.py` — covers CAP-03 upload endpoint (mock MinIO, mock Celery task dispatch)
- [ ] `tests/unit/test_kb_ingestion.py` — covers text extraction functions per format (PDF, DOCX, TXT, CSV)
- [ ] `tests/integration/test_kb_search.py` — covers CAP-02 (real pgvector; insert test chunks, verify similarity search)
- [ ] `tests/integration/test_kb_ingestion.py` — covers CAP-03 end-to-end (upload → task → chunks in DB)
- [ ] `tests/unit/test_http_request.py` — covers CAP-04 (mock httpx; test method validation, timeout)
- [ ] `tests/unit/test_calendar_lookup.py` — covers CAP-05 (mock Google API, mock DB token lookup)

---

## Sources

### Primary (HIGH confidence)

- FastAPI official docs (https://fastapi.tiangolo.com/tutorial/request-files/) — UploadFile pattern
- Google Calendar API docs (https://developers.google.com/workspace/calendar/api/guides/create-events) — event creation
- Google OAuth 2.0 web server docs (https://developers.google.com/identity/protocols/oauth2/web-server) — token exchange flow
- Existing codebase: `packages/orchestrator/orchestrator/tools/builtins/` — 4 tool files reviewed
- Existing codebase: `migrations/versions/004_phase2_audit_kb.py` — KB schema confirmed
- Existing codebase: `packages/shared/shared/api/channels.py` — Slack OAuth HMAC pattern to reuse
- Existing codebase: `packages/orchestrator/orchestrator/tools/executor.py` — CAP-07 already implemented

### Secondary (MEDIUM confidence)

- PyPI: `youtube-transcript-api` v1.2.4 (Jan 2026) — version + API confirmed
- PyPI: `firecrawl-py` — cloud + self-hosted documented
- Web search 2025: pypdf for PDF extraction — confirmed as the lightweight, no-C-deps option
- Web search 2025: Celery sync-def constraint confirmed via tasks.py docstring cross-reference

### Tertiary (LOW confidence)

- Chunking parameters (500 chars, 50 overlap) — from community RAG practice, not benchmarked for this dataset
- Firecrawl cloud vs self-hosted recommendation — based on project stage, not a measured performance comparison

---

## Metadata

**Confidence breakdown:**

- Standard stack: HIGH — all libraries verified via PyPI + official docs
- Architecture: HIGH — pattern directly extends the existing Phase 1-3 Slack OAuth and Celery task patterns in the codebase
- Pitfalls: HIGH — the agent_id NOT NULL issue is verified directly from migration 004 source; token write-back behavior is documented in google-auth source
- Chunking strategy: MEDIUM — recommended values are community defaults, not project-specific benchmarks

**Research date:** 2026-03-26
**Valid until:** 2026-06-26 (stable domain; the Google OAuth API is very stable)