konstruct/.planning/phases/10-agent-capabilities/10-RESEARCH.md
Phase 10: Agent Capabilities - Research

Researched: 2026-03-26
Domain: Document ingestion pipeline, Google Calendar OAuth, web search activation, KB portal UI
Confidence: HIGH


<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

  • KB format support: PDF, DOCX/Word, TXT, Markdown, CSV/Excel, PPT/PowerPoint, URLs (via Firecrawl), YouTube (transcript API + Whisper fallback)
  • KB scope: Per-tenant — all agents in a tenant share the same knowledge base
  • KB portal: Dedicated KB management page (not inline in Agent Designer)
    • Upload files (drag-and-drop + file picker)
    • Add URLs for scraping
    • Add YouTube URLs for transcription
    • View ingested documents with status (processing, ready, error)
    • Delete documents (removes chunks from pgvector)
    • Re-index option
  • Document processing: Async/background via Celery — upload returns immediately
  • Processing status: Visible in portal (progress indicator per document)
  • Web search: Brave Search API already implemented in web_search.py — just needs BRAVE_API_KEY added to .env
  • HTTP request tool: Already implemented — no changes needed
  • Calendar: Google Calendar OAuth per tenant — tenant admin authorizes in portal; full CRUD for v1 (check availability, list upcoming events, create events); OAuth callback in portal; credentials stored encrypted via Fernet

Claude's Discretion

  • Web search: platform-wide vs per-tenant API key (recommend platform-wide)
  • Chunking strategy (chunk size, overlap)
  • Embedding model for KB (reuse all-MiniLM-L6-v2 or upgrade)
  • Firecrawl integration approach (self-hosted vs cloud API)
  • YouTube transcription: when to use existing captions vs OpenWhisper
  • Document size limits
  • KB chunk deduplication strategy

Deferred Ideas (OUT OF SCOPE)

None — discussion stayed within phase scope. </user_constraints>


<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|---|---|---|
| CAP-01 | Web search tool returns real results from Brave Search | Tool already calls Brave API — just needs BRAVE_API_KEY env var set; web_search.py is production-ready |
| CAP-02 | KB tool searches tenant-scoped documents that have been uploaded, chunked, and embedded in pgvector | kb_search.py + kb_chunks table + HNSW index all exist; needs real chunk data from the ingestion pipeline |
| CAP-03 | Operators can upload documents (PDF, DOCX, TXT + more formats) via portal | Needs: new FastAPI /api/portal/kb/* router, Celery ingestion task, portal /knowledge-base page, per-format text extraction libraries |
| CAP-04 | HTTP request tool can call operator-configured URLs with response parsing and timeout handling | http_request.py is fully implemented — no code changes needed, only documentation |
| CAP-05 | Calendar tool can check Google Calendar availability | Stub in calendar_lookup.py must be replaced with per-tenant OAuth token read + Google Calendar API call |
| CAP-06 | Tool results incorporated naturally into agent responses — no raw JSON | Agent runner already formats tool results as text strings; this is an LLM prompt quality concern, not architecture |
| CAP-07 | All tool invocations logged in audit trail with input parameters and output summary | execute_tool() in executor.py already calls audit_logger.log_tool_call() on every invocation — already satisfied |
</phase_requirements>

Summary

Phase 10 has two distinct effort levels. CAP-01, CAP-04, CAP-07, and partially CAP-06 are already architecturally complete — they need configuration, environment variables, or documentation rather than new code. The heavy lifting is CAP-03 (document ingestion pipeline) and CAP-05 (Google Calendar OAuth per tenant).

The document ingestion pipeline is the largest deliverable: a multipart file upload endpoint, text extraction for 7 format families, chunking + embedding Celery task, MinIO storage for original files, status tracking on kb_documents, and a new portal page with drag-and-drop upload and live status polling. The KB table schema and pgvector HNSW index already exist from Phase 2 migration 004.

The Google Calendar integration requires replacing the service-account stub in calendar_lookup.py with per-tenant OAuth token lookup (decrypt from DB), building a Google OAuth initiation + callback endpoint pair in the gateway, storing encrypted access+refresh tokens per tenant, and expanding the calendar tool to support event creation in addition to read. This follows the same HMAC-signed state + encrypted token storage pattern already used for Slack OAuth.

Primary recommendation: Build the document ingestion pipeline first (CAP-02/CAP-03), then Google Calendar OAuth (CAP-05), then wire CAP-01 via .env configuration.


Standard Stack

Core (Python backend)

| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| pypdf | >=4.0 | PDF text extraction | Pure Python, no C deps, fast, reliable for standard PDFs |
| python-docx | >=1.1 | DOCX text extraction | Official-style library, handles paragraphs + tables |
| python-pptx | >=1.0 | PPT/PPTX text extraction | Standard library for PowerPoint, iterates slides/shapes |
| openpyxl | >=3.1 | XLSX text extraction | Already likely installed; reads cell values with data_only=True |
| pandas | >=2.0 | CSV + Excel parsing | Handles encodings, type coercion, multi-sheet Excel |
| firecrawl-py | >=1.0 | URL scraping to markdown | Returns clean LLM-ready markdown, handles JS rendering |
| youtube-transcript-api | >=1.2 | YouTube caption extraction | No API key needed, works with auto-generated captions |
| google-api-python-client | >=2.0 | Google Calendar API calls | Official Google client |
| google-auth-oauthlib | >=1.0 | Google OAuth 2.0 web flow | Handles code exchange, token refresh |

Supporting

| Library | Version | Purpose | When to Use |
|---|---|---|---|
| aiofiles | >=23.0 | Async file I/O in FastAPI upload handler | Prevents blocking event loop during file writes |
| python-multipart | already installed (FastAPI dep) | Multipart form parsing for UploadFile | Required by FastAPI for file upload endpoints |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
|---|---|---|
| pypdf | pymupdf4llm | Faster and higher extraction quality, but carries AGPL license restrictions |
| pypdf | pdfplumber | Better for tables but ~4x slower; pypdf is sufficient for KB ingestion |
| firecrawl-py (cloud API) | Self-hosted Firecrawl | Self-hosted has full feature parity via Docker but adds infrastructure overhead; cloud API is simpler for v1 |
| youtube-transcript-api | openai-whisper | Whisper requires model download + GPU; use youtube-transcript-api first and fall back to Whisper only when captions are unavailable |
| Simple text chunking | langchain-text-splitters | Adds a large dependency for what is ~20 lines of custom code; write a simple sliding-window chunker inline |

Installation:

# Orchestrator: document processing + Google Calendar
uv add --project packages/orchestrator \
    pypdf python-docx python-pptx openpyxl pandas \
    firecrawl-py youtube-transcript-api \
    google-api-python-client google-auth-oauthlib

# Gateway: file upload endpoint (python-multipart already installed via FastAPI)
# No additional deps needed for gateway

# Add status column to kb_documents: handled in new Alembic migration

Architecture Patterns

packages/
├── orchestrator/orchestrator/
│   ├── tasks.py                         # Add: ingest_document Celery task
│   └── tools/builtins/
│       └── calendar_lookup.py           # Replace stub with OAuth token lookup + full CRUD
├── shared/shared/
│   ├── api/
│   │   ├── kb.py                        # New: KB management router (upload, list, delete)
│   │   └── calendar_auth.py             # New: Google Calendar OAuth initiation + callback
│   └── models/
│       └── kb.py                        # Extend: add status + error_message columns
migrations/versions/
└── 013_kb_document_status.py            # New: add status + error_message to kb_documents
packages/portal/app/(dashboard)/
└── knowledge-base/
    └── page.tsx                         # New: KB management page

Pattern 1: Document Ingestion Pipeline (CAP-02/CAP-03)

What: Upload returns immediately (201), a Celery task handles text extraction → chunking → embedding → pgvector insert asynchronously.

When to use: All document types (file, URL, YouTube).

POST /api/portal/kb/upload (multipart file)
  → Save file to MinIO (kb-documents bucket)
  → Insert KbDocument with status='processing'
  → Return 201 with document ID
  → [async] ingest_document.delay(document_id, tenant_id)
      → Extract text (format-specific extractor)
      → Chunk text (500 chars, 50 char overlap)
      → embed_texts(chunks) in batch
      → INSERT kb_chunks rows
      → UPDATE kb_documents SET status='ready'
      → On error: UPDATE kb_documents SET status='error', error_message=...

GET /api/portal/kb/{tenant_id}/documents
  → List KbDocument rows with status field for portal polling

DELETE /api/portal/kb/{document_id}
  → DELETE KbDocument (CASCADE deletes kb_chunks via FK)
  → DELETE file from MinIO
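The asynchronous branch of the flow above decomposes cleanly into a pure function with the side-effecting pieces injected, which keeps the Celery task thin and the pipeline unit-testable. A sketch (function and parameter names here are illustrative, not existing code; the real task also handles MinIO reads and status updates):

```python
import asyncio
from typing import Callable

async def run_ingestion(
    file_bytes: bytes,
    filename: str,
    extract: Callable[[bytes, str], str],
    chunk: Callable[[str], list[str]],
    embed: Callable[[list[str]], list[list[float]]],
) -> list[tuple[str, list[float]]]:
    """Extract text, chunk it, embed the chunks; returns (chunk, vector) pairs
    ready to INSERT into kb_chunks. DB and MinIO side effects stay in the caller."""
    text = extract(file_bytes, filename)
    chunks = chunk(text)
    vectors = embed(chunks)
    return list(zip(chunks, vectors))
```

With stubs injected, this shape is trivially testable without a database or object store.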

Migration 013 needed — add to kb_documents:

-- status: processing | ready | error
ALTER TABLE kb_documents ADD COLUMN status TEXT NOT NULL DEFAULT 'processing';
ALTER TABLE kb_documents ADD COLUMN error_message TEXT;
ALTER TABLE kb_documents ADD COLUMN chunk_count INTEGER;
-- KB is tenant-scoped per the locked decision: drop the per-agent constraint
ALTER TABLE kb_documents ALTER COLUMN agent_id DROP NOT NULL;

Note: kb_documents.agent_id is NOT NULL in the existing schema, but KB is now tenant-scoped (all agents share it). Resolution: make agent_id nullable in migration 013; that is cleaner than writing a sentinel all-zeros UUID into every row.

Pattern 2: Text Extraction by Format

# Source: standard library usage — no external doc needed
import io

def extract_text(file_bytes: bytes, filename: str) -> str:
    ext = filename.lower().rsplit(".", 1)[-1]

    if ext == "pdf":
        from pypdf import PdfReader
        reader = PdfReader(io.BytesIO(file_bytes))
        return "\n".join(p.extract_text() or "" for p in reader.pages)

    elif ext == "docx":
        from docx import Document
        doc = Document(io.BytesIO(file_bytes))
        return "\n".join(p.text for p in doc.paragraphs)

    elif ext == "pptx":
        from pptx import Presentation
        prs = Presentation(io.BytesIO(file_bytes))
        lines = []
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    lines.append(shape.text)
        return "\n".join(lines)

    elif ext in ("xlsx", "xls"):
        # Note: legacy .xls needs the xlrd engine, which is not in the stack above
        import pandas as pd
        df = pd.read_excel(io.BytesIO(file_bytes))
        return df.to_csv(index=False)

    elif ext in ("csv", "txt", "md"):
        return file_bytes.decode("utf-8", errors="replace")

    else:
        raise ValueError(f"Unsupported file extension: {ext}")

Pattern 3: Chunking Strategy (Claude's Discretion)

Recommendation: Simple sliding-window chunking with chunk_size=500, overlap=50 (characters, not tokens). This fits the all-MiniLM-L6-v2 model's effective input length (~256 tokens ≈ ~1000 chars) with room to spare.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return [c.strip() for c in chunks if c.strip()]

No external library needed. langchain-text-splitters would add ~50MB of dependencies for this single use case.
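For the chunk-deduplication discretion item, an exact-hash pass over normalized chunks is the cheapest option. A sketch (persisting a `content_hash` column would extend this across re-uploads, but that column is an assumption, not an existing schema field):

```python
import hashlib

def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop exact-duplicate chunks (after whitespace normalization), preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        # Normalize whitespace so reflowed copies of the same text collapse together
        digest = hashlib.sha256(" ".join(chunk.split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Run it between chunking and embedding; it also trims the embedding batch size for repetitive documents.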

Pattern 4: Google Calendar OAuth per Tenant (CAP-05)

What: Each tenant authorizes Konstruct to access their Google Calendar. OAuth tokens (access + refresh) stored encrypted in a new calendar_tokens DB table per tenant (or in channel_connections as a google_calendar entry — reuse existing pattern).

Reuse channel_connections table: Add channel_type = 'google_calendar' entry per tenant. Store encrypted token JSON in config JSONB column. This avoids a new migration for a new table.

GET /api/portal/calendar/install?tenant_id={id}
  → Generate HMAC-signed OAuth state (same generate_oauth_state() as Slack)
  → Return Google OAuth URL with state param

GET /api/portal/calendar/callback?code={code}&state={state}
  → Verify HMAC state → extract tenant_id
  → Exchange code for {access_token, refresh_token, expiry}
  → Encrypt token JSON with Fernet
  → Upsert ChannelConnection(channel_type='google_calendar', config={...})
  → Redirect to portal /settings/calendar?connected=true
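The existing generate_oauth_state() helper lives in channels.py and should be reused as-is. Illustratively (names and the secret-loading here are hypothetical, not the real helper's signature), an HMAC-signed state round-trip looks like:

```python
import base64
import hashlib
import hmac
import json
import time

_STATE_SECRET = b"replace-with-settings-secret"  # assumption: loaded from settings in Konstruct

def sign_state(tenant_id: str, ttl_seconds: int = 600) -> str:
    """Encode tenant_id + expiry, signed so the unauthenticated callback can trust it."""
    payload = base64.urlsafe_b64encode(
        json.dumps({"tenant_id": tenant_id, "exp": int(time.time()) + ttl_seconds}).encode()
    ).decode()
    sig = hmac.new(_STATE_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"

def verify_state(state: str) -> str:
    """Return the tenant_id if the signature is valid and unexpired."""
    payload, sig = state.rsplit(".", 1)
    expected = hmac.new(_STATE_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("Invalid OAuth state signature")
    data = json.loads(base64.urlsafe_b64decode(payload))
    if data["exp"] < time.time():
        raise ValueError("OAuth state expired")
    return data["tenant_id"]
```

This is why the callback endpoint can stay unauthenticated: tenant identity is recovered from the verified state, not from a session.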

Google OAuth scopes needed (FULL CRUD per locked decision):

_GOOGLE_CALENDAR_SCOPES = [
    "https://www.googleapis.com/auth/calendar",  # Full read+write
]
# NOT readonly — create events requires full calendar scope

calendar_lookup.py replacement — per-tenant token lookup:

async def calendar_lookup(
    date: str,
    action: str = "list",          # list | create | check_availability
    event_summary: str | None = None,
    event_start: str | None = None,  # ISO 8601 with timezone
    event_end: str | None = None,
    calendar_id: str = "primary",
    tenant_id: str | None = None,   # Injected by executor
    **kwargs: object,
) -> str:
    # 1. Load encrypted token from channel_connections
    # 2. Decrypt with KeyEncryptionService
    # 3. Build google.oauth2.credentials.Credentials from token dict
    # 4. Auto-refresh if expired (google-auth handles this)
    # 5. Call Calendar API (list or insert)
    # 6. Format result as natural language

Token refresh: google.oauth2.credentials.Credentials auto-refreshes using the stored refresh_token when access_token is expired. After any refresh, write the updated token back to channel_connections.config.

Pattern 5: URL Ingestion via Firecrawl (CAP-03)

from firecrawl import FirecrawlApp

def scrape_url(url: str) -> str:
    # Sync on purpose: this runs inside the (sync) Celery ingestion task,
    # and the FirecrawlApp SDK call is blocking anyway.
    app = FirecrawlApp(api_key=settings.firecrawl_api_key)
    result = app.scrape_url(url, params={"formats": ["markdown"]})
    # NOTE: the return shape differs across firecrawl-py major versions
    # (dict in 0.x, response object in 1.x) — verify against the pinned version.
    return result.get("markdown", "")

Claude's Discretion recommendation: Use Firecrawl cloud API for v1. Add FIRECRAWL_API_KEY to .env. Self-host only when data sovereignty is required.

Pattern 6: YouTube Ingestion (CAP-03)

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter

def get_youtube_transcript(video_url: str) -> str:
    # Extract video ID from URL
    video_id = _extract_video_id(video_url)

    # Try to fetch existing captions first (no API key needed)
    ytt_api = YouTubeTranscriptApi()
    try:
        transcript = ytt_api.fetch(video_id)
    except Exception as exc:
        # Whisper fallback is deferred; surface a user-friendly error instead
        raise ValueError("No captions available and Whisper not configured") from exc
    return TextFormatter().format_transcript(transcript)

Claude's Discretion recommendation: For v1, skip Whisper entirely — only ingest YouTube videos that have existing captions (auto-generated counts). Add Whisper as a future enhancement. Return a user-friendly error when captions are unavailable.

Anti-Patterns to Avoid

  • Synchronous text extraction in FastAPI endpoint: Extracting PDF/DOCX text blocks the event loop. Always delegate to the Celery task.
  • Storing raw file bytes in PostgreSQL: Use MinIO for file storage; only store the MinIO key in kb_documents.
  • Re-embedding on every search: Embed the search query in kb_search.py (already done), not at document query time.
  • Loading SentenceTransformer per Celery task invocation: Already solved via the lazy singleton in embedder.py. Import embed_texts from the same module.
  • Using service account for Google Calendar: The stub uses GOOGLE_SERVICE_ACCOUNT_KEY (wrong for per-tenant user data). Replace with per-tenant OAuth tokens.
  • Storing Google refresh tokens in env vars: Must be per-tenant in DB, encrypted with Fernet.
  • Leaving agent_id NOT NULL on KB documents: KB is now tenant-scoped (per locked decision). Migration 013 must make agent_id nullable. The kb_search.py tool already accepts agent_id but does not filter by it.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| PDF text extraction | Custom PDF parser | pypdf | PDF binary format is extremely complex; pypdf handles encryption, compressed streams, multi-page |
| DOCX parsing | XML unzipper | python-docx | DOCX is a zip of XML schemas; python-docx handles versioning, embedded tables, styles |
| YouTube caption fetching | YouTube Data API scraper | youtube-transcript-api | No API key needed, handles 10+ subtitle track formats, works with auto-generated captions |
| OAuth token refresh | Custom token refresh logic | google.oauth2.credentials.Credentials | google-auth handles expiry, refresh, and HTTP headers automatically |
| URL → clean text | httpx + BeautifulSoup | firecrawl-py | Firecrawl handles JS rendering, anti-bot bypass, returns clean markdown |
| Text chunking | Custom sentence splitter | Simple sliding-window char splitter (~20 lines) | No library needed; langchain-text-splitters adds bloat for a single use case |

Key insight: Document parsing libraries handle edge cases that take months to rediscover (corrupted headers, nested tables, character encoding, password-protected files). The only thing worth writing custom is the chunking algorithm, which is genuinely trivial.


Common Pitfalls

Pitfall 1: kb_documents.agent_id is NOT NULL in Migration 004

What goes wrong: Inserting a KB document without an agent_id will fail with a DB constraint error. The locked decision says KB is per-tenant (not per-agent), so there is no agent_id context at upload time.

Why it happens: The original Phase 2 schema assumed per-agent knowledge bases. The locked decision changed this to per-tenant.

How to avoid: Migration 013 must ALTER TABLE kb_documents ALTER COLUMN agent_id DROP NOT NULL. Update the ORM model in shared/models/kb.py to match.

Warning signs: IntegrityError: null value in column "agent_id" when uploading a KB document.

Pitfall 2: Celery Tasks Are Always sync def with asyncio.run()

What goes wrong: Writing async def ingest_document(...) as a Celery task causes RuntimeError: no running event loop or silent task hang.

Why it happens: Celery workers are not async-native. This is a hard architectural constraint documented in tasks.py.

How to avoid: ingest_document must be def ingest_document(...) with asyncio.run() for any async DB operations.

Warning signs: Task appears in the Celery queue but never completes; no exception in logs.

Pitfall 3: Google OAuth Callback Must Not Require Auth

What goes wrong: If the /api/portal/calendar/callback endpoint has Depends(require_tenant_admin), Google's redirect will fail because the callback URL has no session cookie.

Why it happens: OAuth callbacks are external redirects — they arrive unauthenticated.

How to avoid: The callback endpoint must be unauthenticated (no RBAC dependency). Tenant identity is recovered from the HMAC-signed state parameter, same as the Slack callback pattern in channels.py.

Warning signs: HTTP 401 or redirect loop on the callback URL.

Pitfall 4: Google Access Token Expiry + Write-Back

What goes wrong: A calendar tool call fails with 401 after the access token (1-hour TTL) expires, even though the refresh token is stored.

Why it happens: google.oauth2.credentials.Credentials auto-refreshes in-memory but does not persist the new token to the database.

How to avoid: After every Google API call, compare credentials.token to its pre-call value; if it changed, a refresh occurred, so write the updated token JSON back to channel_connections.config. The before/after comparison is the simplest reliable signal.

Warning signs: Calendar tool works once, then fails 1 hour later.
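A minimal write-back guard for this pitfall (a sketch: `persist` stands in for the channel_connections update, and real code would serialize the full token dict from the storage pattern under Code Examples, not just the access token):

```python
from typing import Any, Callable

def call_with_token_writeback(
    credentials: Any,                # google-auth Credentials in production
    api_call: Callable[[], Any],
    persist: Callable[[str], None],  # writes the (encrypted) token back to the DB
) -> Any:
    """Run a Google API call; if the access token changed in-memory
    (i.e., google-auth refreshed it), persist the new token."""
    token_before = credentials.token
    result = api_call()
    if credentials.token != token_before:
        persist(credentials.token)  # real code serializes refresh_token + expiry too
    return result
```

Wrapping every calendar_lookup API call in this guard prevents the one-hour-later failure mode described above.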

Pitfall 5: pypdf Returns Empty String for Scanned PDFs

What goes wrong: page.extract_text() returns "" for image-based scanned PDFs. The document is ingested with zero chunks and returns no results in KB search.

Why it happens: pypdf only reads embedded text — it cannot OCR images.

How to avoid: After extraction, check if text length < 100 characters. If so, set status='error' with error_message="This PDF contains images only. Text extraction requires OCR, which is not yet supported.".

Warning signs: Document status shows "ready" but KB search returns nothing.
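A guard in the ingestion task catches this before chunking; a sketch (the 100-character threshold comes from the pitfall above and should be tuned against real documents):

```python
MIN_EXTRACTED_CHARS = 100  # assumption: tune against the real corpus

def check_extraction(text: str) -> None:
    """Fail ingestion early when a PDF yields (almost) no embedded text."""
    if len(text.strip()) < MIN_EXTRACTED_CHARS:
        raise ValueError(
            "This PDF contains images only. Text extraction requires OCR, "
            "which is not yet supported."
        )
```

The raised message flows into kb_documents.error_message via the task's error path, so the portal shows the user-facing explanation directly.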

Pitfall 6: ChannelTypeEnum Does Not Include google_calendar

What goes wrong: Inserting a ChannelConnection with channel_type='google_calendar' fails if ChannelTypeEnum only includes messaging channels.

Why it happens: ChannelTypeEnum was defined in Phase 1 for messaging channels only.

How to avoid: Check shared/models/tenant.py — if channel_type were an sa.Enum, adding a value would require ALTER TYPE DDL in a migration. Per the Phase 1 ADR it is stored as TEXT, so no enum migration is needed; but if a CHECK constraint enumerates the allowed values, that constraint must be extended in migration 013.

Warning signs: LookupError or IntegrityError when inserting the Google Calendar connection.


Code Examples

Upload Endpoint Pattern (FastAPI multipart)

# Source: FastAPI official docs — https://fastapi.tiangolo.com/tutorial/request-files/
from fastapi import UploadFile, File, Form
import uuid

@kb_router.post("/{tenant_id}/documents", status_code=201)
async def upload_document(
    tenant_id: uuid.UUID,
    file: UploadFile = File(...),
    caller: PortalCaller = Depends(require_tenant_admin),
    session: AsyncSession = Depends(get_session),
) -> dict:
    file_bytes = await file.read()
    # 1. Upload to MinIO
    # 2. Insert KbDocument(status='processing')
    # 3. ingest_document.delay(str(doc.id), str(tenant_id))
    # 4. Return 201 with doc.id
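The document-size-limit discretion item fits naturally as a pre-flight check in this endpoint. A sketch (the 25 MB cap and the extension list are assumptions, not locked decisions; the list mirrors what extract_text supports):

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: 25 MB cap
ALLOWED_EXTENSIONS = {"pdf", "docx", "txt", "md", "csv", "xlsx", "xls", "pptx"}

def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject unsupported extensions and oversized files before queuing ingestion."""
    ext = filename.lower().rsplit(".", 1)[-1] if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {filename}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"File exceeds {MAX_UPLOAD_BYTES // (1024 * 1024)} MB limit")
```

Calling this before the MinIO write keeps rejected uploads out of storage and off the Celery queue.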

Google Calendar Token Storage Pattern

# Reuse existing ChannelConnection + HMAC OAuth state from channels.py
# After OAuth callback:
token_data = {
    "token": credentials.token,
    "refresh_token": credentials.refresh_token,
    "token_uri": credentials.token_uri,
    "client_id": settings.google_client_id,
    "client_secret": settings.google_client_secret,
    "scopes": list(credentials.scopes),
    "expiry": credentials.expiry.isoformat() if credentials.expiry else None,
}
enc_svc = _get_encryption_service()
encrypted_token = enc_svc.encrypt(json.dumps(token_data))

conn = ChannelConnection(
    tenant_id=tenant_id,
    channel_type="google_calendar",   # TEXT column — no enum migration needed
    workspace_id=str(tenant_id),      # Sentinel: tenant ID as workspace ID
    config={"token": encrypted_token},
)

Celery Ingestion Task Structure

# Source: tasks.py architectural pattern (always sync def + asyncio.run())
@celery_app.task(bind=True, max_retries=3)
def ingest_document(self, document_id: str, tenant_id: str) -> None:
    """Background document ingestion — extract, chunk, embed, store."""
    try:
        asyncio.run(_ingest_document_async(document_id, tenant_id))
    except Exception as exc:
        asyncio.run(_mark_document_error(document_id, str(exc)))
        raise self.retry(exc=exc, countdown=60)

Google Calendar Event Creation

# Source: https://developers.google.com/workspace/calendar/api/guides/create-events
event_body = {
    "summary": event_summary,
    "start": {"dateTime": event_start, "timeZone": "UTC"},
    "end": {"dateTime": event_end, "timeZone": "UTC"},
}
event = service.events().insert(calendarId="primary", body=event_body).execute()
return f"Event created: {event.get('summary')} at {event.get('start', {}).get('dateTime')}"
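For the check_availability action, the Calendar freebusy endpoint returns busy intervals; deciding whether a requested slot is free is then a pure interval check. A sketch (datetimes are assumed timezone-consistent after parsing the API response):

```python
from datetime import datetime

def slot_is_free(
    busy: list[tuple[datetime, datetime]],
    start: datetime,
    end: datetime,
) -> bool:
    """True if [start, end) overlaps none of the busy intervals
    (as parsed from a freebusy query response)."""
    return all(end <= busy_start or start >= busy_end for busy_start, busy_end in busy)
```

Keeping the overlap logic pure makes CAP-05's availability path unit-testable without mocking the Google client.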

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| calendar_lookup.py uses service account (global) | Per-tenant OAuth tokens (per locked decision) | Phase 10 | Agents access each tenant's own calendar, not a shared service account |
| KB is per-agent (agent_id NOT NULL) | KB is per-tenant (agent_id nullable) | Phase 10 locked decision | All agents in a tenant share one knowledge base |
| youtube-transcript-api v0.x synchronous only | v1.2.4 (Jan 2026) uses YouTubeTranscriptApi() instance | 2025 | Minor API change — instantiate the class, call .fetch(video_id) |

Deprecated/outdated:

  • calendar_lookup.py service account path: To be replaced entirely. The GOOGLE_SERVICE_ACCOUNT_KEY env var check should be removed.
  • agent_id NOT NULL on kb_documents: Migration 013 removes this constraint.

Open Questions

  1. Firecrawl API key management

    • What we know: firecrawl-py SDK connects to cloud API by default; self-hosted option available
    • What's unclear: Whether to add FIRECRAWL_API_KEY as a platform-wide setting in shared/config.py or as a tenant BYO credential
    • Recommendation: Add as platform-wide FIRECRAWL_API_KEY in settings (same pattern as BRAVE_API_KEY); make it optional with graceful degradation
  2. ChannelTypeEnum compatibility for google_calendar

    • What we know: Phase 1 ADR chose TEXT + CHECK over sa.Enum to avoid migration DDL conflicts
    • What's unclear: Whether there's a CHECK constraint that needs updating, or if it's open TEXT
    • Recommendation: Inspect channel_connections table DDL in migration 001 before writing migration 013
  3. Document re-index flow

    • What we know: CONTEXT.md mentions a re-index option in the KB portal
    • What's unclear: Whether re-index deletes all existing chunks first or appends
    • Recommendation: Delete all kb_chunks for the document, then re-run ingest_document.delay() — simplest and idempotent
  4. Whisper fallback for YouTube

    • What we know: openai-whisper requires model download (~140MB minimum) and GPU for reasonable speed
    • What's unclear: Whether v1 should include Whisper at all given the infrastructure cost
    • Recommendation: Omit Whisper for v1; return error when captions unavailable; add to v2 requirements

Validation Architecture

Test Framework

| Property | Value |
|---|---|
| Framework | pytest + pytest-asyncio (existing) |
| Config file | pytest.ini or pyproject.toml [tool.pytest] at repo root |
| Quick run command | pytest tests/unit -x -q |
| Full suite command | pytest tests/unit tests/integration -x -q |

Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| CAP-01 | web_search() returns Brave results when key is set; gracefully degrades when key is missing | unit | pytest tests/unit/test_web_search.py -x | Wave 0 |
| CAP-02 | kb_search() returns ranked chunks for a query after ingestion | integration | pytest tests/integration/test_kb_search.py -x | Wave 0 |
| CAP-03 | File upload endpoint accepts PDF/DOCX/TXT, creates KbDocument with status=processing, triggers Celery task | unit+integration | pytest tests/unit/test_kb_upload.py tests/integration/test_kb_ingestion.py -x | Wave 0 |
| CAP-04 | http_request() returns correct response; rejects invalid methods; handles timeout | unit | pytest tests/unit/test_http_request.py -x | Wave 0 |
| CAP-05 | Calendar tool reads tenant token from DB, calls Google API, returns formatted events | unit (mock Google) | pytest tests/unit/test_calendar_lookup.py -x | Wave 0 |
| CAP-06 | Tool results in agent responses are natural language, not raw JSON | unit (prompt check) | pytest tests/unit/test_tool_response_format.py -x | Wave 0 |
| CAP-07 | Every tool invocation writes an audit_events row with tool name + args summary | integration | Covered by existing tests/integration/test_audit.py — extend with tool invocation cases | (extend) |

Sampling Rate

  • Per task commit: pytest tests/unit -x -q
  • Per wave merge: pytest tests/unit tests/integration -x -q
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

  • tests/unit/test_web_search.py — covers CAP-01 (mock httpx, test key-missing degradation + success path)
  • tests/unit/test_kb_upload.py — covers CAP-03 upload endpoint (mock MinIO, mock Celery task dispatch)
  • tests/unit/test_kb_ingestion.py — covers text extraction functions per format (PDF, DOCX, TXT, CSV)
  • tests/integration/test_kb_search.py — covers CAP-02 (real pgvector, insert test chunks, verify similarity search)
  • tests/integration/test_kb_ingestion.py — covers CAP-03 end-to-end (upload → task → chunks in DB)
  • tests/unit/test_http_request.py — covers CAP-04 (mock httpx, test method validation, timeout)
  • tests/unit/test_calendar_lookup.py — covers CAP-05 (mock Google API, mock DB token lookup)

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

  • PyPI: youtube-transcript-api v1.2.4 (Jan 2026) — version + API confirmed
  • PyPI: firecrawl-py — cloud + self-hosted documented
  • WebSearch 2025: pypdf for PDF extraction — confirmed as lightweight, no C-deps option
  • WebSearch 2025: Celery sync def constraint confirmed via tasks.py docstring cross-reference

Tertiary (LOW confidence)

  • Chunking parameters (500 chars, 50 overlap) — from community RAG practice, not benchmarked for this dataset
  • Firecrawl cloud vs self-hosted recommendation — based on project stage, not measured performance comparison

Metadata

Confidence breakdown:

  • Standard stack: HIGH — all libraries verified via PyPI + official docs
  • Architecture: HIGH — pattern directly extends existing Phase 1-3 Slack OAuth and Celery task patterns in codebase
  • Pitfalls: HIGH — agent_id NOT NULL issue is verified directly from migration 004 source code; token write-back is documented in google-auth source
  • Chunking strategy: MEDIUM — recommended values are community defaults, not project-specific benchmarks

Research date: 2026-03-26 Valid until: 2026-06-26 (stable domain; Google OAuth API is very stable)