konstruct/.planning/phases/10-agent-capabilities/10-RESEARCH.md
Phase 10: Agent Capabilities - Research

Researched: 2026-03-26
Domain: Document ingestion pipeline, Google Calendar OAuth, web search activation, KB portal UI
Confidence: HIGH


<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

  • KB format support: PDF, DOCX/Word, TXT, Markdown, CSV/Excel, PPT/PowerPoint, URLs (via Firecrawl), YouTube (transcript API + Whisper fallback)
  • KB scope: Per-tenant — all agents in a tenant share the same knowledge base
  • KB portal: Dedicated KB management page (not inline in Agent Designer)
    • Upload files (drag-and-drop + file picker)
    • Add URLs for scraping
    • Add YouTube URLs for transcription
    • View ingested documents with status (processing, ready, error)
    • Delete documents (removes chunks from pgvector)
    • Re-index option
  • Document processing: Async/background via Celery — upload returns immediately
  • Processing status: Visible in portal (progress indicator per document)
  • Web search: Brave Search API already implemented in web_search.py — just needs BRAVE_API_KEY added to .env
  • HTTP request tool: Already implemented — no changes needed
  • Calendar: Google Calendar OAuth per tenant — tenant admin authorizes in portal; full CRUD for v1 (check availability, list upcoming events, create events); OAuth callback in portal; credentials stored encrypted via Fernet

Claude's Discretion

  • Web search: platform-wide vs per-tenant API key (recommend platform-wide)
  • Chunking strategy (chunk size, overlap)
  • Embedding model for KB (reuse all-MiniLM-L6-v2 or upgrade)
  • Firecrawl integration approach (self-hosted vs cloud API)
  • YouTube transcription: when to use existing captions vs OpenWhisper
  • Document size limits
  • KB chunk deduplication strategy

Deferred Ideas (OUT OF SCOPE)

None — discussion stayed within phase scope. </user_constraints>


<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|---|---|---|
| CAP-01 | Web search tool returns real results from Brave Search | Tool already calls Brave API — just needs BRAVE_API_KEY env var set; web_search.py is production-ready |
| CAP-02 | KB tool searches tenant-scoped documents that have been uploaded, chunked, and embedded in pgvector | kb_search.py + kb_chunks table + HNSW index all exist; needs real chunk data from the ingestion pipeline |
| CAP-03 | Operators can upload documents (PDF, DOCX, TXT + more formats) via portal | Needs: new FastAPI /api/portal/kb/* router, Celery ingestion task, portal /knowledge-base page, per-format text extraction libraries |
| CAP-04 | HTTP request tool can call operator-configured URLs with response parsing and timeout handling | http_request.py is fully implemented — no code changes needed, only documentation |
| CAP-05 | Calendar tool can check Google Calendar availability | Stub in calendar_lookup.py must be replaced with per-tenant OAuth token read + Google Calendar API call |
| CAP-06 | Tool results incorporated naturally into agent responses — no raw JSON | Agent runner already formats tool results as text strings; this is an LLM prompt quality concern, not architecture |
| CAP-07 | All tool invocations logged in audit trail with input parameters and output summary | execute_tool() in executor.py already calls audit_logger.log_tool_call() on every invocation — already satisfied |
</phase_requirements>

Summary

Phase 10 has two distinct effort levels. CAP-01, CAP-04, CAP-07, and partially CAP-06 are already architecturally complete — they need configuration, environment variables, or documentation rather than new code. The heavy lifting is CAP-03 (document ingestion pipeline) and CAP-05 (Google Calendar OAuth per tenant).

The document ingestion pipeline is the largest deliverable: a multipart file upload endpoint, text extraction for 7 format families, chunking + embedding Celery task, MinIO storage for original files, status tracking on kb_documents, and a new portal page with drag-and-drop upload and live status polling. The KB table schema and pgvector HNSW index already exist from Phase 2 migration 004.

The Google Calendar integration requires replacing the service-account stub in calendar_lookup.py with per-tenant OAuth token lookup (decrypt from DB), building a Google OAuth initiation + callback endpoint pair in the gateway, storing encrypted access+refresh tokens per tenant, and expanding the calendar tool to support event creation in addition to read. This follows the same HMAC-signed state + encrypted token storage pattern already used for Slack OAuth.

Primary recommendation: Build the document ingestion pipeline first (CAP-02/CAP-03), then Google Calendar OAuth (CAP-05), then wire CAP-01 via .env configuration.


Standard Stack

Core (Python backend)

| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| pypdf | >=4.0 | PDF text extraction | Pure Python, no C deps, fast, reliable for standard PDFs |
| python-docx | >=1.1 | DOCX text extraction | Official-style library, handles paragraphs + tables |
| python-pptx | >=1.0 | PPT/PPTX text extraction | Standard library for PowerPoint, iterates slides/shapes |
| openpyxl | >=3.1 | XLSX text extraction | Already likely installed; reads cell values with data_only=True |
| pandas | >=2.0 | CSV + Excel parsing | Handles encodings, type coercion, multi-sheet Excel |
| firecrawl-py | >=1.0 | URL scraping to markdown | Returns clean LLM-ready markdown, handles JS rendering |
| youtube-transcript-api | >=1.2 | YouTube caption extraction | No API key needed, works with auto-generated captions |
| google-api-python-client | >=2.0 | Google Calendar API calls | Official Google client |
| google-auth-oauthlib | >=1.0 | Google OAuth 2.0 web flow | Handles code exchange, token refresh |

Supporting

| Library | Version | Purpose | When to Use |
|---|---|---|---|
| aiofiles | >=23.0 | Async file I/O in FastAPI upload handler | Prevents blocking event loop during file writes |
| python-multipart | already installed (FastAPI dep) | Multipart form parsing for UploadFile | Required by FastAPI for file upload endpoints |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
|---|---|---|
| pypdf | pymupdf4llm | Faster and higher extraction quality, but carries AGPL license restrictions |
| pypdf | pdfplumber | Better for tables but ~4x slower; pypdf is sufficient for KB ingestion |
| firecrawl-py (cloud API) | Self-hosted Firecrawl | Self-hosted has full feature parity via Docker but adds infrastructure overhead; cloud API is simpler for v1 |
| youtube-transcript-api | openai-whisper | Whisper requires model download + GPU; use youtube-transcript-api first and fall back to Whisper only when captions are unavailable |
| Simple text chunking | langchain-text-splitters | Adds a large dependency for what is ~20 lines of custom code; write a simple sliding-window chunker inline |

Installation:

# Orchestrator: document processing + Google Calendar
uv add --project packages/orchestrator \
    pypdf python-docx python-pptx openpyxl pandas \
    firecrawl-py youtube-transcript-api \
    google-api-python-client google-auth-oauthlib

# Gateway: file upload endpoint (python-multipart already installed via FastAPI)
# No additional deps needed for gateway

# Add status column to kb_documents: handled in new Alembic migration

Architecture Patterns

packages/
├── orchestrator/orchestrator/
│   ├── tasks.py                         # Add: ingest_document Celery task
│   └── tools/builtins/
│       └── calendar_lookup.py           # Replace stub with OAuth token lookup + full CRUD
├── shared/shared/
│   ├── api/
│   │   ├── kb.py                        # New: KB management router (upload, list, delete)
│   │   └── calendar_auth.py             # New: Google Calendar OAuth initiation + callback
│   └── models/
│       └── kb.py                        # Extend: add status + error_message columns
migrations/versions/
└── 013_kb_document_status.py            # New: add status + error_message to kb_documents
packages/portal/app/(dashboard)/
└── knowledge-base/
    └── page.tsx                         # New: KB management page

Pattern 1: Document Ingestion Pipeline (CAP-02/CAP-03)

What: Upload returns immediately (201), a Celery task handles text extraction → chunking → embedding → pgvector insert asynchronously.

When to use: All document types (file, URL, YouTube).

POST /api/portal/kb/upload (multipart file)
  → Save file to MinIO (kb-documents bucket)
  → Insert KbDocument with status='processing'
  → Return 201 with document ID
  → [async] ingest_document.delay(document_id, tenant_id)
      → Extract text (format-specific extractor)
      → Chunk text (500 chars, 50 char overlap)
      → embed_texts(chunks) in batch
      → INSERT kb_chunks rows
      → UPDATE kb_documents SET status='ready'
      → On error: UPDATE kb_documents SET status='error', error_message=...

GET /api/portal/kb/{tenant_id}/documents
  → List KbDocument rows with status field for portal polling

DELETE /api/portal/kb/{document_id}
  → DELETE KbDocument (CASCADE deletes kb_chunks via FK)
  → DELETE file from MinIO
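The asynchronous branch of the flow above decomposes cleanly into a pure function with the side-effecting pieces injected, which keeps the Celery task thin and the pipeline unit-testable. A sketch (function and parameter names here are illustrative, not existing code; the real task also handles MinIO reads and status updates):

```python
import asyncio
from typing import Callable

async def run_ingestion(
    file_bytes: bytes,
    filename: str,
    extract: Callable[[bytes, str], str],
    chunk: Callable[[str], list[str]],
    embed: Callable[[list[str]], list[list[float]]],
) -> list[tuple[str, list[float]]]:
    """Extract text, chunk it, embed the chunks; returns (chunk, vector) pairs
    ready to INSERT into kb_chunks. DB and MinIO side effects stay in the caller."""
    text = extract(file_bytes, filename)
    chunks = chunk(text)
    vectors = embed(chunks)
    return list(zip(chunks, vectors))
```

With stubs injected, this shape is trivially testable without a database or object store.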

Migration 013 needed — add to kb_documents:

-- status: processing | ready | error
ALTER TABLE kb_documents ADD COLUMN status TEXT NOT NULL DEFAULT 'processing';
ALTER TABLE kb_documents ADD COLUMN error_message TEXT;
ALTER TABLE kb_documents ADD COLUMN chunk_count INTEGER;
-- KB is tenant-scoped per the locked decision: drop the per-agent constraint
ALTER TABLE kb_documents ALTER COLUMN agent_id DROP NOT NULL;

Note: kb_documents.agent_id is NOT NULL in the existing schema, but KB is now tenant-scoped (all agents share it). Resolution: make agent_id nullable in migration 013; that is cleaner than writing a sentinel all-zeros UUID into every row.

Pattern 2: Text Extraction by Format

# Source: standard library usage — no external doc needed
import io

def extract_text(file_bytes: bytes, filename: str) -> str:
    ext = filename.lower().rsplit(".", 1)[-1]

    if ext == "pdf":
        from pypdf import PdfReader
        reader = PdfReader(io.BytesIO(file_bytes))
        return "\n".join(p.extract_text() or "" for p in reader.pages)

    elif ext == "docx":
        from docx import Document
        doc = Document(io.BytesIO(file_bytes))
        return "\n".join(p.text for p in doc.paragraphs)

    elif ext == "pptx":
        from pptx import Presentation
        prs = Presentation(io.BytesIO(file_bytes))
        lines = []
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    lines.append(shape.text)
        return "\n".join(lines)

    elif ext in ("xlsx", "xls"):
        # Note: legacy .xls needs the xlrd engine, which is not in the stack above
        import pandas as pd
        df = pd.read_excel(io.BytesIO(file_bytes))
        return df.to_csv(index=False)

    elif ext in ("csv", "txt", "md"):
        return file_bytes.decode("utf-8", errors="replace")

    else:
        raise ValueError(f"Unsupported file extension: {ext}")

Pattern 3: Chunking Strategy (Claude's Discretion)

Recommendation: Simple sliding-window chunking with chunk_size=500, overlap=50 (characters, not tokens). This fits the all-MiniLM-L6-v2 model's effective input length (~256 tokens ≈ ~1000 chars) with room to spare.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return [c.strip() for c in chunks if c.strip()]

No external library needed. langchain-text-splitters would add ~50MB of dependencies for this single use case.
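For the chunk-deduplication discretion item, an exact-hash pass over normalized chunks is the cheapest option. A sketch (persisting a `content_hash` column would extend this across re-uploads, but that column is an assumption, not an existing schema field):

```python
import hashlib

def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop exact-duplicate chunks (after whitespace normalization), preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        # Normalize whitespace so reflowed copies of the same text collapse together
        digest = hashlib.sha256(" ".join(chunk.split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Run it between chunking and embedding; it also trims the embedding batch size for repetitive documents.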

Pattern 4: Google Calendar OAuth per Tenant (CAP-05)

What: Each tenant authorizes Konstruct to access their Google Calendar. OAuth tokens (access + refresh) stored encrypted in a new calendar_tokens DB table per tenant (or in channel_connections as a google_calendar entry — reuse existing pattern).

Reuse channel_connections table: Add channel_type = 'google_calendar' entry per tenant. Store encrypted token JSON in config JSONB column. This avoids a new migration for a new table.

GET /api/portal/calendar/install?tenant_id={id}
  → Generate HMAC-signed OAuth state (same generate_oauth_state() as Slack)
  → Return Google OAuth URL with state param

GET /api/portal/calendar/callback?code={code}&state={state}
  → Verify HMAC state → extract tenant_id
  → Exchange code for {access_token, refresh_token, expiry}
  → Encrypt token JSON with Fernet
  → Upsert ChannelConnection(channel_type='google_calendar', config={...})
  → Redirect to portal /settings/calendar?connected=true
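The existing generate_oauth_state() helper lives in channels.py and should be reused as-is. Illustratively (names and the secret-loading here are hypothetical, not the real helper's signature), an HMAC-signed state round-trip looks like:

```python
import base64
import hashlib
import hmac
import json
import time

_STATE_SECRET = b"replace-with-settings-secret"  # assumption: loaded from settings in Konstruct

def sign_state(tenant_id: str, ttl_seconds: int = 600) -> str:
    """Encode tenant_id + expiry, signed so the unauthenticated callback can trust it."""
    payload = base64.urlsafe_b64encode(
        json.dumps({"tenant_id": tenant_id, "exp": int(time.time()) + ttl_seconds}).encode()
    ).decode()
    sig = hmac.new(_STATE_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"

def verify_state(state: str) -> str:
    """Return the tenant_id if the signature is valid and unexpired."""
    payload, sig = state.rsplit(".", 1)
    expected = hmac.new(_STATE_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("Invalid OAuth state signature")
    data = json.loads(base64.urlsafe_b64decode(payload))
    if data["exp"] < time.time():
        raise ValueError("OAuth state expired")
    return data["tenant_id"]
```

This is why the callback endpoint can stay unauthenticated: tenant identity is recovered from the verified state, not from a session.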

Google OAuth scopes needed (FULL CRUD per locked decision):

_GOOGLE_CALENDAR_SCOPES = [
    "https://www.googleapis.com/auth/calendar",  # Full read+write
]
# NOT readonly — create events requires full calendar scope

calendar_lookup.py replacement — per-tenant token lookup:

async def calendar_lookup(
    date: str,
    action: str = "list",          # list | create | check_availability
    event_summary: str | None = None,
    event_start: str | None = None,  # ISO 8601 with timezone
    event_end: str | None = None,
    calendar_id: str = "primary",
    tenant_id: str | None = None,   # Injected by executor
    **kwargs: object,
) -> str:
    # 1. Load encrypted token from channel_connections
    # 2. Decrypt with KeyEncryptionService
    # 3. Build google.oauth2.credentials.Credentials from token dict
    # 4. Auto-refresh if expired (google-auth handles this)
    # 5. Call Calendar API (list or insert)
    # 6. Format result as natural language

Token refresh: google.oauth2.credentials.Credentials auto-refreshes using the stored refresh_token when access_token is expired. After any refresh, write the updated token back to channel_connections.config.

Pattern 5: URL Ingestion via Firecrawl (CAP-03)

from firecrawl import FirecrawlApp

def scrape_url(url: str) -> str:
    # Sync on purpose: this runs inside the (sync) Celery ingestion task,
    # and the FirecrawlApp SDK call is blocking anyway.
    app = FirecrawlApp(api_key=settings.firecrawl_api_key)
    result = app.scrape_url(url, params={"formats": ["markdown"]})
    # NOTE: the return shape differs across firecrawl-py major versions
    # (dict in 0.x, response object in 1.x) — verify against the pinned version.
    return result.get("markdown", "")

Claude's Discretion recommendation: Use Firecrawl cloud API for v1. Add FIRECRAWL_API_KEY to .env. Self-host only when data sovereignty is required.

Pattern 6: YouTube Ingestion (CAP-03)

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter

def get_youtube_transcript(video_url: str) -> str:
    # Extract video ID from URL
    video_id = _extract_video_id(video_url)

    # Try to fetch existing captions first (no API key needed)
    ytt_api = YouTubeTranscriptApi()
    try:
        transcript = ytt_api.fetch(video_id)
    except Exception as exc:
        # Whisper fallback is deferred; surface a user-friendly error instead
        raise ValueError("No captions available and Whisper not configured") from exc
    return TextFormatter().format_transcript(transcript)

Claude's Discretion recommendation: For v1, skip Whisper entirely — only ingest YouTube videos that have existing captions (auto-generated counts). Add Whisper as a future enhancement. Return a user-friendly error when captions are unavailable.

Anti-Patterns to Avoid

  • Synchronous text extraction in FastAPI endpoint: Extracting PDF/DOCX text blocks the event loop. Always delegate to the Celery task.
  • Storing raw file bytes in PostgreSQL: Use MinIO for file storage; only store the MinIO key in kb_documents.
  • Re-embedding on every search: Embed the search query in kb_search.py (already done), not at document query time.
  • Loading SentenceTransformer per Celery task invocation: Already solved via the lazy singleton in embedder.py. Import embed_texts from the same module.
  • Using service account for Google Calendar: The stub uses GOOGLE_SERVICE_ACCOUNT_KEY (wrong for per-tenant user data). Replace with per-tenant OAuth tokens.
  • Storing Google refresh tokens in env vars: Must be per-tenant in DB, encrypted with Fernet.
  • Leaving agent_id NOT NULL on KB documents: KB is now tenant-scoped (per locked decision). Migration 013 must make agent_id nullable. The kb_search.py tool already accepts agent_id but does not filter by it.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| PDF text extraction | Custom PDF parser | pypdf | PDF binary format is extremely complex; pypdf handles encryption, compressed streams, multi-page |
| DOCX parsing | XML unzipper | python-docx | DOCX is a zip of XML schemas; python-docx handles versioning, embedded tables, styles |
| YouTube caption fetching | YouTube Data API scraper | youtube-transcript-api | No API key needed, handles 10+ subtitle track formats, works with auto-generated captions |
| OAuth token refresh | Custom token refresh logic | google.oauth2.credentials.Credentials | google-auth handles expiry, refresh, and HTTP headers automatically |
| URL → clean text | httpx + BeautifulSoup | firecrawl-py | Firecrawl handles JS rendering, anti-bot bypass, returns clean markdown |
| Text chunking | Custom sentence splitter | Simple sliding-window char splitter (~20 lines) | No library needed; langchain-text-splitters adds bloat for a single use case |

Key insight: Document parsing libraries handle edge cases that take months to rediscover (corrupted headers, nested tables, character encoding, password-protected files). The only thing worth writing custom is the chunking algorithm, which is genuinely trivial.


Common Pitfalls

Pitfall 1: kb_documents.agent_id is NOT NULL in Migration 004

What goes wrong: Inserting a KB document without an agent_id will fail with a DB constraint error. The locked decision says KB is per-tenant (not per-agent), so there is no agent_id context at upload time.

Why it happens: The original Phase 2 schema assumed per-agent knowledge bases. The locked decision changed this to per-tenant.

How to avoid: Migration 013 must ALTER TABLE kb_documents ALTER COLUMN agent_id DROP NOT NULL. Update the ORM model in shared/models/kb.py to match.

Warning signs: IntegrityError: null value in column "agent_id" when uploading a KB document.

Pitfall 2: Celery Tasks Are Always sync def with asyncio.run()

What goes wrong: Writing async def ingest_document(...) as a Celery task causes RuntimeError: no running event loop or silent task hang.

Why it happens: Celery workers are not async-native. This is a hard architectural constraint documented in tasks.py.

How to avoid: ingest_document must be def ingest_document(...) with asyncio.run() for any async DB operations.

Warning signs: Task appears in the Celery queue but never completes; no exception in logs.

Pitfall 3: Google OAuth Callback Must Not Require Auth

What goes wrong: If the /api/portal/calendar/callback endpoint has Depends(require_tenant_admin), Google's redirect will fail because the callback URL has no session cookie.

Why it happens: OAuth callbacks are external redirects — they arrive unauthenticated.

How to avoid: The callback endpoint must be unauthenticated (no RBAC dependency). Tenant identity is recovered from the HMAC-signed state parameter, same as the Slack callback pattern in channels.py.

Warning signs: HTTP 401 or redirect loop on the callback URL.

Pitfall 4: Google Access Token Expiry + Write-Back

What goes wrong: A calendar tool call fails with 401 after the access token (1-hour TTL) expires, even though the refresh token is stored.

Why it happens: google.oauth2.credentials.Credentials auto-refreshes in-memory but does not persist the new token to the database.

How to avoid: After every Google API call, compare credentials.token to its pre-call value; if it changed, a refresh occurred, so write the updated token JSON back to channel_connections.config. The before/after comparison is the simplest reliable signal.

Warning signs: Calendar tool works once, then fails 1 hour later.
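A minimal write-back guard for this pitfall (a sketch: `persist` stands in for the channel_connections update, and real code would serialize the full token dict from the storage pattern under Code Examples, not just the access token):

```python
from typing import Any, Callable

def call_with_token_writeback(
    credentials: Any,                # google-auth Credentials in production
    api_call: Callable[[], Any],
    persist: Callable[[str], None],  # writes the (encrypted) token back to the DB
) -> Any:
    """Run a Google API call; if the access token changed in-memory
    (i.e., google-auth refreshed it), persist the new token."""
    token_before = credentials.token
    result = api_call()
    if credentials.token != token_before:
        persist(credentials.token)  # real code serializes refresh_token + expiry too
    return result
```

Wrapping every calendar_lookup API call in this guard prevents the one-hour-later failure mode described above.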

Pitfall 5: pypdf Returns Empty String for Scanned PDFs

What goes wrong: page.extract_text() returns "" for image-based scanned PDFs. The document is ingested with zero chunks and returns no results in KB search.

Why it happens: pypdf only reads embedded text — it cannot OCR images.

How to avoid: After extraction, check if text length < 100 characters. If so, set status='error' with error_message="This PDF contains images only. Text extraction requires OCR, which is not yet supported.".

Warning signs: Document status shows "ready" but KB search returns nothing.
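A guard in the ingestion task catches this before chunking; a sketch (the 100-character threshold comes from the pitfall above and should be tuned against real documents):

```python
MIN_EXTRACTED_CHARS = 100  # assumption: tune against the real corpus

def check_extraction(text: str) -> None:
    """Fail ingestion early when a PDF yields (almost) no embedded text."""
    if len(text.strip()) < MIN_EXTRACTED_CHARS:
        raise ValueError(
            "This PDF contains images only. Text extraction requires OCR, "
            "which is not yet supported."
        )
```

The raised message flows into kb_documents.error_message via the task's error path, so the portal shows the user-facing explanation directly.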

Pitfall 6: ChannelTypeEnum Does Not Include google_calendar

What goes wrong: Inserting a ChannelConnection with channel_type='google_calendar' fails if ChannelTypeEnum only includes messaging channels.

Why it happens: ChannelTypeEnum was defined in Phase 1 for messaging channels only.

How to avoid: Check shared/models/tenant.py — if channel_type were an sa.Enum, adding a value would require ALTER TYPE DDL in a migration. Per the Phase 1 ADR it is stored as TEXT, so no enum migration is needed; but if a CHECK constraint enumerates the allowed values, that constraint must be extended in migration 013.

Warning signs: LookupError or IntegrityError when inserting the Google Calendar connection.


Code Examples

Upload Endpoint Pattern (FastAPI multipart)

# Source: FastAPI official docs — https://fastapi.tiangolo.com/tutorial/request-files/
from fastapi import UploadFile, File, Form
import uuid

@kb_router.post("/{tenant_id}/documents", status_code=201)
async def upload_document(
    tenant_id: uuid.UUID,
    file: UploadFile = File(...),
    caller: PortalCaller = Depends(require_tenant_admin),
    session: AsyncSession = Depends(get_session),
) -> dict:
    file_bytes = await file.read()
    # 1. Upload to MinIO
    # 2. Insert KbDocument(status='processing')
    # 3. ingest_document.delay(str(doc.id), str(tenant_id))
    # 4. Return 201 with doc.id
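The document-size-limit discretion item fits naturally as a pre-flight check in this endpoint. A sketch (the 25 MB cap and the extension list are assumptions, not locked decisions; the list mirrors what extract_text supports):

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: 25 MB cap
ALLOWED_EXTENSIONS = {"pdf", "docx", "txt", "md", "csv", "xlsx", "xls", "pptx"}

def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject unsupported extensions and oversized files before queuing ingestion."""
    ext = filename.lower().rsplit(".", 1)[-1] if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {filename}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"File exceeds {MAX_UPLOAD_BYTES // (1024 * 1024)} MB limit")
```

Calling this before the MinIO write keeps rejected uploads out of storage and off the Celery queue.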

Google Calendar Token Storage Pattern

# Reuse existing ChannelConnection + HMAC OAuth state from channels.py
# After OAuth callback:
token_data = {
    "token": credentials.token,
    "refresh_token": credentials.refresh_token,
    "token_uri": credentials.token_uri,
    "client_id": settings.google_client_id,
    "client_secret": settings.google_client_secret,
    "scopes": list(credentials.scopes),
    "expiry": credentials.expiry.isoformat() if credentials.expiry else None,
}
enc_svc = _get_encryption_service()
encrypted_token = enc_svc.encrypt(json.dumps(token_data))

conn = ChannelConnection(
    tenant_id=tenant_id,
    channel_type="google_calendar",   # TEXT column — no enum migration needed
    workspace_id=str(tenant_id),      # Sentinel: tenant ID as workspace ID
    config={"token": encrypted_token},
)

Celery Ingestion Task Structure

# Source: tasks.py architectural pattern (always sync def + asyncio.run())
@celery_app.task(bind=True, max_retries=3)
def ingest_document(self, document_id: str, tenant_id: str) -> None:
    """Background document ingestion — extract, chunk, embed, store."""
    try:
        asyncio.run(_ingest_document_async(document_id, tenant_id))
    except Exception as exc:
        asyncio.run(_mark_document_error(document_id, str(exc)))
        raise self.retry(exc=exc, countdown=60)

Google Calendar Event Creation

# Source: https://developers.google.com/workspace/calendar/api/guides/create-events
event_body = {
    "summary": event_summary,
    "start": {"dateTime": event_start, "timeZone": "UTC"},
    "end": {"dateTime": event_end, "timeZone": "UTC"},
}
event = service.events().insert(calendarId="primary", body=event_body).execute()
return f"Event created: {event.get('summary')} at {event.get('start', {}).get('dateTime')}"
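For the check_availability action, the Calendar freebusy endpoint returns busy intervals; deciding whether a requested slot is free is then a pure interval check. A sketch (datetimes are assumed timezone-consistent after parsing the API response):

```python
from datetime import datetime

def slot_is_free(
    busy: list[tuple[datetime, datetime]],
    start: datetime,
    end: datetime,
) -> bool:
    """True if [start, end) overlaps none of the busy intervals
    (as parsed from a freebusy query response)."""
    return all(end <= busy_start or start >= busy_end for busy_start, busy_end in busy)
```

Keeping the overlap logic pure makes CAP-05's availability path unit-testable without mocking the Google client.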

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| calendar_lookup.py uses service account (global) | Per-tenant OAuth tokens (per locked decision) | Phase 10 | Agents access each tenant's own calendar, not a shared service account |
| KB is per-agent (agent_id NOT NULL) | KB is per-tenant (agent_id nullable) | Phase 10 locked decision | All agents in a tenant share one knowledge base |
| youtube-transcript-api v0.x synchronous only | v1.2.4 (Jan 2026) uses YouTubeTranscriptApi() instance | 2025 | Minor API change — instantiate the class, call .fetch(video_id) |

Deprecated/outdated:

  • calendar_lookup.py service account path: To be replaced entirely. The GOOGLE_SERVICE_ACCOUNT_KEY env var check should be removed.
  • agent_id NOT NULL on kb_documents: Migration 013 removes this constraint.

Open Questions

  1. Firecrawl API key management

    • What we know: firecrawl-py SDK connects to cloud API by default; self-hosted option available
    • What's unclear: Whether to add FIRECRAWL_API_KEY as a platform-wide setting in shared/config.py or as a tenant BYO credential
    • Recommendation: Add as platform-wide FIRECRAWL_API_KEY in settings (same pattern as BRAVE_API_KEY); make it optional with graceful degradation
  2. ChannelTypeEnum compatibility for google_calendar

    • What we know: Phase 1 ADR chose TEXT + CHECK over sa.Enum to avoid migration DDL conflicts
    • What's unclear: Whether there's a CHECK constraint that needs updating, or if it's open TEXT
    • Recommendation: Inspect channel_connections table DDL in migration 001 before writing migration 013
  3. Document re-index flow

    • What we know: CONTEXT.md mentions a re-index option in the KB portal
    • What's unclear: Whether re-index deletes all existing chunks first or appends
    • Recommendation: Delete all kb_chunks for the document, then re-run ingest_document.delay() — simplest and idempotent
  4. Whisper fallback for YouTube

    • What we know: openai-whisper requires model download (~140MB minimum) and GPU for reasonable speed
    • What's unclear: Whether v1 should include Whisper at all given the infrastructure cost
    • Recommendation: Omit Whisper for v1; return error when captions unavailable; add to v2 requirements

Validation Architecture

Test Framework

| Property | Value |
|---|---|
| Framework | pytest + pytest-asyncio (existing) |
| Config file | pytest.ini or pyproject.toml [tool.pytest] at repo root |
| Quick run command | pytest tests/unit -x -q |
| Full suite command | pytest tests/unit tests/integration -x -q |

Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| CAP-01 | web_search() returns Brave results when key is set; gracefully degrades when key is missing | unit | pytest tests/unit/test_web_search.py -x | Wave 0 |
| CAP-02 | kb_search() returns ranked chunks for a query after ingestion | integration | pytest tests/integration/test_kb_search.py -x | Wave 0 |
| CAP-03 | File upload endpoint accepts PDF/DOCX/TXT, creates KbDocument with status=processing, triggers Celery task | unit+integration | pytest tests/unit/test_kb_upload.py tests/integration/test_kb_ingestion.py -x | Wave 0 |
| CAP-04 | http_request() returns correct response; rejects invalid methods; handles timeout | unit | pytest tests/unit/test_http_request.py -x | Wave 0 |
| CAP-05 | Calendar tool reads tenant token from DB, calls Google API, returns formatted events | unit (mock Google) | pytest tests/unit/test_calendar_lookup.py -x | Wave 0 |
| CAP-06 | Tool results in agent responses are natural language, not raw JSON | unit (prompt check) | pytest tests/unit/test_tool_response_format.py -x | Wave 0 |
| CAP-07 | Every tool invocation writes an audit_events row with tool name + args summary | integration | Covered by existing tests/integration/test_audit.py — extend with tool invocation cases | (extend) |

Sampling Rate

  • Per task commit: pytest tests/unit -x -q
  • Per wave merge: pytest tests/unit tests/integration -x -q
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

  • tests/unit/test_web_search.py — covers CAP-01 (mock httpx, test key-missing degradation + success path)
  • tests/unit/test_kb_upload.py — covers CAP-03 upload endpoint (mock MinIO, mock Celery task dispatch)
  • tests/unit/test_kb_ingestion.py — covers text extraction functions per format (PDF, DOCX, TXT, CSV)
  • tests/integration/test_kb_search.py — covers CAP-02 (real pgvector, insert test chunks, verify similarity search)
  • tests/integration/test_kb_ingestion.py — covers CAP-03 end-to-end (upload → task → chunks in DB)
  • tests/unit/test_http_request.py — covers CAP-04 (mock httpx, test method validation, timeout)
  • tests/unit/test_calendar_lookup.py — covers CAP-05 (mock Google API, mock DB token lookup)

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

  • PyPI: youtube-transcript-api v1.2.4 (Jan 2026) — version + API confirmed
  • PyPI: firecrawl-py — cloud + self-hosted documented
  • WebSearch 2025: pypdf for PDF extraction — confirmed as lightweight, no C-deps option
  • WebSearch 2025: Celery sync def constraint confirmed via tasks.py docstring cross-reference

Tertiary (LOW confidence)

  • Chunking parameters (500 chars, 50 overlap) — from community RAG practice, not benchmarked for this dataset
  • Firecrawl cloud vs self-hosted recommendation — based on project stage, not measured performance comparison

Metadata

Confidence breakdown:

  • Standard stack: HIGH — all libraries verified via PyPI + official docs
  • Architecture: HIGH — pattern directly extends existing Phase 1-3 Slack OAuth and Celery task patterns in codebase
  • Pitfalls: HIGH — agent_id NOT NULL issue is verified directly from migration 004 source code; token write-back is documented in google-auth source
  • Chunking strategy: MEDIUM — recommended values are community defaults, not project-specific benchmarks

Research date: 2026-03-26 Valid until: 2026-06-26 (stable domain; Google OAuth API is very stable)