konstruct/.planning/phases/10-agent-capabilities/10-01-SUMMARY.md

---
phase: 10-agent-capabilities
plan: 01
subsystem: api
tags: [knowledge-base, celery, minio, pgvector, pdf, docx, pptx, embeddings, text-extraction]
requires:
  - phase: 02-agent-features
    provides: pgvector kb_chunks table, embed_texts, kb_search tool, executor framework
  - phase: 01-foundation
    provides: Celery task infrastructure, MinIO, asyncio.run pattern, RLS session factory
provides:
  - Migration 014: kb_documents status/error_message/chunk_count columns, agent_id nullable
  - Text extractors for PDF, DOCX, PPTX, XLSX/XLS, CSV, TXT, MD
  - KB management API: upload file, ingest URL/YouTube, list, delete, reindex endpoints
  - Celery ingest_document task: download → extract → chunk → embed → store pipeline
  - Executor tenant_id/agent_id injection into all tool handlers
  - brave_api_key + firecrawl_api_key + google_client_id/secret + minio_kb_bucket in shared config
affects: [10-02, 10-03, 10-04, kb-search, agent-tools]
tech-stack:
  added:
    - pypdf (PDF text extraction)
    - python-docx (DOCX paragraph extraction)
    - python-pptx (PPTX slide text extraction)
    - openpyxl (XLSX/XLS reading via pandas)
    - pandas (spreadsheet to CSV conversion)
    - firecrawl-py (URL scraping for KB ingestion)
    - youtube-transcript-api (YouTube video transcripts)
    - google-api-python-client (Google API client)
    - google-auth-oauthlib (Google OAuth)
  patterns:
    - Lazy Celery task import in kb.py to avoid circular dependencies
    - Executor context injection pattern (tenant_id/agent_id injected after schema validation)
    - chunk_text sliding window chunker (default 500 chars, 50 overlap)
    - ingest_document_pipeline: fetch → extract → chunk → embed → store in single async transaction
key-files:
  created:
    - migrations/versions/014_kb_status.py
    - packages/orchestrator/orchestrator/tools/extractors.py
    - packages/orchestrator/orchestrator/tools/ingest.py
    - packages/shared/shared/api/kb.py
    - tests/unit/test_extractors.py
    - tests/unit/test_kb_upload.py
    - tests/unit/test_ingestion.py
    - tests/unit/test_executor_injection.py
  modified:
    - packages/shared/shared/models/kb.py (status/error_message/chunk_count columns, agent_id nullable)
    - packages/shared/shared/models/tenant.py (GOOGLE_CALENDAR added to ChannelTypeEnum)
    - packages/shared/shared/config.py (brave_api_key, firecrawl_api_key, google_client_id/secret, minio_kb_bucket)
    - packages/orchestrator/orchestrator/tools/executor.py (tenant_id/agent_id injection)
    - packages/orchestrator/orchestrator/tools/builtins/web_search.py (use settings.brave_api_key)
    - packages/orchestrator/orchestrator/tasks.py (ingest_document Celery task added)
    - packages/orchestrator/pyproject.toml (new dependencies)
    - .env.example (BRAVE_API_KEY, FIRECRAWL_API_KEY, GOOGLE_CLIENT_ID/SECRET, MINIO_KB_BUCKET)
key-decisions:
  - Migration numbered 014 (not 013) because 013 was already used by google_calendar channel type migration from prior session
  - KB is per-tenant not per-agent, so agent_id made nullable in kb_documents
  - Executor injects tenant_id/agent_id as strings after schema validation to avoid schema rejections
  - Lazy import of ingest_document task in kb.py router via _get_ingest_task() avoids shared→orchestrator circular dependency
  - ingest_document_pipeline uses ORM select for document fetch (testable) and raw SQL for chunk inserts (pgvector CAST pattern)
  - web_search migrated from os.getenv to settings.brave_api_key, consistent with platform-wide config pattern
  - chunk_text returns empty list for empty/whitespace text, not error; silent skip is safer in async pipeline
  - PDF extraction returns warning message (not exception) for image-only PDFs with < 100 chars extracted
patterns-established:
  - "Context injection pattern: executor injects tenant_id/agent_id as str kwargs after schema validation, before handler call"
  - "KB ingestion pipeline: try/except updates doc.status to error with error_message on any failure"
  - "Lazy circular dep avoidance: _get_ingest_task() function returns task at call time, imported inside function"
requirements-completed: [CAP-01, CAP-02, CAP-03, CAP-04, CAP-07]
duration: 11min
completed: 2026-03-26
---

Phase 10 Plan 01: KB Ingestion Pipeline Summary

Document ingestion pipeline for KB search: text extractors (PDF/DOCX/PPTX/XLSX/CSV/TXT/MD), Celery async ingest task, executor tenant context injection, and KB management REST API

Performance

  • Duration: 11 min
  • Started: 2026-03-26T14:59:19Z
  • Completed: 2026-03-26T15:10:06Z
  • Tasks: 2
  • Files modified: 16

Accomplishments

  • Full document text extraction for 7 format families using pypdf, python-docx, python-pptx, pandas, plus CSV/TXT/MD decode
  • KB management REST API with file upload, URL/YouTube ingest, list, delete, and reindex endpoints
  • Celery ingest_document task runs async pipeline: MinIO download → extract → chunk (500 char sliding window) → embed (all-MiniLM-L6-v2) → store kb_chunks
  • Tool executor now injects tenant_id and agent_id as string kwargs into every tool handler before invocation
  • 31 unit tests pass across all 4 test files
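
The sliding-window chunking step in the pipeline above can be sketched as follows. This is a minimal illustration, not the contents of ingest.py: the name chunk_text and the 500/50 defaults come from this summary, while the parameter names, the argument check, and the exact boundary handling are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (illustrative sketch)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text.strip():
        # Empty or whitespace-only input yields no chunks instead of an error.
        return []
    step = chunk_size - overlap  # each window starts `step` chars after the previous one
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(1000))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

With these defaults each chunk shares its last 50 characters with the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.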

Task Commits

  1. Task 1: Migration 014 (planned as 013), ORM updates, config settings, text extractors, KB API router - e8d3e8a (feat)
  2. Task 2: Celery ingestion task, executor tenant_id injection, KB search wiring - 9c7686a (feat)

Files Created/Modified

  • migrations/versions/014_kb_status.py - Migration: add status/error_message/chunk_count to kb_documents, make agent_id nullable
  • packages/shared/shared/models/kb.py - Added status/error_message/chunk_count mapped columns, agent_id nullable
  • packages/shared/shared/models/tenant.py - Added GOOGLE_CALENDAR and WEB to ChannelTypeEnum
  • packages/shared/shared/config.py - Added brave_api_key, firecrawl_api_key, google_client_id, google_client_secret, minio_kb_bucket
  • packages/shared/shared/api/kb.py - New KB management API router (5 endpoints)
  • packages/orchestrator/orchestrator/tools/extractors.py - Text extraction for all 7 formats
  • packages/orchestrator/orchestrator/tools/ingest.py - chunk_text + ingest_document_pipeline
  • packages/orchestrator/orchestrator/tasks.py - Added ingest_document Celery task
  • packages/orchestrator/orchestrator/tools/executor.py - tenant_id/agent_id injection after schema validation
  • packages/orchestrator/orchestrator/tools/builtins/web_search.py - Migrated to settings.brave_api_key
  • packages/orchestrator/pyproject.toml - Added 8 new dependencies
  • .env.example - Added BRAVE_API_KEY, FIRECRAWL_API_KEY, GOOGLE_CLIENT_ID/SECRET, MINIO_KB_BUCKET
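
As a rough picture of what the extractors module listed above does for the simplest formats, the sketch below dispatches on file extension, decodes CSV/TXT/MD bytes, and applies the "fewer than 100 extracted characters" check that this summary describes for image-only PDFs. All function and constant names here are hypothetical, and the library-backed formats (PDF, DOCX, PPTX, XLSX) are deliberately left out.

```python
from pathlib import Path

MIN_PDF_CHARS = 100  # threshold taken from the summary; the constant name is assumed
PLAIN_SUFFIXES = {".csv", ".txt", ".md"}


def decode_plain(data: bytes) -> str:
    """Decode plain-text formats, dropping a UTF-8 BOM and replacing invalid bytes."""
    return data.decode("utf-8-sig", errors="replace")


def check_pdf_text(text: str) -> str:
    """Return the text, or a warning string when almost nothing was extracted."""
    if len(text.strip()) < MIN_PDF_CHARS:
        return "[warning] PDF appears to be image-only; little or no text was extracted"
    return text


def extract_text(filename: str, data: bytes) -> str:
    """Pick an extractor by file extension (plain-text formats only in this sketch)."""
    suffix = Path(filename).suffix.lower()
    if suffix in PLAIN_SUFFIXES:
        return decode_plain(data)
    raise ValueError(f"no extractor registered for {suffix!r} in this sketch")


print(extract_text("notes.MD", b"\xef\xbb\xbf# Title"))
```

Returning a warning string rather than raising keeps a scanned PDF from failing the whole ingest; the document is still stored, just with a chunk that explains why it has no searchable text.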

Decisions Made

  • Migration numbered 014 (not 013) — 013 was already used by a google_calendar channel type migration from a prior session
  • KB is per-tenant not per-agent — agent_id made nullable in kb_documents
  • Executor injects tenant_id/agent_id as strings after schema validation to avoid triggering schema rejections
  • Lazy import of ingest_document task in kb.py via _get_ingest_task() function — avoids shared→orchestrator circular dependency at module load time
  • ingest_document_pipeline uses ORM select(KnowledgeBaseDocument) for document fetch (testable via mock) and raw SQL for chunk INSERTs (pgvector CAST pattern)

Deviations from Plan

Auto-fixed Issues

1. [Rule 3 - Blocking] Migration renumbered from 013 to 014

  • Found during: Task 1 (Migration creation)
  • Issue: Migration 013 already existed (013_google_calendar_channel.py) from a prior phase session
  • Fix: Renamed migration file to 014_kb_status.py with revision=014, down_revision=013
  • Files modified: migrations/versions/014_kb_status.py
  • Verification: File renamed, revision chain intact
  • Committed in: e8d3e8a (Task 1 commit)

2. [Rule 2 - Missing Critical] Added WEB to ChannelTypeEnum alongside GOOGLE_CALENDAR

  • Found during: Task 1 (tenant.py update)
  • Issue: WEB channel type was missing from the enum (google_calendar was not the only new type)
  • Fix: Added both WEB = "web" and GOOGLE_CALENDAR = "google_calendar" to ChannelTypeEnum
  • Files modified: packages/shared/shared/models/tenant.py
  • Committed in: e8d3e8a (Task 1 commit)

3. [Rule 1 - Bug] FastAPI Depends overrides required for KB upload tests

  • Found during: Task 1 (test_kb_upload.py)
  • Issue: Initial test approach used patch() to mock auth deps but FastAPI calls Depends directly — 422 returned
  • Fix: Updated test to use app.dependency_overrides (correct FastAPI testing pattern)
  • Files modified: tests/unit/test_kb_upload.py
  • Committed in: e8d3e8a (Task 1 commit)

Total deviations: 3 auto-fixed (1 blocking, 1 missing critical, 1 bug)

Impact on plan: All fixes necessary for correctness. No scope creep.

Issues Encountered

None beyond the deviations documented above.

User Setup Required

New environment variables needed:

  • BRAVE_API_KEY — Brave Search API key (https://brave.com/search/api/)
  • FIRECRAWL_API_KEY — Firecrawl API key for URL scraping (https://firecrawl.dev)
  • GOOGLE_CLIENT_ID / GOOGLE_CLIENT_SECRET — Google OAuth credentials
  • MINIO_KB_BUCKET — MinIO bucket for KB documents (default: kb-documents)
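
A quick way to check these variables before starting the services is sketched below. It uses only the standard library and is not the project's shared config module; the class KBSettings and the helpers load_settings and missing_keys are hypothetical names. The kb-documents default is the one stated above.

```python
import os
from dataclasses import asdict, dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class KBSettings:
    brave_api_key: str
    firecrawl_api_key: str
    google_client_id: str
    google_client_secret: str
    minio_kb_bucket: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> KBSettings:
    """Read the new variables from the environment (or from a mapping, for tests)."""
    env = os.environ if env is None else env
    return KBSettings(
        brave_api_key=env.get("BRAVE_API_KEY", ""),
        firecrawl_api_key=env.get("FIRECRAWL_API_KEY", ""),
        google_client_id=env.get("GOOGLE_CLIENT_ID", ""),
        google_client_secret=env.get("GOOGLE_CLIENT_SECRET", ""),
        minio_kb_bucket=env.get("MINIO_KB_BUCKET", "kb-documents"),
    )


def missing_keys(settings: KBSettings) -> list[str]:
    """Names of settings that are still empty."""
    return [name for name, value in asdict(settings).items() if not value]


settings = load_settings({"BRAVE_API_KEY": "example-key"})
print(settings.minio_kb_bucket, missing_keys(settings))
```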

Next Phase Readiness

  • KB ingestion pipeline is implemented end to end and covered by the unit tests listed above (31 passing)
  • kb_search tool already wired to query kb_chunks via pgvector (existing from Phase 2)
  • Executor now injects tenant context, so context-aware tools (kb_search, calendar) receive tenant_id and agent_id without the model having to supply them
  • Ready for 10-02 (calendar tool) and 10-03 (any remaining agent capability work)
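
The executor's context injection referred to above can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: validate_args is a toy strict validator standing in for whatever schema validation the executor really performs, and execute_tool and the kb_search body are hypothetical. What it takes from this summary is the ordering: validate the model-supplied arguments first, then add tenant_id and agent_id as strings, then call the handler.

```python
from typing import Any, Callable


def validate_args(args: dict[str, Any], schema: dict[str, type]) -> None:
    """Toy strict validator: reject unknown keys, missing keys, and wrong types."""
    unknown = set(args) - set(schema)
    if unknown:
        raise ValueError(f"unexpected arguments: {sorted(unknown)}")
    for name, expected in schema.items():
        if name not in args:
            raise ValueError(f"missing argument: {name}")
        if not isinstance(args[name], expected):
            raise ValueError(f"{name} must be of type {expected.__name__}")


def execute_tool(
    handler: Callable[..., Any],
    schema: dict[str, type],
    args: dict[str, Any],
    tenant_id: Any,
    agent_id: Any,
) -> Any:
    # 1. Validate only what the model supplied; the schema does not list the context keys.
    validate_args(args, schema)
    # 2. Inject context afterwards, as strings, so the strict validator never sees it.
    kwargs = {**args, "tenant_id": str(tenant_id), "agent_id": str(agent_id)}
    # 3. Call the handler with the combined arguments.
    return handler(**kwargs)


def kb_search(query: str, tenant_id: str, agent_id: str) -> str:
    """Hypothetical handler that just echoes the context it received."""
    return f"{tenant_id}/{agent_id}: {query}"


print(execute_tool(kb_search, {"query": str}, {"query": "refunds"}, 42, "agent-7"))
```

Injecting after validation also means a model cannot spoof another tenant: a tenant_id passed in the tool arguments is rejected by the validator, and the value the handler sees always comes from the executor.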

Self-Check: PASSED

All files found on disk. All commits verified in git log.


Phase: 10-agent-capabilities
Completed: 2026-03-26