docs(10-01): complete KB ingestion pipeline plan

2026-03-26 09:11:56 -06:00
parent a64634ff90
commit e56b5f885b
5 changed files with 339 additions and 22 deletions
--- a/.planning/phases/10-agent-capabilities/10-01-SUMMARY.md
+++ b/.planning/phases/10-agent-capabilities/10-01-SUMMARY.md
@@ -0,0 +1,188 @@
+---
+phase: 10-agent-capabilities
+plan: 01
+subsystem: api
+tags: [knowledge-base, celery, minio, pgvector, pdf, docx, pptx, embeddings, text-extraction]
+
+# Dependency graph
+requires:
+  - phase: 02-agent-features
+    provides: pgvector kb_chunks table, embed_texts, kb_search tool, executor framework
+  - phase: 01-foundation
+    provides: Celery task infrastructure, MinIO, asyncio.run pattern, RLS session factory
+
+provides:
+  - Migration 014: kb_documents status/error_message/chunk_count columns, agent_id nullable
+  - Text extractors for PDF, DOCX, PPTX, XLSX/XLS, CSV, TXT, MD
+  - KB management API: upload file, ingest URL/YouTube, list, delete, reindex endpoints
+  - Celery ingest_document task: download → extract → chunk → embed → store pipeline
+  - Executor tenant_id/agent_id injection into all tool handlers
+  - brave_api_key + firecrawl_api_key + google_client_id/secret + minio_kb_bucket in shared config
+
+affects: [10-02, 10-03, 10-04, kb-search, agent-tools]
+
+# Tech tracking
+tech-stack:
+  added:
+    - pypdf (PDF text extraction)
+    - python-docx (DOCX paragraph extraction)
+    - python-pptx (PPTX slide text extraction)
+    - openpyxl (XLSX reading via pandas; legacy XLS needs a separate engine such as xlrd)
+    - pandas (spreadsheet to CSV conversion)
+    - firecrawl-py (URL scraping for KB ingestion)
+    - youtube-transcript-api (YouTube video transcripts)
+    - google-api-python-client (Google API client)
+    - google-auth-oauthlib (Google OAuth)
+  patterns:
+    - Lazy Celery task import in kb.py to avoid circular dependencies
+    - Executor context injection pattern (tenant_id/agent_id injected after schema validation)
+    - chunk_text sliding window chunker (default 500 chars, 50 overlap)
+    - ingest_document_pipeline: fetch → extract → chunk → embed → store in single async transaction
+
+key-files:
+  created:
+    - migrations/versions/014_kb_status.py
+    - packages/orchestrator/orchestrator/tools/extractors.py
+    - packages/orchestrator/orchestrator/tools/ingest.py
+    - packages/shared/shared/api/kb.py
+    - tests/unit/test_extractors.py
+    - tests/unit/test_kb_upload.py
+    - tests/unit/test_ingestion.py
+    - tests/unit/test_executor_injection.py
+  modified:
+    - packages/shared/shared/models/kb.py (status/error_message/chunk_count columns, agent_id nullable)
+    - packages/shared/shared/models/tenant.py (GOOGLE_CALENDAR and WEB added to ChannelTypeEnum)
+    - packages/shared/shared/config.py (brave_api_key, firecrawl_api_key, google_client_id/secret, minio_kb_bucket)
+    - packages/orchestrator/orchestrator/tools/executor.py (tenant_id/agent_id injection)
+    - packages/orchestrator/orchestrator/tools/builtins/web_search.py (use settings.brave_api_key)
+    - packages/orchestrator/orchestrator/tasks.py (ingest_document Celery task added)
+    - packages/orchestrator/pyproject.toml (new dependencies)
+    - .env.example (BRAVE_API_KEY, FIRECRAWL_API_KEY, GOOGLE_CLIENT_ID/SECRET, MINIO_KB_BUCKET)
+
+key-decisions:
+  - "Migration numbered 014 (not 013) — 013 was already used by google_calendar channel type migration from prior session"
+  - "KB is per-tenant not per-agent — agent_id made nullable in kb_documents"
+  - "Executor injects tenant_id/agent_id as strings after schema validation to avoid schema rejections"
+  - "Lazy import of ingest_document task in kb.py router via _get_ingest_task() — avoids shared→orchestrator circular dependency"
+  - "ingest_document_pipeline uses ORM select for document fetch (testable) and raw SQL for chunk inserts (pgvector CAST pattern)"
+  - "web_search migrated from os.getenv to settings.brave_api_key — consistent with platform-wide config pattern"
+  - "chunk_text returns empty list for empty/whitespace text, not error — silent skip is safer in async pipeline"
+  - "PDF extraction returns warning message (not exception) for image-only PDFs with < 100 chars extracted"
+
+patterns-established:
+  - "Context injection pattern: executor injects tenant_id/agent_id as str kwargs after schema validation, before handler call"
+  - "KB ingestion pipeline: try/except updates doc.status to error with error_message on any failure"
+  - "Lazy circular dep avoidance: _get_ingest_task() function returns task at call time, imported inside function"
+
+requirements-completed: [CAP-01, CAP-02, CAP-03, CAP-04, CAP-07]
+
+# Metrics
+duration: 11min
+completed: 2026-03-26
+---
+
+# Phase 10 Plan 01: KB Ingestion Pipeline Summary
+
+**Document ingestion pipeline for KB search: text extractors (PDF/DOCX/PPTX/XLSX/CSV/TXT/MD), Celery async ingest task, executor tenant context injection, and KB management REST API**
+
+## Performance
+
+- **Duration:** 11 min
+- **Started:** 2026-03-26T14:59:19Z
+- **Completed:** 2026-03-26T15:10:06Z
+- **Tasks:** 2
+- **Files modified:** 16
+
+## Accomplishments
+
+- Full document text extraction for 7 format families using pypdf, python-docx, python-pptx, pandas, plus CSV/TXT/MD decode
+- KB management REST API with file upload, URL/YouTube ingest, list, delete, and reindex endpoints
+- Celery `ingest_document` task runs async pipeline: MinIO download → extract → chunk (500 char sliding window) → embed (all-MiniLM-L6-v2) → store kb_chunks
+- Tool executor now injects `tenant_id` and `agent_id` as string kwargs into every tool handler before invocation
+- 31 unit tests pass across all 4 test files
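
The sliding-window chunking step mentioned above (500 characters, 50 overlap, empty list for blank input) could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in `ingest.py`: the parameter names `chunk_size` and `overlap`, the stripping of surrounding whitespace, and the `ValueError` guard are all assumed.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (illustrative sketch)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    text = text.strip()
    if not text:
        # Empty or whitespace-only input is skipped silently rather than raising.
        return []

    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, consecutive chunks share 50 characters, so a sentence cut at a window boundary still appears whole in at least one chunk when it is shorter than the overlap.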
+
+## Task Commits
+
+1. **Task 1: Migration 014, ORM updates, config settings, text extractors, KB API router** - `e8d3e8a` (feat)
+2. **Task 2: Celery ingestion task, executor tenant_id injection, KB search wiring** - `9c7686a` (feat)
+
+## Files Created/Modified
+
+- `migrations/versions/014_kb_status.py` - Migration: add status/error_message/chunk_count to kb_documents, make agent_id nullable
+- `packages/shared/shared/models/kb.py` - Added status/error_message/chunk_count mapped columns, agent_id nullable
+- `packages/shared/shared/models/tenant.py` - Added GOOGLE_CALENDAR and WEB to ChannelTypeEnum
+- `packages/shared/shared/config.py` - Added brave_api_key, firecrawl_api_key, google_client_id, google_client_secret, minio_kb_bucket
+- `packages/shared/shared/api/kb.py` - New KB management API router (5 endpoints)
+- `packages/orchestrator/orchestrator/tools/extractors.py` - Text extraction for all 7 formats
+- `packages/orchestrator/orchestrator/tools/ingest.py` - chunk_text + ingest_document_pipeline
+- `packages/orchestrator/orchestrator/tasks.py` - Added ingest_document Celery task
+- `packages/orchestrator/orchestrator/tools/executor.py` - tenant_id/agent_id injection after schema validation
+- `packages/orchestrator/orchestrator/tools/builtins/web_search.py` - Migrated to settings.brave_api_key
+- `packages/orchestrator/pyproject.toml` - Added 9 new dependencies
+- `.env.example` - Added BRAVE_API_KEY, FIRECRAWL_API_KEY, GOOGLE_CLIENT_ID/SECRET, MINIO_KB_BUCKET
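
For orientation, the failure handling that the front matter describes for `ingest_document_pipeline` (any failure sets the document status to error and records `error_message`) might be structured like this. The sketch is synchronous and takes each stage as a plain callable so it runs on its own; the `Doc` class, the `run_ingest` name, and the `"processing"` and `"ready"` status values are assumptions. Only the `"error"` status and the stage order come from the summary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a kb_documents row."""
    id: str
    status: str = "processing"          # assumed initial status
    error_message: Optional[str] = None
    chunk_count: int = 0


def run_ingest(doc: Doc, fetch: Callable, extract: Callable,
               chunk: Callable, embed: Callable, store: Callable) -> Doc:
    """Run fetch -> extract -> chunk -> embed -> store, recording any failure."""
    try:
        raw = fetch(doc)                 # e.g. download bytes from object storage
        text = extract(raw)              # format-specific text extraction
        chunks = chunk(text)             # may legitimately be an empty list
        vectors = embed(chunks) if chunks else []
        store(doc, chunks, vectors)      # persist chunks with their embeddings
        doc.chunk_count = len(chunks)
        doc.status = "ready"             # assumed success status
    except Exception as exc:
        # Any stage failure is captured on the document instead of propagating.
        doc.status = "error"
        doc.error_message = str(exc)
    return doc
```

Recording the error on the row rather than re-raising lets a list endpoint show why a document failed without anyone inspecting worker logs.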
+
+## Decisions Made
+
+- Migration numbered 014 (not 013) — 013 was already used by a google_calendar channel type migration from a prior session
+- KB is per-tenant not per-agent — agent_id made nullable in kb_documents
+- Executor injects tenant_id/agent_id as strings after schema validation to avoid triggering schema rejections
+- Lazy import of ingest_document task in kb.py via `_get_ingest_task()` function — avoids shared→orchestrator circular dependency at module load time
+- `ingest_document_pipeline` uses ORM `select(KnowledgeBaseDocument)` for document fetch (testable via mock) and raw SQL for chunk INSERTs (pgvector CAST pattern)
+
+## Deviations from Plan
+
+### Auto-fixed Issues
+
+**1. [Rule 3 - Blocking] Migration renumbered from 013 to 014**
+- **Found during:** Task 1 (Migration creation)
+- **Issue:** Migration 013 already existed (`013_google_calendar_channel.py`) from a prior phase session
+- **Fix:** Renamed migration file to `014_kb_status.py` with revision=014, down_revision=013
+- **Files modified:** migrations/versions/014_kb_status.py
+- **Verification:** File renamed, revision chain intact
+- **Committed in:** e8d3e8a (Task 1 commit)
+
+**2. [Rule 2 - Missing Critical] Added WEB to ChannelTypeEnum alongside GOOGLE_CALENDAR**
+- **Found during:** Task 1 (tenant.py update)
+- **Issue:** The WEB channel type was also missing from the enum; GOOGLE_CALENDAR was not the only value that needed adding
+- **Fix:** Added both `WEB = "web"` and `GOOGLE_CALENDAR = "google_calendar"` to ChannelTypeEnum
+- **Files modified:** packages/shared/shared/models/tenant.py
+- **Committed in:** e8d3e8a (Task 1 commit)
+
+**3. [Rule 1 - Bug] FastAPI Depends overrides required for KB upload tests**
+- **Found during:** Task 1 (test_kb_upload.py)
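+- **Issue:** The initial tests used `patch()` to mock the auth dependencies, but FastAPI resolves `Depends` from the callable captured when the route is defined, so the patch had no effect, the real dependencies still ran, and the request returned 422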
+- **Fix:** Updated test to use `app.dependency_overrides` (correct FastAPI testing pattern)
+- **Files modified:** tests/unit/test_kb_upload.py
+- **Committed in:** e8d3e8a (Task 1 commit)
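
Why patching fails here while an override map works can be shown with a small standard-library model. This is a simplified toy, not FastAPI's internals: the `Route` class and its `overrides` dict are hypothetical stand-ins for a route and for `app.dependency_overrides`, and `deps` stands in for the module that defines the auth dependency.

```python
import types
from unittest.mock import patch


def get_current_user():
    return "real-user"


# Stand-in for the module that exposes the dependency by name.
deps = types.SimpleNamespace(get_current_user=get_current_user)


class Route:
    """Toy route that captures its dependency callable at definition time."""

    def __init__(self, dependency):
        self.dependency = dependency   # reference captured once, like Depends(...)
        self.overrides = {}            # keyed by the ORIGINAL callable

    def call(self):
        dep = self.overrides.get(self.dependency, self.dependency)
        return dep()


route = Route(deps.get_current_user)

# Patching rebinds the name on the namespace, not the captured reference.
with patch.object(deps, "get_current_user", lambda: "patched"):
    name_lookup_result = deps.get_current_user()  # sees the patch
    patched_result = route.call()                 # still calls the original

# An override keyed by the original callable is consulted at call time.
route.overrides[get_current_user] = lambda: "test-user"
overridden_result = route.call()
```

In a real FastAPI test the equivalent step is assigning a replacement to `app.dependency_overrides[original_dependency]` and clearing the dict afterwards.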
+
+---
+
+**Total deviations:** 3 auto-fixed (1 blocking, 1 missing critical, 1 bug)
+**Impact on plan:** All fixes necessary for correctness. No scope creep.
+
+## Issues Encountered
+
+None beyond the deviations documented above.
+
+## User Setup Required
+
+New environment variables needed:
+- `BRAVE_API_KEY` — Brave Search API key (https://brave.com/search/api/)
+- `FIRECRAWL_API_KEY` — Firecrawl API key for URL scraping (https://firecrawl.dev)
+- `GOOGLE_CLIENT_ID` / `GOOGLE_CLIENT_SECRET` — Google OAuth credentials
+- `MINIO_KB_BUCKET` — MinIO bucket for KB documents (default: `kb-documents`)
+
+## Next Phase Readiness
+
+- KB ingestion pipeline is implemented and covered by unit tests
+- kb_search tool already wired to query kb_chunks via pgvector (existing from Phase 2)
+- Executor now injects tenant context, so context-aware tools (kb_search, calendar) receive `tenant_id` and `agent_id` on every call
+- Ready for 10-02 (calendar tool) and 10-03 (any remaining agent capability work)
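
The executor injection noted above depends on ordering: model-supplied arguments are validated first, and `tenant_id`/`agent_id` are added afterwards so a strict schema does not reject them. A minimal sketch follows, assuming a JSON-Schema-like dict and hypothetical names (`validate_args`, `execute_tool`); the real executor's validation is not shown in this summary.

```python
from typing import Any, Callable


def validate_args(args: dict[str, Any], schema: dict[str, Any]) -> None:
    """Tiny stand-in for schema validation: reject unknown or missing keys."""
    allowed = set(schema.get("properties", {}))
    unknown = set(args) - allowed
    if unknown:
        raise ValueError(f"unexpected arguments: {sorted(unknown)}")
    missing = set(schema.get("required", [])) - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")


def execute_tool(handler: Callable[..., Any], schema: dict[str, Any],
                 args: dict[str, Any], tenant_id: Any, agent_id: Any) -> Any:
    """Validate caller-supplied args, then inject tenant context as strings."""
    validate_args(args, schema)              # sees only the caller's arguments
    kwargs = dict(args)                      # do not mutate the caller's dict
    kwargs["tenant_id"] = str(tenant_id)     # injected after validation
    kwargs["agent_id"] = str(agent_id)
    return handler(**kwargs)


def kb_search(query: str, tenant_id: str, agent_id: str) -> str:
    """Hypothetical handler that relies on the injected context."""
    return f"{tenant_id}/{agent_id}: {query}"


KB_SEARCH_SCHEMA = {"properties": {"query": {"type": "string"}}, "required": ["query"]}
```

Because the context comes from the executor rather than from the tool call itself, a handler cannot be steered to another tenant by model-supplied arguments.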
+
+## Self-Check: PASSED
+
+All files found on disk. All commits verified in git log.
+
+---
+*Phase: 10-agent-capabilities*
+*Completed: 2026-03-26*