Adolfo Delorenzo e56b5f885b docs(10-01): complete KB ingestion pipeline plan

2026-03-26 09:11:56 -06:00

phase: 10-agent-capabilities
plan: 01
subsystem: api
tags: [knowledge-base, celery, minio, pgvector, pdf, docx, pptx, embeddings, text-extraction]

requires:
  - phase: 02-agent-features
    provides: pgvector kb_chunks table, embed_texts, kb_search tool, executor framework
  - phase: 01-foundation
    provides: Celery task infrastructure, MinIO, asyncio.run pattern, RLS session factory

provides:
  - "Migration 014: kb_documents status/error_message/chunk_count columns, agent_id nullable"
  - Text extractors for PDF, DOCX, PPTX, XLSX/XLS, CSV, TXT, MD
  - "KB management API: upload file, ingest URL/YouTube, list, delete, reindex endpoints"
  - "Celery ingest_document task: download → extract → chunk → embed → store pipeline"
  - Executor tenant_id/agent_id injection into all tool handlers
  - brave_api_key + firecrawl_api_key + google_client_id/secret + minio_kb_bucket in shared config

affects: [10-02, 10-03, 10-04, kb-search, agent-tools]

tech-stack:
  added:
    - pypdf (PDF text extraction)
    - python-docx (DOCX paragraph extraction)
    - python-pptx (PPTX slide text extraction)
    - openpyxl (XLSX/XLS reading via pandas)
    - pandas (spreadsheet to CSV conversion)
    - firecrawl-py (URL scraping for KB ingestion)
    - youtube-transcript-api (YouTube video transcripts)
    - google-api-python-client (Google API client)
    - google-auth-oauthlib (Google OAuth)
  patterns:
    - Lazy Celery task import in kb.py to avoid circular dependencies
    - Executor context injection pattern (tenant_id/agent_id injected after schema validation)
    - chunk_text sliding window chunker (default 500 chars, 50 overlap)
    - "ingest_document_pipeline: fetch → extract → chunk → embed → store in single async transaction"

key-files:
  created:
    - migrations/versions/014_kb_status.py
    - packages/orchestrator/orchestrator/tools/extractors.py
    - packages/orchestrator/orchestrator/tools/ingest.py
    - packages/shared/shared/api/kb.py
    - tests/unit/test_extractors.py
    - tests/unit/test_kb_upload.py
    - tests/unit/test_ingestion.py
    - tests/unit/test_executor_injection.py
  modified:
    - packages/shared/shared/models/kb.py (status/error_message/chunk_count columns, agent_id nullable)
    - packages/shared/shared/models/tenant.py (GOOGLE_CALENDAR added to ChannelTypeEnum)
    - packages/shared/shared/config.py (brave_api_key, firecrawl_api_key, google_client_id/secret, minio_kb_bucket)
    - packages/orchestrator/orchestrator/tools/executor.py (tenant_id/agent_id injection)
    - packages/orchestrator/orchestrator/tools/builtins/web_search.py (use settings.brave_api_key)
    - packages/orchestrator/orchestrator/tasks.py (ingest_document Celery task added)
    - packages/orchestrator/pyproject.toml (new dependencies)
    - .env.example (BRAVE_API_KEY, FIRECRAWL_API_KEY, GOOGLE_CLIENT_ID/SECRET, MINIO_KB_BUCKET)

key-decisions:
  - Migration numbered 014 (not 013) because 013 was already used by google_calendar channel type migration from prior session
  - KB is per-tenant, not per-agent; agent_id made nullable in kb_documents
  - Executor injects tenant_id/agent_id as strings after schema validation to avoid schema rejections
  - Lazy import of ingest_document task in kb.py router via _get_ingest_task(), which avoids shared→orchestrator circular dependency
  - ingest_document_pipeline uses ORM select for document fetch (testable) and raw SQL for chunk inserts (pgvector CAST pattern)
  - web_search migrated from os.getenv to settings.brave_api_key, consistent with platform-wide config pattern
  - chunk_text returns empty list for empty/whitespace text, not error; silent skip is safer in async pipeline
  - PDF extraction returns warning message (not exception) for image-only PDFs with < 100 chars extracted

patterns-established:
  - "Context injection pattern: executor injects tenant_id/agent_id as str kwargs after schema validation, before handler call"
  - "KB ingestion pipeline: try/except updates doc.status to error with error_message on any failure"
  - "Lazy circular dep avoidance: _get_ingest_task() function returns task at call time, imported inside function"

requirements-completed: [CAP-01, CAP-02, CAP-03, CAP-04, CAP-07]

duration: 11min
completed: 2026-03-26

Phase 10 Plan 01: KB Ingestion Pipeline Summary

Document ingestion pipeline for KB search: text extractors (PDF/DOCX/PPTX/XLSX/CSV/TXT/MD), Celery async ingest task, executor tenant context injection, and KB management REST API
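
The pipeline's control flow (download → extract → chunk → embed → store, with any failure recorded on the document) can be sketched roughly as follows. This is a synchronous, stdlib-only illustration, not the project's code: the real pipeline is async and talks to MinIO and pgvector. The `Document` class, the `run_ingest` name, the injected callables, and the "processing"/"ready" status values are all assumptions. Only the "error" status and the `error_message` field come from this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Document:
    """Hypothetical stand-in for a kb_documents row."""
    id: str
    status: str = "processing"          # assumed initial status
    error_message: Optional[str] = None
    chunk_count: int = 0


def run_ingest(
    doc: Document,
    download: Callable[[str], bytes],
    extract: Callable[[bytes], str],
    chunk: Callable[[str], List[str]],
    embed: Callable[[List[str]], List[List[float]]],
    store: Callable[[str, List[str], List[List[float]]], None],
) -> Document:
    """Run the five stages in order; record any failure on the document."""
    try:
        raw = download(doc.id)
        text = extract(raw)
        chunks = chunk(text)
        vectors = embed(chunks)
        store(doc.id, chunks, vectors)
        doc.chunk_count = len(chunks)
        doc.status = "ready"            # assumed success status
    except Exception as exc:
        # Any failure marks the document as errored instead of propagating.
        doc.status = "error"
        doc.error_message = str(exc)
    return doc
```

Passing the stages in as callables keeps the sketch testable without external services. The actual module may wire its dependencies differently.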

Performance

Duration: 11 min
Started: 2026-03-26T14:59:19Z
Completed: 2026-03-26T15:10:06Z
Tasks: 2
Files modified: 16

Accomplishments

Full document text extraction for 7 format families using pypdf, python-docx, python-pptx, pandas, plus CSV/TXT/MD decode
KB management REST API with file upload, URL/YouTube ingest, list, delete, and reindex endpoints
Celery ingest_document task runs async pipeline: MinIO download → extract → chunk (500 char sliding window) → embed (all-MiniLM-L6-v2) → store kb_chunks
Tool executor now injects tenant_id and agent_id as string kwargs into every tool handler before invocation
31 unit tests pass across all 4 test files
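
A minimal sketch of a sliding-window chunker with the defaults mentioned above (500 characters, 50 overlap). The function name matches `chunk_text`, but the parameter names, the argument validation, and the stripping of surrounding whitespace are assumptions. The actual implementation in ingest.py may differ.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    text = text.strip()
    if not text:
        # Empty or whitespace-only input yields no chunks rather than an error.
        return []

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With the defaults, a 1000-character input produces windows starting at offsets 0, 450, and 900, so each chunk repeats the last 50 characters of the previous one.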

Task Commits

Task 1: Migration 014 (planned as 013), ORM updates, config settings, text extractors, KB API router - e8d3e8a (feat)
Task 2: Celery ingestion task, executor tenant_id injection, KB search wiring - 9c7686a (feat)

Files Created/Modified

migrations/versions/014_kb_status.py - Migration: add status/error_message/chunk_count to kb_documents, make agent_id nullable
packages/shared/shared/models/kb.py - Added status/error_message/chunk_count mapped columns, agent_id nullable
packages/shared/shared/models/tenant.py - Added GOOGLE_CALENDAR and WEB to ChannelTypeEnum
packages/shared/shared/config.py - Added brave_api_key, firecrawl_api_key, google_client_id, google_client_secret, minio_kb_bucket
packages/shared/shared/api/kb.py - New KB management API router (5 endpoints)
packages/orchestrator/orchestrator/tools/extractors.py - Text extraction for all 7 formats
packages/orchestrator/orchestrator/tools/ingest.py - chunk_text + ingest_document_pipeline
packages/orchestrator/orchestrator/tasks.py - Added ingest_document Celery task
packages/orchestrator/orchestrator/tools/executor.py - tenant_id/agent_id injection after schema validation
packages/orchestrator/orchestrator/tools/builtins/web_search.py - Migrated to settings.brave_api_key
packages/orchestrator/pyproject.toml - Added 8 new dependencies
.env.example - Added BRAVE_API_KEY, FIRECRAWL_API_KEY, GOOGLE_CLIENT_ID/SECRET, MINIO_KB_BUCKET

Decisions Made

Migration numbered 014 (not 013) — 013 was already used by a google_calendar channel type migration from a prior session
KB is per-tenant not per-agent — agent_id made nullable in kb_documents
Executor injects tenant_id/agent_id as strings after schema validation to avoid triggering schema rejections
Lazy import of ingest_document task in kb.py via _get_ingest_task() function — avoids shared→orchestrator circular dependency at module load time
ingest_document_pipeline uses ORM select(KnowledgeBaseDocument) for document fetch (testable via mock) and raw SQL for chunk INSERTs (pgvector CAST pattern)

Deviations from Plan

Auto-fixed Issues

1. [Rule 3 - Blocking] Migration renumbered from 013 to 014

Found during: Task 1 (Migration creation)
Issue: Migration 013 already existed (013_google_calendar_channel.py) from a prior phase session
Fix: Renamed migration file to 014_kb_status.py with revision=014, down_revision=013
Files modified: migrations/versions/014_kb_status.py
Verification: File renamed, revision chain intact
Committed in: e8d3e8a (Task 1 commit)

2. [Rule 2 - Missing Critical] Added WEB to ChannelTypeEnum alongside GOOGLE_CALENDAR

Found during: Task 1 (tenant.py update)
Issue: The WEB channel type was also missing from the enum; GOOGLE_CALENDAR was not the only new type that needed adding
Fix: Added both WEB = "web" and GOOGLE_CALENDAR = "google_calendar" to ChannelTypeEnum
Files modified: packages/shared/shared/models/tenant.py
Committed in: e8d3e8a (Task 1 commit)
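
For reference, the two added members would look roughly like this. The string values are taken from the fix above. The `str` mixin is an assumption, and the enum's existing members are omitted because they are not listed in this summary.

```python
from enum import Enum


class ChannelTypeEnum(str, Enum):
    # Existing members of the real enum are not shown in this summary
    # and are omitted here; only the two additions appear.
    WEB = "web"
    GOOGLE_CALENDAR = "google_calendar"
```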

3. [Rule 1 - Bug] FastAPI Depends overrides required for KB upload tests

Found during: Task 1 (test_kb_upload.py)
Issue: The initial tests used patch() to mock the auth dependencies, but FastAPI calls the callables already captured by Depends, so the patches had no effect and the endpoint returned 422
Fix: Updated test to use app.dependency_overrides (correct FastAPI testing pattern)
Files modified: tests/unit/test_kb_upload.py
Committed in: e8d3e8a (Task 1 commit)

Total deviations: 3 auto-fixed (1 blocking, 1 missing critical, 1 bug)
Impact on plan: All fixes necessary for correctness. No scope creep.

Issues Encountered

None beyond the deviations documented above.

User Setup Required

New environment variables needed:

BRAVE_API_KEY — Brave Search API key (https://brave.com/search/api/)
FIRECRAWL_API_KEY — Firecrawl API key for URL scraping (https://firecrawl.dev)
GOOGLE_CLIENT_ID / GOOGLE_CLIENT_SECRET — Google OAuth credentials
MINIO_KB_BUCKET — MinIO bucket for KB documents (default: kb-documents)
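
As a rough illustration of how these variables map onto the settings names used elsewhere in this summary, here is a stdlib-only sketch. The `KBSettings` class and `load_kb_settings` function are hypothetical stand-ins; the project's actual config class in packages/shared/shared/config.py is not shown here and may be built differently. The `kb-documents` default is the one listed above, and the empty-string defaults for the keys are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class KBSettings:
    """Hypothetical container for the settings added in this plan."""
    brave_api_key: str
    firecrawl_api_key: str
    google_client_id: str
    google_client_secret: str
    minio_kb_bucket: str


def load_kb_settings(env: Optional[Mapping[str, str]] = None) -> KBSettings:
    """Read the new variables from the environment (or a supplied mapping)."""
    env = os.environ if env is None else env
    return KBSettings(
        brave_api_key=env.get("BRAVE_API_KEY", ""),
        firecrawl_api_key=env.get("FIRECRAWL_API_KEY", ""),
        google_client_id=env.get("GOOGLE_CLIENT_ID", ""),
        google_client_secret=env.get("GOOGLE_CLIENT_SECRET", ""),
        # Documented default bucket name.
        minio_kb_bucket=env.get("MINIO_KB_BUCKET", "kb-documents"),
    )
```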

Next Phase Readiness

KB ingestion pipeline is fully functional and tested
kb_search tool already wired to query kb_chunks via pgvector (existing from Phase 2)
Executor now injects tenant context — all context-aware tools (kb_search, calendar) will work correctly
Ready for 10-02 (calendar tool) and 10-03 (any remaining agent capability work)
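
The executor's context injection can be sketched as below, assuming a simplified executor. The allowed-argument check stands in for the real schema validation, the `execute_tool` and `echo_handler` names are hypothetical, and the sketch is synchronous for brevity. What it aims to show is the ordering described in this summary: validate the model-supplied arguments first, then add tenant_id and agent_id as strings.

```python
import uuid
from typing import Any, Callable, Dict, Iterable


def execute_tool(
    handler: Callable[..., Any],
    allowed_args: Iterable[str],
    args: Dict[str, Any],
    tenant_id: Any,
    agent_id: Any,
) -> Any:
    """Validate caller-supplied args, then inject tenant context."""
    # Step 1: stand-in for schema validation. Only declared arguments pass,
    # so a caller cannot supply its own tenant_id or agent_id.
    unexpected = set(args) - set(allowed_args)
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")

    # Step 2: inject context after validation, so the extra keys are never
    # seen by the schema check and cannot trigger a rejection.
    kwargs = dict(args)
    kwargs["tenant_id"] = str(tenant_id)
    kwargs["agent_id"] = str(agent_id)
    return handler(**kwargs)


def echo_handler(query: str, tenant_id: str, agent_id: str) -> Dict[str, str]:
    """Hypothetical tool handler that returns what it received."""
    return {"query": query, "tenant_id": tenant_id, "agent_id": agent_id}
```

Injecting before validation would require every tool schema to declare the two context fields. Doing it afterwards keeps the schemas limited to what the model is allowed to supply.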

Self-Check: PASSED

All files found on disk. All commits verified in git log.

Phase: 10-agent-capabilities
Completed: 2026-03-26
