docs(10-01): complete KB ingestion pipeline plan
188
.planning/phases/10-agent-capabilities/10-01-SUMMARY.md
Normal file
@@ -0,0 +1,188 @@
---
phase: 10-agent-capabilities
plan: 01
subsystem: api
tags: [knowledge-base, celery, minio, pgvector, pdf, docx, pptx, embeddings, text-extraction]

# Dependency graph
requires:
  - phase: 02-agent-features
    provides: pgvector kb_chunks table, embed_texts, kb_search tool, executor framework
  - phase: 01-foundation
    provides: Celery task infrastructure, MinIO, asyncio.run pattern, RLS session factory

provides:
  - Migration 014: kb_documents status/error_message/chunk_count columns, agent_id nullable
  - Text extractors for PDF, DOCX, PPTX, XLSX/XLS, CSV, TXT, MD
  - KB management API: upload file, ingest URL/YouTube, list, delete, reindex endpoints
  - Celery ingest_document task: download → extract → chunk → embed → store pipeline
  - Executor tenant_id/agent_id injection into all tool handlers
  - brave_api_key + firecrawl_api_key + google_client_id/secret + minio_kb_bucket in shared config

affects: [10-02, 10-03, 10-04, kb-search, agent-tools]

# Tech tracking
tech-stack:
  added:
    - pypdf (PDF text extraction)
    - python-docx (DOCX paragraph extraction)
    - python-pptx (PPTX slide text extraction)
    - openpyxl (XLSX/XLS reading via pandas)
    - pandas (spreadsheet to CSV conversion)
    - firecrawl-py (URL scraping for KB ingestion)
    - youtube-transcript-api (YouTube video transcripts)
    - google-api-python-client (Google API client)
    - google-auth-oauthlib (Google OAuth)
  patterns:
    - Lazy Celery task import in kb.py to avoid circular dependencies
    - Executor context injection pattern (tenant_id/agent_id injected after schema validation)
    - chunk_text sliding window chunker (default 500 chars, 50 overlap)
    - ingest_document_pipeline: fetch → extract → chunk → embed → store in single async transaction

key-files:
  created:
    - migrations/versions/014_kb_status.py
    - packages/orchestrator/orchestrator/tools/extractors.py
    - packages/orchestrator/orchestrator/tools/ingest.py
    - packages/shared/shared/api/kb.py
    - tests/unit/test_extractors.py
    - tests/unit/test_kb_upload.py
    - tests/unit/test_ingestion.py
    - tests/unit/test_executor_injection.py
  modified:
    - packages/shared/shared/models/kb.py (status/error_message/chunk_count columns, agent_id nullable)
    - packages/shared/shared/models/tenant.py (GOOGLE_CALENDAR and WEB added to ChannelTypeEnum)
    - packages/shared/shared/config.py (brave_api_key, firecrawl_api_key, google_client_id/secret, minio_kb_bucket)
    - packages/orchestrator/orchestrator/tools/executor.py (tenant_id/agent_id injection)
    - packages/orchestrator/orchestrator/tools/builtins/web_search.py (use settings.brave_api_key)
    - packages/orchestrator/orchestrator/tasks.py (ingest_document Celery task added)
    - packages/orchestrator/pyproject.toml (new dependencies)
    - .env.example (BRAVE_API_KEY, FIRECRAWL_API_KEY, GOOGLE_CLIENT_ID/SECRET, MINIO_KB_BUCKET)

key-decisions:
  - "Migration numbered 014 (not 013) — 013 was already used by google_calendar channel type migration from prior session"
  - "KB is per-tenant not per-agent — agent_id made nullable in kb_documents"
  - "Executor injects tenant_id/agent_id as strings after schema validation to avoid schema rejections"
  - "Lazy import of ingest_document task in kb.py router via _get_ingest_task() — avoids shared→orchestrator circular dependency"
  - "ingest_document_pipeline uses ORM select for document fetch (testable) and raw SQL for chunk inserts (pgvector CAST pattern)"
  - "web_search migrated from os.getenv to settings.brave_api_key — consistent with platform-wide config pattern"
  - "chunk_text returns empty list for empty/whitespace text, not error — silent skip is safer in async pipeline"
  - "PDF extraction returns warning message (not exception) for image-only PDFs with < 100 chars extracted"

patterns-established:
  - "Context injection pattern: executor injects tenant_id/agent_id as str kwargs after schema validation, before handler call"
  - "KB ingestion pipeline: try/except updates doc.status to error with error_message on any failure"
  - "Lazy circular dep avoidance: _get_ingest_task() function returns task at call time, imported inside function"

requirements-completed: [CAP-01, CAP-02, CAP-03, CAP-04, CAP-07]

# Metrics
duration: 11min
completed: 2026-03-26
---

# Phase 10 Plan 01: KB Ingestion Pipeline Summary

**Document ingestion pipeline for KB search: text extractors (PDF/DOCX/PPTX/XLSX/CSV/TXT/MD), Celery async ingest task, executor tenant context injection, and KB management REST API**

## Performance

- **Duration:** 11 min
- **Started:** 2026-03-26T14:59:19Z
- **Completed:** 2026-03-26T15:10:06Z
- **Tasks:** 2
- **Files modified:** 16

## Accomplishments

- Full document text extraction for 7 format families using pypdf, python-docx, python-pptx, pandas, plus CSV/TXT/MD decode
- KB management REST API with file upload, URL/YouTube ingest, list, delete, and reindex endpoints
- Celery `ingest_document` task runs async pipeline: MinIO download → extract → chunk (500 char sliding window) → embed (all-MiniLM-L6-v2) → store kb_chunks
- Tool executor now injects `tenant_id` and `agent_id` as string kwargs into every tool handler before invocation
- 31 unit tests pass across all 4 test files
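
The chunking step of the pipeline above can be sketched as follows. This is a minimal illustration, assuming character-based windows with the 500/50 defaults and the empty-list behaviour described in this summary; the function body is hypothetical and the real `chunk_text` in `ingest.py` may differ.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (illustrative sketch)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    text = text.strip()
    if not text:
        # Empty or whitespace-only input is skipped silently, not treated as an error.
        return []

    step = chunk_size - overlap  # each window starts `step` chars after the previous one
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With the defaults, consecutive chunks share 50 characters, so a sentence cut at a window boundary still appears intact in one of the two neighbouring chunks.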

## Task Commits

1. **Task 1: Migration 014 (planned as 013), ORM updates, config settings, text extractors, KB API router** - `e8d3e8a` (feat)
2. **Task 2: Celery ingestion task, executor tenant_id injection, KB search wiring** - `9c7686a` (feat)

## Files Created/Modified

- `migrations/versions/014_kb_status.py` - Migration: add status/error_message/chunk_count to kb_documents, make agent_id nullable
- `packages/shared/shared/models/kb.py` - Added status/error_message/chunk_count mapped columns, agent_id nullable
- `packages/shared/shared/models/tenant.py` - Added GOOGLE_CALENDAR and WEB to ChannelTypeEnum
- `packages/shared/shared/config.py` - Added brave_api_key, firecrawl_api_key, google_client_id, google_client_secret, minio_kb_bucket
- `packages/shared/shared/api/kb.py` - New KB management API router (5 endpoints)
- `packages/orchestrator/orchestrator/tools/extractors.py` - Text extraction for all 7 formats
- `packages/orchestrator/orchestrator/tools/ingest.py` - chunk_text + ingest_document_pipeline
- `packages/orchestrator/orchestrator/tasks.py` - Added ingest_document Celery task
- `packages/orchestrator/orchestrator/tools/executor.py` - tenant_id/agent_id injection after schema validation
- `packages/orchestrator/orchestrator/tools/builtins/web_search.py` - Migrated to settings.brave_api_key
- `packages/orchestrator/pyproject.toml` - Added 8 new dependencies
- `.env.example` - Added BRAVE_API_KEY, FIRECRAWL_API_KEY, GOOGLE_CLIENT_ID/SECRET, MINIO_KB_BUCKET

## Decisions Made

- Migration numbered 014 (not 013) — 013 was already used by a google_calendar channel type migration from a prior session
- KB is per-tenant not per-agent — agent_id made nullable in kb_documents
- Executor injects tenant_id/agent_id as strings after schema validation to avoid triggering schema rejections
- Lazy import of ingest_document task in kb.py via `_get_ingest_task()` function — avoids shared→orchestrator circular dependency at module load time
- `ingest_document_pipeline` uses ORM `select(KnowledgeBaseDocument)` for document fetch (testable via mock) and raw SQL for chunk INSERTs (pgvector CAST pattern)

## Deviations from Plan

### Auto-fixed Issues

**1. [Rule 3 - Blocking] Migration renumbered from 013 to 014**
- **Found during:** Task 1 (Migration creation)
- **Issue:** Migration 013 already existed (`013_google_calendar_channel.py`) from a prior phase session
- **Fix:** Renamed migration file to `014_kb_status.py` with revision=014, down_revision=013
- **Files modified:** migrations/versions/014_kb_status.py
- **Verification:** File renamed, revision chain intact
- **Committed in:** e8d3e8a (Task 1 commit)

**2. [Rule 2 - Missing Critical] Added WEB to ChannelTypeEnum alongside GOOGLE_CALENDAR**
- **Found during:** Task 1 (tenant.py update)
- **Issue:** WEB channel type was missing from the enum (google_calendar was not the only new type)
- **Fix:** Added both `WEB = "web"` and `GOOGLE_CALENDAR = "google_calendar"` to ChannelTypeEnum
- **Files modified:** packages/shared/shared/models/tenant.py
- **Committed in:** e8d3e8a (Task 1 commit)

**3. [Rule 1 - Bug] FastAPI Depends overrides required for KB upload tests**
- **Found during:** Task 1 (test_kb_upload.py)
- **Issue:** The initial tests used `patch()` to mock the auth dependencies, but FastAPI resolves `Depends` from the callable captured when the route is defined, so the patch had no effect and the endpoint returned 422
- **Fix:** Updated test to use `app.dependency_overrides` (correct FastAPI testing pattern)
- **Files modified:** tests/unit/test_kb_upload.py
- **Committed in:** e8d3e8a (Task 1 commit)
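
Why the third deviation needed overrides rather than `patch()` can be shown with a toy model. This is not FastAPI code: `MiniApp`, `auth`, and the route below are hypothetical stand-ins that only imitate the relevant behaviour, namely that the dependency callable is captured when the route is declared and an overrides mapping is consulted at call time.

```python
from types import SimpleNamespace
from unittest.mock import patch


def _real_get_current_user() -> str:
    raise PermissionError("no credentials supplied")


# Stand-in for the module that owns the auth dependency.
auth = SimpleNamespace(get_current_user=_real_get_current_user)


class MiniApp:
    """Toy framework: stores the dependency object at route definition time."""

    def __init__(self) -> None:
        self.dependency_overrides: dict = {}
        self._routes: dict = {}

    def route(self, path, dependency):
        def decorator(handler):
            self._routes[path] = (handler, dependency)  # reference captured here
            return handler
        return decorator

    def call(self, path):
        handler, dependency = self._routes[path]
        # Overrides are keyed by the original callable and looked up per call.
        dependency = self.dependency_overrides.get(dependency, dependency)
        return handler(dependency())


app = MiniApp()


@app.route("/kb/upload", auth.get_current_user)
def upload(user: str) -> str:
    return f"uploaded by {user}"


def call_with_patch() -> str:
    # Rebinding the module attribute does not touch the captured reference.
    with patch.object(auth, "get_current_user", lambda: "tester"):
        try:
            return app.call("/kb/upload")
        except PermissionError:
            return "still unauthenticated"


def call_with_override() -> str:
    app.dependency_overrides[_real_get_current_user] = lambda: "tester"
    try:
        return app.call("/kb/upload")
    finally:
        app.dependency_overrides.clear()  # keep tests isolated from each other
```

In the toy model `call_with_patch()` still hits the real dependency, while `call_with_override()` reaches the handler with the fake user, which mirrors the behaviour the tests ran into.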

---

**Total deviations:** 3 auto-fixed (1 blocking, 1 missing critical, 1 bug)
**Impact on plan:** All fixes necessary for correctness. No scope creep.

## Issues Encountered

None beyond the deviations documented above.

## User Setup Required

New environment variables needed:
- `BRAVE_API_KEY` — Brave Search API key (https://brave.com/search/api/)
- `FIRECRAWL_API_KEY` — Firecrawl API key for URL scraping (https://firecrawl.dev)
- `GOOGLE_CLIENT_ID` / `GOOGLE_CLIENT_SECRET` — Google OAuth credentials
- `MINIO_KB_BUCKET` — MinIO bucket for KB documents (default: `kb-documents`)
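
A rough sketch of how these variables could map onto the settings fields named in this summary. The `KBSettings` dataclass and `load_settings` helper are hypothetical (the project's own `shared/config.py` is not shown here), and the empty-string defaults for the API keys are an assumption; only the `kb-documents` default comes from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class KBSettings:
    """Illustrative subset of the shared settings added in this plan."""

    brave_api_key: str = ""          # assumed default
    firecrawl_api_key: str = ""      # assumed default
    google_client_id: str = ""       # assumed default
    google_client_secret: str = ""   # assumed default
    minio_kb_bucket: str = "kb-documents"


def load_settings(env: Mapping[str, str] = os.environ) -> KBSettings:
    """Build settings from an environment mapping (defaults to the process env)."""
    return KBSettings(
        brave_api_key=env.get("BRAVE_API_KEY", ""),
        firecrawl_api_key=env.get("FIRECRAWL_API_KEY", ""),
        google_client_id=env.get("GOOGLE_CLIENT_ID", ""),
        google_client_secret=env.get("GOOGLE_CLIENT_SECRET", ""),
        minio_kb_bucket=env.get("MINIO_KB_BUCKET", "kb-documents"),
    )
```

Passing the mapping explicitly keeps the loader testable without mutating `os.environ`.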

## Next Phase Readiness

- KB ingestion pipeline is implemented and covered by unit tests
- kb_search tool already wired to query kb_chunks via pgvector (existing from Phase 2)
- Executor now injects tenant context, so context-aware tools (kb_search, calendar) receive the tenant_id/agent_id they need
- Ready for 10-02 (calendar tool) and 10-03 (any remaining agent capability work)
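
The executor injection mentioned above can be sketched like this. It is a simplified, synchronous illustration: `validate_args` is a tiny stand-in for real schema validation, and `execute_tool` and `kb_search_stub` are hypothetical names, not the project's executor API. The point it shows is the ordering: validate the model-supplied arguments first, then add the context kwargs.

```python
import uuid
from typing import Any, Callable


def validate_args(args: dict, schema: dict) -> None:
    """Minimal stand-in for schema validation: required keys, no unknown keys."""
    properties = schema.get("properties", {})
    missing = [key for key in schema.get("required", []) if key not in args]
    unknown = [key for key in args if key not in properties]
    if missing or unknown:
        raise ValueError(f"missing={missing} unknown={unknown}")


def execute_tool(
    handler: Callable[..., Any],
    schema: dict,
    args: dict,
    *,
    tenant_id: Any,
    agent_id: Any,
) -> Any:
    # 1. Validate only what the model supplied. tenant_id/agent_id are not in
    #    the tool schema, so injecting them earlier would fail validation.
    validate_args(args, schema)
    # 2. Inject the context as strings, then call the handler.
    kwargs = {**args, "tenant_id": str(tenant_id), "agent_id": str(agent_id)}
    return handler(**kwargs)


KB_SEARCH_SCHEMA = {"properties": {"query": {"type": "string"}}, "required": ["query"]}


def kb_search_stub(query: str, tenant_id: str, agent_id: str) -> str:
    return f"searching {query!r} for tenant {tenant_id}"
```

Because the context is added after validation, the tool schemas stay free of fields the model should never be asked to supply.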

## Self-Check: PASSED

All files found on disk. All commits verified in git log.

---
*Phase: 10-agent-capabilities*
*Completed: 2026-03-26*
120
.planning/phases/10-agent-capabilities/10-02-SUMMARY.md
Normal file
@@ -0,0 +1,120 @@
---
phase: 10-agent-capabilities
plan: "02"
subsystem: agent-capabilities
tags: [calendar, oauth, google, tools, cap-05, cap-06]
dependency_graph:
  requires: [10-01]
  provides: [CAP-05, CAP-06]
  affects: [orchestrator, gateway, shared-api]
tech_stack:
  added: [google-auth, google-api-python-client]
  patterns: [per-tenant-oauth, token-refresh-writeback, natural-language-tool-results]
key_files:
  created:
    - packages/shared/shared/api/calendar_auth.py
    - tests/unit/test_calendar_auth.py
    - tests/unit/test_calendar_lookup.py
    - migrations/versions/013_google_calendar_channel.py
  modified:
    - packages/orchestrator/orchestrator/tools/builtins/calendar_lookup.py
    - packages/orchestrator/orchestrator/tools/registry.py
    - packages/orchestrator/orchestrator/agents/builder.py
    - packages/shared/shared/api/__init__.py
    - packages/gateway/gateway/main.py
decisions:
  - "calendar_lookup receives _session param for test injection — production obtains session from async_session_factory"
  - "Token write-back is non-fatal: refresh failure logged but API result still returned"
  - "requires_confirmation=False for calendar CRUD — user intent (asking agent to book) is the confirmation"
  - "build() imported at module level for patchability in tests (try/except ImportError handles missing dep)"
  - "Tool result formatting instruction added to build_system_prompt when agent has tool_assignments (CAP-06)"
metrics:
  duration: ~10m
  completed: "2026-03-26"
  tasks: 2
  files: 9
---

# Phase 10 Plan 02: Google Calendar OAuth and Calendar Tool CRUD Summary

Per-tenant Google Calendar OAuth install/callback with encrypted token storage, a calendar tool with list, check_availability, and create actions that replaces the service account stub, and natural language tool result formatting (CAP-05, CAP-06).

## Tasks Completed

### Task 1: Google Calendar OAuth endpoints and calendar tool replacement (TDD)

**Files created/modified:**
- `packages/shared/shared/api/calendar_auth.py` — OAuth install/callback/status endpoints
- `packages/orchestrator/orchestrator/tools/builtins/calendar_lookup.py` — Per-tenant OAuth calendar tool
- `migrations/versions/013_google_calendar_channel.py` — Add google_calendar to CHECK constraint
- `tests/unit/test_calendar_auth.py` — 6 tests for OAuth endpoints
- `tests/unit/test_calendar_lookup.py` — 10 tests for calendar tool

**Commit:** `08572fc`

What was built:
- `calendar_auth_router` at `/api/portal/calendar` with 3 endpoints:
  - `GET /install?tenant_id=` — generates HMAC-signed state, returns Google OAuth URL with offline/consent
  - `GET /callback?code=&state=` — verifies HMAC state, exchanges code for tokens, upserts ChannelConnection
  - `GET /{tenant_id}/status` — returns `{"connected": bool}`
- `calendar_lookup.py` fully replaced — no more `GOOGLE_SERVICE_ACCOUNT_KEY` dependency:
  - `action="list"` — fetches events for date, formats as `- HH:MM: Event title`
  - `action="check_availability"` — lists busy slots or "entire day is free"
  - `action="create"` — creates event with summary/start/end, returns confirmation
  - Token auto-refresh: google-auth refreshes expired access tokens, updated token written back to DB
  - Returns informative messages for missing tenant_id, no connection, and errors
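
The HMAC-signed `state` used by the install and callback endpoints above could be built along these lines. This is a sketch under assumptions: the payload layout, the nonce, the `body.signature` encoding, and the function names are all hypothetical, and the secret would come from platform configuration rather than a literal.

```python
import base64
import hashlib
import hmac
import json
import secrets


def sign_state(tenant_id: str, secret: bytes) -> str:
    """Return an opaque OAuth `state` value that carries the tenant_id."""
    payload = json.dumps({"tenant_id": tenant_id, "nonce": secrets.token_urlsafe(16)})
    body = base64.urlsafe_b64encode(payload.encode()).decode()
    signature = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{signature}"


def verify_state(state: str, secret: bytes) -> str:
    """Check the signature and return the tenant_id, or raise ValueError."""
    body, sep, signature = state.rpartition(".")
    if not sep:
        raise ValueError("malformed state")
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        raise ValueError("state signature mismatch")
    return json.loads(base64.urlsafe_b64decode(body))["tenant_id"]
```

Signing the state lets the callback trust the `tenant_id` it receives from Google's redirect without keeping server-side session data between the two requests.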

### Task 2: Mount new API routers and update tool schema + prompt builder

**Files modified:**
- `packages/shared/shared/api/__init__.py` — export `kb_router` and `calendar_auth_router`
- `packages/gateway/gateway/main.py` — mount kb_router and calendar_auth_router
- `packages/orchestrator/orchestrator/tools/registry.py` — updated calendar_lookup schema with CRUD params
- `packages/orchestrator/orchestrator/agents/builder.py` — add tool result formatting instruction (CAP-06)

**Commit:** `a64634f`

What was done:
- KB and Calendar Auth routers mounted on gateway under Phase 10 section
- calendar_lookup schema updated: `action` (enum), `event_summary`, `event_start`, `event_end` added
  - `required` updated to `["date", "action"]`
- `build_system_prompt()` now appends "Never show raw data or JSON to user" when agent has tool_assignments
- Confirmed CAP-04 (http_request): already present in the registry and working, no changes needed
- Confirmed CAP-07 (audit logging): executor.py calls `audit_logger.log_tool_call()` on every tool invocation
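
For orientation, the updated `calendar_lookup` parameters might look roughly like the JSON-Schema-style dict below. The field names, the enum values, and the `required` list come from this summary; the types, descriptions, and the `missing_fields` helper are assumptions added for illustration.

```python
# Hypothetical shape of the calendar_lookup parameter schema.
CALENDAR_LOOKUP_SCHEMA = {
    "type": "object",
    "properties": {
        "date": {"type": "string", "description": "Day to query (assumed ISO date)"},
        "action": {
            "type": "string",
            "enum": ["list", "check_availability", "create"],
        },
        "event_summary": {"type": "string"},
        "event_start": {"type": "string", "description": "Assumed ISO datetime"},
        "event_end": {"type": "string", "description": "Assumed ISO datetime"},
    },
    "required": ["date", "action"],
}


def missing_fields(args: dict, schema: dict = CALENDAR_LOOKUP_SCHEMA) -> list[str]:
    """Return required parameters that the caller did not supply."""
    return [name for name in schema["required"] if name not in args]
```

Keeping the event fields optional at the schema level means `list` and `check_availability` calls stay short, while a `create` handler can check for them itself.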

## Deviations from Plan

### Auto-fixed Issues

**1. [Rule 2 - Missing functionality] Module-level imports for patchability**
- **Found during:** Task 1 TDD GREEN phase
- **Issue:** `KeyEncryptionService` and `googleapiclient.build` imported lazily (inside function), making them unpatchable in tests with standard `patch()` calls
- **Fix:** Added module-level imports with try/except ImportError guard for the google library optional dep; `settings` and `KeyEncryptionService` imported at module level
- **Files modified:** `packages/orchestrator/orchestrator/tools/builtins/calendar_lookup.py`
- **Commit:** `08572fc`
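
The guarded module-level import from this fix might look like the sketch below. `googleapiclient.discovery.build` is the real entry point of google-api-python-client; the `get_calendar_service` wrapper and its error message are hypothetical.

```python
try:
    # Module-level import gives tests a stable name to patch on this module.
    from googleapiclient.discovery import build
except ImportError:  # optional dependency not installed
    build = None


def get_calendar_service(credentials):
    """Create a Calendar v3 client, failing clearly if the library is absent."""
    if build is None:
        raise RuntimeError("google-api-python-client is not installed")
    return build("calendar", "v3", credentials=credentials)
```

Because `build` is now a module attribute, a test can replace it with a standard `patch("<module path>.build")` call, which was not possible while the import lived inside the function body.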

**2. [Rule 1 - Bug] Test patched non-existent module attribute**
- **Found during:** Task 1 TDD GREEN phase
- **Issue:** Tests patched `get_async_session` and `KeyEncryptionService` before those names existed at module level; tests also needed `settings` patched to bypass `platform_encryption_key` check
- **Fix:** Updated tests to pass `_session` directly (no need to patch `get_async_session`), extracted `_make_mock_settings()` helper, added `patch(_PATCH_SETTINGS)` to all action tests
- **Files modified:** `tests/unit/test_calendar_lookup.py`
- **Commit:** `08572fc`

**3. [Already done] google_client_id/secret in Settings and GOOGLE_CALENDAR in ChannelTypeEnum**
- These were already committed in plan 10-01 — no action needed for this plan

## Requirements Satisfied

- **CAP-05:** Calendar availability checking and event creation — per-tenant OAuth, list/check_availability/create actions
- **CAP-06:** Natural language tool results — formatting instruction added to system prompt; calendar_lookup returns human-readable strings, not raw JSON
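
As an illustration of the CAP-06 output style, the `- HH:MM: Event title` listing could be produced as follows. The event dicts here are simplified (a flat ISO `start` string and a `summary`), which is an assumption; real Google Calendar event payloads are nested differently, and `format_events` is a hypothetical helper name.

```python
from datetime import datetime


def format_events(events: list[dict]) -> str:
    """Render events as human-readable lines instead of raw JSON."""
    if not events:
        return "No events scheduled."
    lines = []
    # ISO-8601 strings in one consistent format sort chronologically as text.
    for event in sorted(events, key=lambda e: e["start"]):
        start = datetime.fromisoformat(event["start"])
        title = event.get("summary", "(no title)")
        lines.append(f"- {start:%H:%M}: {title}")
    return "\n".join(lines)
```

Returning a ready-to-read string from the tool leaves the model little reason to echo raw API data back to the user.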

## Self-Check: PASSED

All files verified:
- FOUND: packages/shared/shared/api/calendar_auth.py
- FOUND: packages/orchestrator/orchestrator/tools/builtins/calendar_lookup.py
- FOUND: migrations/versions/013_google_calendar_channel.py
- FOUND: tests/unit/test_calendar_auth.py
- FOUND: tests/unit/test_calendar_lookup.py
- FOUND: commit 08572fc (Task 1)
- FOUND: commit a64634f (Task 2)