feat(10-01): KB ingestion pipeline - migration, extractors, API router

- Migration 014: add status/error_message/chunk_count to kb_documents, make agent_id nullable
- Add GOOGLE_CALENDAR to ChannelTypeEnum in tenant.py
- Add brave_api_key, firecrawl_api_key, google_client_id/secret, minio_kb_bucket to config
- Add text extractors for PDF, DOCX, PPTX, XLSX/XLS, CSV, TXT, MD
- Add KB management API router with upload, list, delete, URL ingest, reindex endpoints
- Install pypdf, python-docx, python-pptx, openpyxl, pandas, firecrawl-py, youtube-transcript-api
- Update .env.example with new env vars
- Unit tests: test_extractors.py (10 tests) and test_kb_upload.py (7 tests) all pass
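
For illustration only, the extractor bullet above can be pictured as a dispatch on file extension. The sketch below is not the code from this commit: the names `EXTRACTORS`, `extract_plain`, `extract_csv`, and `extract_text` are hypothetical, and it covers only TXT, MD, and CSV using the standard library. The PDF, DOCX, PPTX, and XLSX/XLS paths would presumably go through the libraries installed in this commit and are left out.

```python
import csv
import io
from pathlib import Path
from typing import Callable, Dict


def extract_plain(data: bytes) -> str:
    """Decode TXT/MD bytes as UTF-8, replacing undecodable bytes."""
    return data.decode("utf-8", errors="replace")


def extract_csv(data: bytes) -> str:
    """Flatten CSV rows into one line per row, cells joined by ' | '."""
    reader = csv.reader(io.StringIO(data.decode("utf-8", errors="replace")))
    return "\n".join(" | ".join(row) for row in reader)


# Hypothetical registry: lowercase file suffix -> extractor function.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    ".txt": extract_plain,
    ".md": extract_plain,
    ".csv": extract_csv,
}


def extract_text(filename: str, data: bytes) -> str:
    """Pick an extractor by suffix; raise ValueError for unknown types."""
    suffix = Path(filename).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return extractor(data)
```

Keeping the mapping in one dictionary means a new format only needs one extra entry, and an upload endpoint could turn the `ValueError` into a client error response.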
2026-03-26 09:05:29 -06:00
parent eae4b0324d
commit e8d3e8a108
11 changed files with 1745 additions and 28 deletions


@@ -62,6 +62,21 @@ DEBUG=false
# Tenant rate limits (requests per minute defaults)
DEFAULT_RATE_LIMIT_RPM=60
# -----------------------------------------------------------------------------
# Web Search / Knowledge Base Scraping
# BRAVE_API_KEY: Get from https://brave.com/search/api/
# FIRECRAWL_API_KEY: Get from https://firecrawl.dev
# -----------------------------------------------------------------------------
BRAVE_API_KEY=
FIRECRAWL_API_KEY=
# Google OAuth (Calendar integration)
GOOGLE_CLIENT_ID=
GOOGLE_CLIENT_SECRET=
# MinIO KB bucket (for knowledge base documents)
MINIO_KB_BUCKET=kb-documents
# -----------------------------------------------------------------------------
# Web Push Notifications (VAPID keys)
# Generate with: cd packages/portal && npx web-push generate-vapid-keys