docs: complete project research

2026-03-22 00:12:58 -06:00
parent 320da9df87
commit 376982f16f
5 changed files with 1655 additions and 0 deletions
--- a/.planning/research/SUMMARY.md
+++ b/.planning/research/SUMMARY.md
@@ -0,0 +1,264 @@
+# Project Research Summary
+
+**Project:** Konstruct
+**Domain:** Channel-native AI workforce platform (multi-tenant SaaS)
+**Researched:** 2026-03-22
+**Confidence:** HIGH
+
+## Executive Summary
+
+Konstruct is building a novel but technically achievable product: AI employees that live natively inside messaging channels (Slack, WhatsApp) rather than behind a separate dashboard UI. Research confirms this channel-native positioning is a genuine market gap — every major competitor (Lindy, Relevance AI, Sintra, Agentforce) requires a separate UI, which forces behavior change and limits adoption. The recommended build approach is a microservices-in-monorepo architecture using FastAPI + Celery + PostgreSQL (with RLS) as the backbone, with LiteLLM as the universal LLM abstraction layer. This stack is mature, well-documented, and directly suited to the multi-tenant, high-concurrency messaging workload.
+
+The single most important architectural decision is the immediate-acknowledge, async-process pattern: Slack requires an HTTP 200 within 3 seconds, and LLM calls take 5-30+ seconds. The Channel Gateway must acknowledge immediately and delegate all LLM work to Celery workers. Getting this wrong causes event retry storms and Slack flagging the integration as unreliable. Tenant isolation is the second non-negotiable: PostgreSQL RLS must be enforced with `FORCE ROW LEVEL SECURITY` on every table, Redis keys must be namespaced per tenant, and vector searches must always include a `WHERE tenant_id = $1` filter. These are architectural decisions that cannot be retrofitted — they must be built correctly in Phase 1 before any agent feature work begins.
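+
+The immediate-acknowledge pattern can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Konstruct's gateway: `handle_slack_event`, `slow_llm_call`, and the in-process `jobs` queue are hypothetical stand-ins, and a real deployment would enqueue to Celery rather than to a thread.
+
+```python
+import queue
+import threading
+import time
+
+# Hypothetical in-process stand-in for the Celery queue.
+jobs: queue.Queue = queue.Queue()
+replies: list = []
+
+
+def handle_slack_event(event: dict) -> int:
+    """Gateway handler: enqueue the event and acknowledge right away."""
+    jobs.put(event)
+    return 200  # returned before any LLM work starts
+
+
+def slow_llm_call(text: str) -> str:
+    """Placeholder for an LLM call that would take 5-30+ seconds."""
+    time.sleep(0.2)
+    return f"echo: {text}"
+
+
+def worker() -> None:
+    """Background worker: drains the queue until it sees the None sentinel."""
+    while True:
+        event = jobs.get()
+        if event is None:
+            jobs.task_done()
+            break
+        replies.append(slow_llm_call(event["text"]))
+        jobs.task_done()
+```
+
+The handler's latency does not depend on how long the model takes, which is the property the 3-second Slack deadline relies on.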
+
+The key risk for Konstruct is not technical but strategic: the roadmap is ambitious, and it is tempting to build multi-agent teams, voice channels, and a marketplace before validating that SMBs will pay for a single reliable channel-native AI employee. Research strongly recommends validating the core thesis (one AI employee, Slack + WhatsApp, billing) with at least 20 paying tenants before expanding scope. WhatsApp carries a separate compliance risk: Meta's January 2026 policy bans general-purpose chatbots on the Business API, each tenant needs its own provisioned phone number, and Business Verification takes 1-6 weeks. Verification must therefore be initiated in Phase 1 even though WhatsApp goes live in Phase 2.
+
+---
+
+## Key Findings
+
+### Recommended Stack
+
+The stack specified in CLAUDE.md is well-chosen and verified against current package versions. One update from research: Next.js 16 (not 14) is the current stable release as of March 2026 — starting on 14 would immediately be two major versions behind. Auth.js v5 (not Keycloak) is recommended for v1 portal auth — Keycloak is correct for enterprise SSO needs but massively over-engineered for a beta with a small tenant count. For the agent framework, the research recommendation is to build a custom orchestrator for v1 and evaluate LangGraph seriously for v2 multi-agent teams; both LangGraph and CrewAI add abstraction overhead that constrains the agent model before requirements are clear.
+
+See `/home/adelorenzo/repos/konstruct/.planning/research/STACK.md` for full version matrix.
+
+**Core technologies:**
+- **FastAPI 0.135.1**: API framework — async-native, OpenAPI docs, dependency injection; de facto standard for async Python APIs
+- **Pydantic v2 (2.12.5)**: Data validation — 20x faster than v1 (Rust core); mandatory for all internal message models and API boundaries
+- **SQLAlchemy 2.0 + asyncpg**: ORM + PostgreSQL driver — true async support; use `AsyncSession` exclusively, never legacy 1.x patterns
+- **PostgreSQL 16 + pgvector**: Primary DB + vector store — RLS for multi-tenancy; pgvector for agent memory without a separate service; HNSW indexes required from day one
+- **Redis 7.x**: Cache, pub/sub, rate limiting, Celery broker — all purposes consolidated into one service; namespace all keys by tenant
+- **LiteLLM 1.82.5**: LLM gateway — unified API across Ollama, Anthropic, OpenAI; load balancing, fallback, cost tracking; all LLM calls route through this, never directly to providers
+- **Celery 5.6.2**: Background job queue — all LLM calls, tool execution, and webhook follow-up messages must be dispatched here, not run inline
+- **slack-bolt 1.27.0**: Slack integration — use Events API (HTTP mode) in production; Socket Mode is for local dev only
+- **Next.js 16**: Admin portal — App Router, shadcn/ui, TanStack Query v5, Auth.js v5
+- **uv + ruff + mypy**: Python toolchain — uv workspaces for monorepo, ruff replaces flake8/black/isort, mypy --strict required
+
+### Expected Features
+
+See `/home/adelorenzo/repos/konstruct/.planning/research/FEATURES.md` for full feature analysis with dependency graph.
+
+**Must have (table stakes — v1 beta):**
+- Natural language conversation in-channel (Slack + WhatsApp) — core product promise
+- Persistent conversational memory (sliding window + pgvector long-term) — goldfish agents churn
+- Human escalation/handoff with full context transfer — required for trust and WhatsApp ToS compliance
+- Single AI employee per tenant: configurable role, persona, tools — proves the core thesis
+- Tool framework with registry + sandboxed execution (minimum 2-3 built-in tools) — agents must DO things
+- Multi-tenant PostgreSQL RLS isolation — table stakes for accepting multiple real customers
+- Admin portal: tenant onboarding, agent config, channel connection wizard — operators need UI, not config files
+- Stripe subscription billing — no billing = no product
+- Rate limiting per tenant + per channel — platform protection
+- Audit log for all agent actions — debugging, trust-building, compliance foundation
+- Agent-level cost tracking — SMB operators need cost predictability
+
+**Should have (competitive differentiators — v1.x after validation):**
+- True channel-native presence (agents live IN the channel) — the primary differentiator; architecture is built for this
+- BYO API key support — validated demand from privacy-conscious customers
+- Cross-channel agent identity (same agent on Slack + WhatsApp, unified memory) — architectural decision must be made correctly in v1 even if feature ships in v1.x
+- Sentiment-based auto-escalation — requires real conversation volume to tune
+- Additional channels: Mattermost, Telegram, Microsoft Teams
+
+**Defer (v2+):**
+- Multi-agent coordinator + specialist teams — complex; single-agent must be proven first
+- AI company hierarchy (teams of teams)
+- Self-hosted deployment (Helm chart)
+- Schema-per-tenant isolation (Team tier upgrade from RLS)
+- Agent marketplace / pre-built role templates
+- White-labeling for agencies
+- Voice/telephony channels — completely different stack
+
+**Anti-features to avoid entirely:**
+- General-purpose chatbots on WhatsApp — Meta banned this effective January 2026; risks account suspension
+- Streaming token output — Slack/WhatsApp don't support partial message streaming; adds complexity for zero user benefit
+- Cross-tenant agent communication — security violation, compliance liability
+- Dashboard-first UX for end-users — defeats the channel-native value proposition
+
+### Architecture Approach
+
+The architecture follows a strict four-layer pipeline: Channel Gateway (ingress + normalization) → Message Router (tenant resolution + rate limiting) → Agent Orchestrator (Celery workers) → LLM Backend Pool (LiteLLM). The Channel Gateway is intentionally thin — it verifies webhook signatures, normalizes messages to the `KonstructMessage` format, and enqueues to Celery. No business logic lives in the gateway. This separation is what enables the 3-second Slack acknowledgment requirement and allows each layer to scale independently. Agent memory uses a two-tier approach: Redis sliding window for short-term context (last ~20 messages) and pgvector for semantic retrieval of long-term history, with a background Celery task flushing Redis state to pgvector asynchronously.
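+
+The two-tier memory flow described above can be sketched as follows. This is an illustrative model only: `AgentMemory` is a hypothetical name, a `deque` stands in for the Redis sliding window, and a plain list stands in for the pgvector store that the background task would write to.
+
+```python
+from collections import deque
+
+
+class AgentMemory:
+    """Short-term sliding window plus a buffer awaiting long-term flush."""
+
+    def __init__(self, window_size: int = 20) -> None:
+        self.window_size = window_size
+        self.short_term: deque = deque()
+        self.pending_flush: list = []
+
+    def add(self, message: str) -> None:
+        """Append a message; anything pushed out of the window is queued for flushing."""
+        self.short_term.append(message)
+        while len(self.short_term) > self.window_size:
+            self.pending_flush.append(self.short_term.popleft())
+
+    def flush(self, long_term: list) -> int:
+        """Move evicted messages to long-term storage and return how many moved."""
+        flushed = len(self.pending_flush)
+        long_term.extend(self.pending_flush)
+        self.pending_flush.clear()
+        return flushed
+```
+
+Keeping eviction (`add`) separate from persistence (`flush`) mirrors the asynchronous flush: the request path only touches the window, and a background task drains the buffer.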
+
+See `/home/adelorenzo/repos/konstruct/.planning/research/ARCHITECTURE.md` for full component breakdown, data flows, and anti-patterns.
+
+**Major components:**
+1. **Channel Gateway** — Verify signatures, normalize to KonstructMessage, return HTTP 200 within 3s, enqueue to Celery; strictly stateless
+2. **Message Router** — Tenant resolution (channel org → tenant_id), Redis rate limiting, idempotency check, context loading
+3. **Agent Orchestrator (Celery workers)** — Persona + memory + tool assembly, LLM call dispatch, tool execution, response routing back to channel
+4. **LLM Backend Pool** — LiteLLM router exposing a single `/complete` endpoint; handles provider selection, fallback, cost tracking; orchestrator never calls providers directly
+5. **Tool Executor** — Tool registry (name → handler), schema-validated execution, per-tool authorization enforcement, audit logging
+6. **Memory Layer** — Redis sliding window (short-term) + pgvector HNSW semantic search (long-term)
+7. **Admin Portal** — Next.js 16 operator dashboard; reads/writes through authenticated FastAPI REST API only, never direct DB access
+8. **Billing Service** — Stripe webhook handler; updates tenant subscription state and enforces feature limits
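+
+The signature check in the Channel Gateway can be sketched for Slack, which signs each request with HMAC-SHA256 over `v0:{timestamp}:{body}` using the app's signing secret. The function name, parameters, and the 300-second replay window below are assumptions chosen for illustration.
+
+```python
+import hashlib
+import hmac
+import time
+
+
+def verify_slack_signature(signing_secret, timestamp, body, signature,
+                           now=None, max_age_s=300):
+    """Return True if the signature matches and the timestamp is recent."""
+    now = time.time() if now is None else now
+    try:
+        ts = int(timestamp)
+    except ValueError:
+        return False
+    # Reject stale requests to limit replay attacks.
+    if abs(now - ts) > max_age_s:
+        return False
+    base = b"v0:" + timestamp.encode() + b":" + body
+    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
+    expected = "v0=" + digest
+    # Constant-time comparison avoids leaking how many characters matched.
+    return hmac.compare_digest(expected.encode(), signature.encode())
+```
+
+The check needs the raw request bytes, so it has to run before any JSON parsing or normalization to `KonstructMessage`.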
+
+**Build order is dependency-constrained (steps 1-6 must be sequential):**
+1. Shared models + DB schema
+2. PostgreSQL + Redis + Docker Compose
+3. Channel Gateway (Slack only)
+4. Message Router
+5. LLM Backend Pool
+6. Agent Orchestrator (single agent, no tools)
+7. Memory Layer
+8. Tool Framework
+9. WhatsApp adapter
+10. Admin Portal
+11. Billing integration
+
+### Critical Pitfalls
+
+See `/home/adelorenzo/repos/konstruct/.planning/research/PITFALLS.md` for full detail, recovery strategies, and "looks done but isn't" checklist.
+
+1. **Cross-tenant data leakage** — Apply both `ENABLE ROW LEVEL SECURITY` and `FORCE ROW LEVEL SECURITY` to every table (creating policies alone is not enough), always include `WHERE tenant_id = $1` in pgvector queries, namespace all Redis keys as `{tenant_id}:{key_type}:{resource_id}`, and write integration tests with two-tenant fixtures that verify no cross-tenant access path exists. This cannot be retrofitted, so build it in Phase 1 before any data is written.
+
+2. **WhatsApp Business Account suspension** — Provision one phone number per tenant (not shared), enforce opt-in verification before activating WhatsApp, apply for Business Verification in Phase 1 (1-6 week approval timeline), monitor quality rating daily. One tenant's bad behavior on a shared number suspends all tenants.
+
+3. **LiteLLM request log table degradation** — LiteLLM logs every request to PostgreSQL; at 100k req/day the table reaches a performance-impacting size (~1M rows) in ~10 days. Implement a daily Celery beat rotation job from day one. Set `LITELLM_LOG=ERROR` in production. Do not use LiteLLM's built-in caching layer (documented 10+ second cache-hit latency bug). Pin to a tested version.
+
+4. **Celery + FastAPI async/event loop conflict** — Celery tasks must be synchronous `def` functions, not `async def`. Writing tasks as async causes silent hangs or `RuntimeError: This event loop is already running`. Establish the correct pattern in Phase 1 scaffolding so all subsequent tasks follow by example.
+
+5. **PostgreSQL RLS bypassed by superuser connections** — Superusers and roles with `BYPASSRLS` always bypass RLS, and table owners bypass it unless `FORCE ROW LEVEL SECURITY` is set on the table. The application must therefore connect as a dedicated limited role (not `postgres`). This is a silent failure: tests pass while isolation provides zero protection.
+
+6. **Context rot** — Agent answer quality degrades after ~20-40 conversation turns when full history is dumped into context. Implement sliding window + summarization in Phase 2, and design the Phase 1 data model so that it can store summaries from day one.
+
+7. **Prompt injection through tool arguments** — Enforce authorization and schema validation at the tool execution layer, not just at agent configuration. Every tool call must validate: does this tenant's agent have permission to call this tool with these arguments? Treat LLM output as untrusted input to the tool executor.
+
+8. **Over-building before validation** — The most common AI SaaS failure mode. Do not add v2 features to Phase 1 scope. Define specific validation signals before Phase 1 starts. Resist expanding scope until 20+ paying tenants validate the channel-native thesis.
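+
+The Redis namespacing rule from pitfall 1 is easiest to enforce when keys can only be built through one helper. A minimal sketch, assuming a hypothetical `tenant_key` function and the `{tenant_id}:{key_type}:{resource_id}` layout given above:
+
+```python
+def tenant_key(tenant_id: str, key_type: str, resource_id: str) -> str:
+    """Build a tenant-scoped Redis key, rejecting parts that could blur the namespace."""
+    for part in (tenant_id, key_type, resource_id):
+        # An empty part or an embedded ':' would let one tenant's key
+        # collide with, or be mistaken for, another tenant's key.
+        if not part or ":" in part:
+            raise ValueError(f"invalid key part: {part!r}")
+    return f"{tenant_id}:{key_type}:{resource_id}"
+```
+
+For example, `tenant_key("t1", "ratelimit", "slack")` yields `t1:ratelimit:slack`, while a tenant ID containing `:` raises instead of silently producing an ambiguous key.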
+
+---
+
+## Implications for Roadmap
+
+### Phase 1: Foundation and Tenant Safety
+
+**Rationale:** The dependency graph is unambiguous — shared models, database schema, and tenant isolation must exist before any agent work begins. These decisions cannot be retrofitted. Five of the eight critical pitfalls are "Phase 1" issues. Getting isolation wrong in Phase 1 means a security incident in Phase 2.
+
+**Delivers:** A working end-to-end message flow (Slack → LLM response → Slack reply) with proper multi-tenant isolation, rate limiting, and the correct async processing pattern.
+
+**Addresses:** Multi-tenant isolation, rate limiting, audit logging, LiteLLM backend pool, single agent per tenant (no tools yet), Slack integration, basic agent configuration
+
+**Avoids:**
+- Cross-tenant data leakage (RLS + FORCE, Redis namespacing, pgvector tenant filters)
+- RLS bypass via superuser (create application DB role from day one)
+- Celery async event loop conflict (establish sync task pattern in scaffolding)
+- LiteLLM log table degradation (rotation job from day one)
+- WhatsApp suspension (apply for Business Verification now, even though WhatsApp activates in Phase 2)
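+
+The sync task pattern can be sketched without Celery itself: the task body is a plain `def` that owns a fresh event loop for the duration of one call. `call_llm` and `process_message` are hypothetical names, and in a real worker `process_message` would additionally carry the Celery task decorator.
+
+```python
+import asyncio
+
+
+async def call_llm(prompt: str) -> str:
+    """Placeholder for an async HTTP call to the LLM backend."""
+    await asyncio.sleep(0)
+    return f"reply to: {prompt}"
+
+
+def process_message(prompt: str) -> str:
+    """Synchronous task body: runs the coroutine on a new event loop, then closes it."""
+    return asyncio.run(call_llm(prompt))
+```
+
+Because `asyncio.run` creates and tears down its own loop, the task never depends on a loop already running in the worker process.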
+
+**Key deliverable:** A Slack message triggers an LLM response delivered back to the thread. Tenant A cannot see Tenant B's data. Verified by integration tests.
+
+**Research flag:** No additional research needed — all patterns are well-documented. Use the build order from ARCHITECTURE.md (steps 1-6).
+
+---
+
+### Phase 2: Feature Completeness
+
+**Rationale:** Once the end-to-end pipeline works with a single agent, add the features that make it a real product: conversational memory, tools, WhatsApp, and the operator-facing admin portal. These have internal dependencies — the DB schema must be stable (after memory and tools define their data models) before the portal is built.
+
+**Delivers:** A deployable beta with Slack + WhatsApp channels, persistent agent memory, a tool framework with 2-3 built-in tools, human escalation/handoff, the admin portal, and Stripe billing.
+
+**Addresses:** Conversational memory (sliding window + pgvector), tool framework (registry + execution + 2-3 built-in tools), WhatsApp integration, human escalation/handoff, admin portal (tenant onboarding, agent config, channel connection wizard), Stripe subscription billing, agent-level cost tracking, structured onboarding flow
+
+**Avoids:**
+- Context rot (implement sliding window + summarization, test at turn 30+)
+- Prompt injection (schema-validate all tool arguments at execution layer)
+- WhatsApp suspension (per-tenant phone numbers, opt-in enforcement, quality monitoring)
+- Agent going silent on errors (every tool failure must produce a user-visible fallback message)
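+
+The execution-layer checks above (per-tenant authorization, schema validation, and a user-visible fallback on any failure) can be sketched as one function. All names here (`Tool`, `execute_tool`, `FALLBACK`) and the simple name-to-type schema are assumptions for illustration, not the project's tool framework.
+
+```python
+from dataclasses import dataclass
+from typing import Callable
+
+FALLBACK = "Sorry, I couldn't complete that action."
+
+
+@dataclass(frozen=True)
+class Tool:
+    handler: Callable[..., str]
+    schema: dict  # argument name -> required Python type
+
+
+def execute_tool(registry, allowed, tenant_id, name, args):
+    """Run a tool call proposed by the LLM, treating name and args as untrusted."""
+    tool = registry.get(name)
+    # Authorization: the tool must exist and be enabled for this tenant.
+    if tool is None or name not in allowed.get(tenant_id, set()):
+        return FALLBACK
+    # Schema validation: exact argument names and expected types only.
+    if set(args) != set(tool.schema):
+        return FALLBACK
+    if not all(isinstance(args[k], t) for k, t in tool.schema.items()):
+        return FALLBACK
+    try:
+        return tool.handler(**args)
+    except Exception:
+        # Never go silent: any tool failure becomes a user-visible message.
+        return FALLBACK
+```
+
+Every rejection path returns the same fallback text, so the user always gets a reply and the model learns nothing about why a call was refused.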
+
+**Key deliverable:** An operator can onboard via the portal, connect Slack and WhatsApp, configure an AI employee, and paying customers interact with it through both channels.
+
+**Research flag:** Tool framework execution security and WhatsApp opt-in enforcement design may benefit from `/gsd:research-phase` — specifically the sandboxing approach and Meta opt-in verification requirements.
+
+---
+
+### Phase 3: Polish and Launch
+
+**Rationale:** With a validated beta (20+ paying tenants), polish the experience, add differentiating features validated by real usage, and prepare for public launch. Do not start this phase until beta validation signals are met.
+
+**Delivers:** Additional channels (Mattermost, Telegram, Microsoft Teams), BYO API key support, cross-channel agent identity (unified memory across Slack + WhatsApp), agent analytics dashboard, sentiment-based auto-escalation, self-hosted deployment option (Helm chart + Docker Compose package), public launch.
+
+**Addresses:** Channel expansion (Mattermost, Telegram, Teams), BYO API keys (encrypted with per-tenant KEK), cross-channel agent identity, sentiment-based escalation, pre-built tool integrations (Zendesk, HubSpot, Google Calendar), agent analytics in portal, self-hosted Helm chart
+
+**Avoids:**
+- Scope creep: only add features validated by beta user behavior
+- BYO key security: establish envelope encryption architecture in v1 even if the feature ships in Phase 3
+
+**Key deliverable:** Public launch with proven channel-native thesis, multiple channel options, and self-hosted option for compliance-sensitive customers.
+
+**Research flag:** BYO key encryption architecture (envelope encryption, per-tenant KEK rotation) needs explicit design before implementation — this is a security-critical feature.
+
+---
+
+### Phase 4: Scale and Enterprise
+
+**Rationale:** Post-launch growth requires infrastructure changes that are expensive to retrofit: Kubernetes migration, schema-per-tenant isolation for the Team tier, multi-agent coordinator teams, and enterprise compliance groundwork.
+
+**Delivers:** Kubernetes production deployment, multi-agent coordinator + specialist team pattern, AI company hierarchy (teams of teams), schema-per-tenant isolation for Team tier, agent marketplace / pre-built role templates, SOC 2 preparation, enterprise tier with dedicated isolation.
+
+**Addresses:** All v2+ features from the feature matrix (P3 items)
+
+**Research flag:** Multi-agent coordinator pattern is the most architecturally complex feature in the roadmap. `/gsd:research-phase` strongly recommended — inter-agent communication bus design, shared context store, and delegation audit trail need dedicated research before building.
+
+---
+
+### Phase Ordering Rationale
+
+- **Security-first ordering:** RLS, Redis namespacing, and tenant isolation tests precede all feature work because retrofitting isolation after data exists is high-risk and expensive.
+- **Async pipeline before features:** The Celery async pattern must be established and validated before memory, tools, or portal are built. Retrofitting a broken async pattern into an existing codebase is painful.
+- **DB schema stability gate:** The admin portal and billing integration are explicitly deferred until memory and tools define their data models. This matches the ARCHITECTURE.md build order (steps 7-8 before steps 10-11).
+- **WhatsApp Business Verification timeline:** Applying for verification in Phase 1 accounts for the 1-6 week approval timeline so that WhatsApp can go live in Phase 2 without a blocking wait.
+- **Validate before expanding:** Phase 3 and Phase 4 are explicitly contingent on validation signals from the beta. The scope boundary between Phase 2 (beta) and Phase 3 (launch) is a validation gate, not a calendar date.
+
+### Research Flags
+
+Phases needing deeper research during planning:
+- **Phase 2:** WhatsApp opt-in enforcement implementation and Meta Business Verification process details — the official API for opt-in tracking is not fully documented in the current research.
+- **Phase 2:** Tool sandboxing approach — the research identifies the requirement (sandboxed execution) but does not prescribe a specific sandboxing mechanism (subprocess isolation, container-per-tool, etc.).
+- **Phase 3:** BYO API key envelope encryption architecture — security-critical, needs dedicated design before any implementation.
+- **Phase 4:** Multi-agent coordinator pattern and inter-agent communication bus — the most architecturally novel component; no established playbook exists for SMB-scale multi-agent orchestration.
+
+Phases with well-documented patterns (skip `/gsd:research-phase`):
+- **Phase 1:** All patterns are well-documented in official sources (Slack Events API, PostgreSQL RLS, Celery, LiteLLM). Use the ARCHITECTURE.md build order directly.
+- **Phase 2 (portal + billing):** Next.js App Router + shadcn/ui + Auth.js v5 + Stripe are all well-documented with established patterns.
+
+---
+
+## Confidence Assessment
+
+| Area | Confidence | Notes |
+|------|------------|-------|
+| Stack | HIGH | All versions verified against PyPI and official sources as of March 2026. Auth.js v5 is MEDIUM (official docs, not directly benchmarked). The recommendation to avoid LangGraph for v1 is HIGH. |
+| Features | MEDIUM-HIGH | Table stakes and anti-features are well-validated. Competitor feature analysis is from industry blogs (MEDIUM). WhatsApp 2026 policy constraint is HIGH (verified against Meta official). |
+| Architecture | HIGH | Core patterns (immediate-acknowledge, RLS, LiteLLM router, Celery dispatch) are verified against official Slack docs, LiteLLM docs, PostgreSQL/Crunchy docs. WhatsApp-specific patterns are MEDIUM (community sources). |
+| Pitfalls | HIGH | Cross-verified across official docs, GitHub issues, production post-mortems, and practitioner accounts. LiteLLM production issues are particularly well-evidenced. |
+
+**Overall confidence:** HIGH
+
+### Gaps to Address
+
+- **Agent memory key design:** Research confirms that agent memory must be keyed to `agent_id` (not channel session ID) to support cross-channel identity. The specific data model for this needs to be finalized in Phase 1 schema design, even if cross-channel identity ships in Phase 3.
+- **WhatsApp opt-in verification API:** Research confirms the requirement but does not specify the exact Meta API calls for verifying and recording user opt-in. Validate against Meta's official Business API documentation before Phase 2 implementation.
+- **Tool sandboxing approach:** Research identifies sandboxed execution as a requirement but leaves the specific mechanism unspecified. Options (subprocess, Docker-per-tool, restricted Python execution) need a design decision before Phase 2 tool framework implementation.
+- **Pricing model:** Research flags per-message pricing as a deterrent (Atlassian Rovo case study) in favor of flat per-agent pricing, but this is an open product decision noted in CLAUDE.md. Resolve before billing goes live in Phase 2.
+- **Qdrant migration path:** Research confirms pgvector is sufficient for v1 but will require migration to Qdrant above ~1M embeddings per tenant. The ARCHITECTURE.md already anticipates this. Establish an abstraction layer in Phase 1 that makes this migration non-disruptive (use a repository pattern for vector operations).
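+
+The repository pattern for vector operations can be sketched as a small interface with swappable backends. `VectorStore` and `InMemoryVectorStore` are hypothetical names; the in-memory class only stands in for a pgvector-backed or Qdrant-backed implementation of the same interface.
+
+```python
+import math
+from typing import Protocol
+
+
+class VectorStore(Protocol):
+    """Interface the orchestrator depends on, independent of the backend."""
+
+    def add(self, tenant_id: str, doc_id: str, embedding: list) -> None: ...
+
+    def search(self, tenant_id: str, query: list, k: int) -> list: ...
+
+
+class InMemoryVectorStore:
+    """Toy backend for tests; a real one would issue pgvector or Qdrant queries."""
+
+    def __init__(self) -> None:
+        self._rows: list = []
+
+    def add(self, tenant_id: str, doc_id: str, embedding: list) -> None:
+        self._rows.append((tenant_id, doc_id, embedding))
+
+    def search(self, tenant_id: str, query: list, k: int) -> list:
+        def cosine(a, b):
+            dot = sum(x * y for x, y in zip(a, b))
+            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
+            return dot / norm if norm else 0.0
+
+        # The tenant filter is applied before ranking, never after.
+        scored = [(cosine(query, e), d) for t, d, e in self._rows if t == tenant_id]
+        scored.sort(key=lambda pair: pair[0], reverse=True)
+        return [doc_id for _, doc_id in scored[:k]]
+```
+
+Making `tenant_id` a required argument of every method means a backend swap cannot accidentally drop the tenant filter.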
+
+---
+
+## Sources
+
+### Primary (HIGH confidence)
+- PyPI registry (verified March 2026) — all stack versions
+- [FastAPI official docs](https://fastapi.tiangolo.com/) — async patterns, dependency injection
+- [LiteLLM docs](https://docs.litellm.ai/) — router architecture, multi-tenant, routing strategy
+- [Slack official docs](https://docs.slack.dev/apis/events-api/) — HTTP vs Socket Mode comparison
+- [pgvector GitHub](https://github.com/pgvector/pgvector) — HNSW indexing, production readiness
+- [Crunchy Data: Row Level Security for Tenants](https://www.crunchydata.com/blog/row-level-security-for-tenants-in-postgres) — RLS patterns
+- [Stripe blog: A Framework for Pricing AI Products](https://stripe.com/blog/a-framework-for-pricing-ai-products) — billing model guidance
+- [OWASP LLM01:2025 Prompt Injection](https://genai.owasp.org/llmrisk/llm01-prompt-injection/) — tool injection security
+- [respond.io: WhatsApp general-purpose chatbots ban](https://respond.io/blog/whatsapp-general-purpose-chatbots-ban) — Meta WhatsApp Business API policy, 2026 compliance constraints
+- [Anthropic Engineering: Effective Context Engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) — context rot prevention
+- [HBR: Why Agentic AI Projects Fail](https://hbr.org/2025/10/why-agentic-ai-projects-fail-and-how-to-set-yours-up-for-success) — anti-pattern validation
+
+### Secondary (MEDIUM confidence)
+- [uv workspace docs](https://docs.astral.sh/uv/concepts/projects/workspaces/) — monorepo setup
+- [Auth.js docs](https://authjs.dev/) — v5 App Router compatibility
+- [Redis AI Agent Memory Architecture](https://redis.io/blog/ai-agent-memory-stateful-systems/) — memory patterns
+- [AWS: Multi-Tenant Data Isolation with PostgreSQL RLS](https://aws.amazon.com/blogs/database/multi-tenant-data-isolation-with-postgresql-row-level-security/)
+- [TeamDay.ai: AI Employees Market Map 2026](https://www.teamday.ai/blog/ai-employees-market-map-2026) — competitor analysis
+- [Paperclip.ing](https://paperclip.ing/) — cost tracking model reference
+- DEV Community: LiteLLM production issues — documented performance degradation evidence
+
+### Tertiary (LOW confidence)
+- Community WhatsApp webhook architecture posts — implementation patterns need validation against official Meta docs
+- Multi-tenant AI agent vendor blogs — patterns corroborated by higher-confidence sources
+
+---
+*Research completed: 2026-03-22*
+*Ready for roadmap: yes*