konstruct/.planning/research/STACK.md

# Stack Research

**Domain:** Channel-native AI workforce platform (multi-tenant SaaS)
**Researched:** 2026-03-22
**Confidence:** HIGH (all versions verified against PyPI and official sources)

---

## Recommended Stack

### Core Backend Technologies

| Technology | Version | Purpose | Why Recommended |
|------------|---------|---------|-----------------|
| Python | 3.12+ | Runtime | Specified in CLAUDE.md. Mature async ecosystem, best ML/AI library support. 3.12 is the LTS sweet spot — 3.13 is out but ecosystem support lags. |
| FastAPI | 0.135.1 | API framework | Async-native, automatic OpenAPI docs, built-in dependency injection, excellent for multi-service microservices. The de facto choice for async Python APIs. |
| Pydantic v2 | 2.12.5 | Data validation | Mandatory for FastAPI. v2 is 20x faster than v1 (Rust core). Strict mode enforces type safety at runtime boundaries. Use for all internal message models. |
| SQLAlchemy | 2.0.48 | ORM / query builder | 2.0 is a complete rewrite with true async support. Use `AsyncSession` + `create_async_engine`. The 1.x API is deprecated — do not use legacy patterns. |
| Alembic | 1.18.4 | Database migrations | Standard companion to SQLAlchemy. Requires `env.py` modification for async engine (synchronous migration runner wraps async calls). |
| asyncpg | 0.31.0 | PostgreSQL async driver | Required for SQLAlchemy async support with PostgreSQL. Significantly faster than psycopg2 for high-concurrency workloads. |
| PostgreSQL | 16 | Primary database | Specified in CLAUDE.md. RLS (Row Level Security) is the v1 multi-tenancy mechanism. pgvector extension adds vector search without a separate service. |
| Redis | 7.x | Cache, pub/sub, rate limiting | Session state, per-tenant rate limit counters, pub/sub for real-time event routing. Consider Valkey as a drop-in replacement if Redis license changes concern you. |

### LLM Integration

| Technology | Version | Purpose | Why Recommended |
|------------|---------|---------|-----------------|
| LiteLLM | 1.82.5 | LLM gateway / router | Unified API across 100+ providers (Anthropic, OpenAI, Ollama, vLLM). Built-in load balancing, cost tracking, fallback routing, and virtual keys. Routes to Ollama locally and commercial APIs without code changes. Now at GA maturity with production users at scale. |
| Ollama | latest | Local LLM inference | Dev environment local inference. Serves models via OpenAI-compatible API on port 11434 — LiteLLM proxies to it transparently. |
| pgvector | 0.4.2 (Python client) | Vector search / agent memory | Co-located with PostgreSQL — no separate vector DB service for v1. Supports HNSW indexing (added 0.7.0) for sub-10ms queries at <1M vectors. Extension version 0.8.2 is production-ready and included on all major hosted PostgreSQL services. |

### Messaging Channel SDKs

| Technology | Version | Purpose | Why Recommended |
|------------|---------|---------|-----------------|
| slack-bolt | 1.27.0 | Slack integration | Official Slack SDK. Supports both Events API (webhook) and Socket Mode (WebSocket). Use **Events API mode** in production (requires public HTTPS endpoint) — Socket Mode is for dev only. |
| WhatsApp Business Cloud API | Meta-hosted | WhatsApp integration | No official Python SDK from Meta. Use `httpx` (async HTTP) to call the REST API directly. Webhooks arrive as POST to your FastAPI endpoint. `py-whatsapp-cloudbot` provides lightweight FastAPI helpers but is a thin wrapper — direct httpx is preferred for control. |

### Task Queue

| Technology | Version | Purpose | Why Recommended |
|------------|---------|---------|-----------------|
| Celery | 5.6.2 | Background job processing | Use for LLM inference calls, tool execution, webhook delivery, and anything that shouldn't block the request/response cycle. Celery 5.x is stable and production-proven at scale. Dramatiq is simpler and more reliable per-message, but Celery's ecosystem (Flower monitoring, beat scheduler, chord/chain primitives) is more complete for complex workflows you'll need in v2+. |
| Redis (Celery broker) | 7.x | Celery message broker | Use Redis as both broker and result backend. Redis is already in the stack for other purposes — no additional service needed. |

### Admin Portal (Next.js)

| Technology | Version | Purpose | Why Recommended |
|------------|---------|---------|-----------------|
| Next.js | 16.x (latest stable) | Portal framework | Note: CLAUDE.md specifies 14+, but Next.js 16 is the current stable release as of March 2026. App Router is mature. Use 16 to avoid building on a version that's already behind. Turbopack is now default for faster builds. |
| TypeScript | 5.x | Type safety | Strict mode required (matching CLAUDE.md). |
| Tailwind CSS | 4.x | Styling | shadcn/ui requires Tailwind. v4 dropped JIT (always-on now) and uses CSS-native variables. |
| shadcn/ui | latest | Component library | Copy-to-project component model means no version lock-in. Components are owned code. The standard choice for Next.js admin portals in 2025-2026. Use the CLI to scaffold. |
| TanStack Query | 5.x | Server state management | Handles fetching, caching, and invalidation for API data. Pairs well with App Router — use for client-side mutations and real-time data. |
| React Hook Form + Zod | latest | Form validation | Standard pairing for shadcn/ui forms. Zod schemas can be shared with backend (TypeScript definitions generated from Pydantic if needed). |

### Authentication

| Technology | Version | Purpose | Why Recommended |
|------------|---------|---------|-----------------|
| Auth.js (formerly NextAuth.js) | v5 | Portal authentication | v5 is a complete rewrite compatible with Next.js App Router. Self-hosted, no per-MAU pricing. Supports credential, OAuth, and magic link flows. Database sessions stored in PostgreSQL via adapter. Use over Clerk for cost control and data sovereignty at scale. |
| FastAPI JWT middleware | custom | Backend API auth | Validate JWTs issued by Auth.js in FastAPI middleware. Use `python-jose` or `PyJWT` for token verification. |

### Billing

| Technology | Version | Purpose | Why Recommended |
|------------|---------|---------|-----------------|
| stripe | 14.4.1 | Subscription billing | Industry standard. Python SDK handles webhook signature verification, subscription lifecycle events, and checkout sessions. Idempotent webhook handlers are required — Stripe resends on failure. |

### Development Tools

| Tool | Purpose | Notes |
|------|---------|-------|
| uv | Python package manager and monorepo workspaces | Replaces pip + virtualenv + pip-tools. `uv workspace` supports the monorepo structure in CLAUDE.md. Single shared lockfile across packages. Significantly faster than pip. |
| ruff | Linting + formatting | Replaces flake8, isort, and black in one tool. 100x faster than black. Configure in `pyproject.toml`. Use as both linter and formatter. |
| mypy | Static type checking (strict mode) | Run with `--strict` flag. Mandatory per CLAUDE.md. Slower than Pyright but more accurate for SQLAlchemy and Pydantic type inference. |
| pytest + pytest-asyncio | Testing | Async test support required for FastAPI endpoints. Use `httpx.AsyncClient` as the test client (not the sync TestClient). |
| Docker Compose | Local dev orchestration | All services (PostgreSQL, Redis, Ollama) in compose. FastAPI services run with `uvicorn --reload` outside compose for hot reload. |
| slowapi | FastAPI rate limiting | Redis-backed token bucket rate limiting middleware. Integrates directly with FastAPI. Use for per-tenant and per-channel rate limits. |

---

## Installation

```bash
# Initialize Python monorepo with uv
uv init konstruct
cd konstruct

# Add workspace packages
uv workspace add packages/gateway
uv workspace add packages/router
uv workspace add packages/orchestrator
uv workspace add packages/llm-pool
uv workspace add packages/shared

# Core backend dependencies (per package)
uv add fastapi[standard] pydantic[email] sqlalchemy[asyncio] asyncpg alembic
uv add litellm redis celery[redis] pgvector stripe
uv add slack-bolt python-jose[cryptography] httpx slowapi

# Dev dependencies
uv add --dev ruff mypy pytest pytest-asyncio pytest-httpx

# Portal (Node.js)
cd packages/portal
npx create-next-app@latest . --typescript --tailwind --eslint --app
npx shadcn@latest init
npm install @tanstack/react-query react-hook-form zod next-auth
```

---

## Alternatives Considered

| Recommended | Alternative | When to Use Alternative |
|-------------|-------------|-------------------------|
| Celery | Dramatiq | Dramatiq is the better choice if you want simpler per-message reliability and don't need complex workflow primitives (chords, chains). Switch to Dramatiq if Celery's configuration complexity becomes a team burden in v2. |
| Auth.js v5 | Clerk | Choose Clerk if you need built-in multi-tenant Organizations, passkeys, or faster time-to-market on auth. Tradeoff: per-MAU pricing and vendor lock-in. |
| pgvector | Qdrant | Migrate to Qdrant when vector count exceeds ~1M or when vector search latency under HNSW becomes a bottleneck. The CLAUDE.md already anticipates this upgrade path. |
| Redis | Valkey | Valkey is a Redis fork with a fully open license. Drop-in replacement. Consider if Redis licensing (BSL) becomes a concern. |
| LiteLLM SDK | Direct Anthropic/OpenAI SDK | Use direct SDKs only if you're locked to a single provider with no fallback needs. LiteLLM adds negligible overhead while enabling provider portability. |
| Next.js 16 | Remix | Remix is excellent for form-heavy apps. Next.js wins for the admin portal pattern (server components, strong Vercel ecosystem, shadcn/ui first-class support). |
| httpx (WhatsApp) | whatsapp-cloud-api libraries | None of the community Python WhatsApp SDKs have significant maintenance or production adoption. The Cloud API is a simple REST API — raw httpx with your own models is more maintainable. |

---

## What NOT to Use

| Avoid | Why | Use Instead |
|-------|-----|-------------|
| LangGraph or CrewAI (v1) | Both frameworks add significant abstraction overhead for a single-agent-per-tenant model. LangGraph's graph primitives shine for complex multi-agent stateful orchestration (v2 scenario). In v1, they'd constrain the agent model to their abstractions before requirements are clear. | Custom orchestrator with direct LiteLLM calls. Evaluate LangGraph seriously for v2 multi-agent teams. |
| SQLAlchemy 1.x patterns | The 1.x `session.query()` style and `Session` (sync) are deprecated in 2.0. Mixing sync and async patterns causes subtle bugs in FastAPI async endpoints. | SQLAlchemy 2.0 with `AsyncSession` and `select()` query style exclusively. |
| Socket Mode (Slack) in production | Socket Mode uses a persistent outbound WebSocket — no inbound port needed, but it ties a worker to a long-lived connection. This breaks horizontal scaling. | Events API with a public webhook endpoint. Use Socket Mode only for local dev (bypasses ngrok need during testing). |
| psycopg2 | Synchronous PostgreSQL driver. Blocks the event loop in async FastAPI handlers — kills concurrency. | asyncpg (via SQLAlchemy async engine). |
| Flake8 + Black + isort (separately) | Three tools with overlapping responsibilities, separate configs, and order-of-operation conflicts. The CLAUDE.md already specifies ruff. | ruff, which replaces all three with a single configuration block in pyproject.toml. |
| Flask | Flask is synchronous by default. Adding async support is possible but bolted on. For a platform that processes LLM calls and webhooks concurrently, you need async-native from the start. | FastAPI. |
| Next.js 14 specifically | CLAUDE.md says "14+" but Next.js 16 is the current stable release (March 2026). Starting on 14 means immediately being two major versions behind. | Next.js 16 (latest stable). |
| Keycloak (v1) | Correct for enterprise SSO/SAML needs but massively over-engineered for a v1 beta with a small number of tenants. Adds significant operational complexity. | Auth.js v5 with PostgreSQL session storage. Add Keycloak in v2+ if enterprise SSO is a customer requirement. |

---

## Stack Patterns by Variant

**For Slack Events API webhook handling:**
- Use `slack-bolt` in async mode with FastAPI as the ASGI host
- `AsyncApp` + `AsyncBoltAdapter` for `starlette`
- Mount the bolt app at `/slack/events` in your FastAPI router

**For WhatsApp webhook handling:**
- Expose a GET endpoint for Meta's verification handshake (returns `hub.challenge`)
- Expose a POST endpoint for incoming messages
- Verify `X-Hub-Signature-256` header with `hmac` before processing
- Parse the nested JSON payload manually — no SDK needed

**For tenant context in SQLAlchemy + RLS:**
- Set `app.tenant_id` session variable on each connection before query execution
- Use SQLAlchemy event listeners (`@event.listens_for(engine, "connect")`) or middleware injection
- The `sqlalchemy-tenants` library provides a clean abstraction if hand-rolling this becomes repetitive

**For LLM call patterns:**
- All LLM calls go through LiteLLM proxy — never call provider APIs directly
- LiteLLM handles retries, fallback, and cost tracking
- Dispatch via Celery task so the HTTP response returns immediately
- Stream tokens back to the user via WebSocket or Server-Sent Events for real-time feel

**For Celery + async FastAPI coexistence:**
- Celery workers are synchronous processes — wrap async code with `asyncio.run()` inside task functions
- Alternatively, use `celery[gevent]` for cooperative multitasking in workers
- Do not share the SQLAlchemy `AsyncEngine` between the FastAPI app and Celery workers — create separate engines per process

---

## Version Compatibility

| Package | Compatible With | Notes |
|---------|-----------------|-------|
| FastAPI 0.135.x | Pydantic 2.x | FastAPI 0.100+ requires Pydantic v2. v1 is not supported. |
| SQLAlchemy 2.0.x | asyncpg 0.31.x | Both support PostgreSQL 16. Use `asyncpg` as the dialect driver. |
| Alembic 1.18.x | SQLAlchemy 2.0.x | Compatible. Modify `env.py` to use `run_async_migrations()` pattern for async engine. |
| Celery 5.6.x | Redis 7.x | Celery 5.x uses Redis protocol — compatible with Redis 6+ and Valkey. |
| slack-bolt 1.27.x | Python 3.12 | Fully supported. |
| LiteLLM 1.82.x | Python 3.12 | Fully supported. |
| Next.js 16.x | Auth.js v5 | Auth.js v5 was rewritten specifically for Next.js App Router compatibility. |
| pgvector 0.4.2 (Python) | pgvector 0.8.2 (PostgreSQL extension) | Python client 0.4.x works with extension 0.7.x+. HNSW index requires extension 0.7.0+. |

---

## Sources

- PyPI (verified March 2026): FastAPI 0.135.1, SQLAlchemy 2.0.48, Pydantic 2.12.5, Alembic 1.18.4, asyncpg 0.31.0, Celery 5.6.2, Dramatiq 2.1.0, stripe 14.4.1, pgvector 0.4.2, LiteLLM 1.82.5, slack-bolt 1.27.0
- [FastAPI official docs](https://fastapi.tiangolo.com/) — async patterns, dependency injection
- [LiteLLM docs](https://docs.litellm.ai/) — provider support, routing configuration
- [pgvector GitHub](https://github.com/pgvector/pgvector) — HNSW indexing, production readiness
- [uv workspace docs](https://docs.astral.sh/uv/concepts/projects/workspaces/) — monorepo setup
- [Slack Bolt Python GitHub](https://github.com/slackapi/bolt-python) — Events API vs Socket Mode
- [Auth.js docs](https://authjs.dev/) — v5 App Router compatibility (MEDIUM confidence — not directly fetched)
- [sqlalchemy-tenants](https://github.com/Telemaco019/sqlalchemy-tenants) — RLS + SQLAlchemy integration pattern
- Next.js 16 confirmed as latest stable via npm registry search (March 2026)
- LangGraph 1.0 GA confirmed via community sources (MEDIUM confidence — agent framework recommendation is HIGH confidence to avoid it for v1)

---

*Stack research for: Konstruct — channel-native AI workforce platform*
*Researched: 2026-03-22*