docs(04-rbac): research phase RBAC domain

2026-03-24 13:27:22 -06:00
parent dc758e9e3a
commit 0dc21c6ee5


@@ -0,0 +1,684 @@
# Phase 4: RBAC - Research
**Researched:** 2026-03-24
**Domain:** Role-Based Access Control — FastAPI authorization middleware, Auth.js v5 JWT role claims, PostgreSQL schema migration, SMTP email via Python stdlib, Next.js 16 proxy-layer redirects
**Confidence:** HIGH
---
<user_constraints>
## User Constraints (from CONTEXT.md)
### Locked Decisions
**Role Definitions & Boundaries**
- Platform admin: Full access to all tenants, all agents, all users, platform settings. Uses the same portal with elevated access (no separate admin panel).
- Customer admin: Full control over their tenant — agents (CRUD), channels, billing (self-service via Stripe), BYO API keys, user management (invite/remove users). Can manage multiple tenants (agency/reseller use case).
- Customer operator: View agents, view conversations, view usage dashboards, send test messages to agents. Cannot create/edit/delete agents, no billing access, no API key management, no user management. Fixed role — granular permissions deferred to v2.
- Operators can send test messages to agents — useful for QA without giving edit access.
- Customer admins manage their own billing (subscribe, upgrade, cancel) — self-service, not admin-gated.
- Customer admins manage their own BYO API keys — self-service.
**Invitation & Onboarding Flow**
- Customer admin creates user in portal (name, email, role selection: admin or operator)
- System sends invite email via SMTP direct (no third-party transactional email service)
- Invite link valid for 48 hours — expired links show a clear message
- Customer admin can resend expired invites with a new 48-hour window (resend button on pending invites list)
- All user creation goes through the invite flow — even platform admins must use invites, no direct account creation with temporary passwords. Consistent and auditable.
- Activation page: Claude's discretion (set password only recommended — minimal friction)
**Portal Experience Per Role**
- Role-specific landing pages after login:
- Platform admin → platform overview (all tenants, global stats)
- Customer admin → tenant dashboard (their agents, usage summary)
- Customer operator → agent list (read-only view of their tenant's agents)
- Users with multiple tenants get a tenant switcher dropdown in the sidebar/header — switch without logging out
- Restricted nav items are hidden (not disabled/grayed) — operators don't see Billing, API Keys, User Management in sidebar
- Unauthorized URL access (e.g., operator navigates to /billing) → silent redirect to their home dashboard (no 403 error page)
- API endpoints return 403 Forbidden for unauthorized actions — defense in depth, not just hidden UI
**Platform Admin Capabilities**
- Impersonation: platform admin can "view as" a tenant — all impersonation actions logged in audit trail
- Global user management page: see all users across all tenants, filter by tenant/role, manage invites
- Platform admin sees the same portal as customers but with elevated access and a tenant picker (existing from Phase 1)
### Claude's Discretion
- Activation page design (set password only vs full profile setup)
- Invite email template content and styling
- SMTP configuration approach (env vars vs portal settings)
- Impersonation UI pattern (banner at top, dropdown, etc.)
- How role is stored in JWT (claim name, encoding)
- Database schema for user-tenant association (join table vs embedded)
- Tenant switcher dropdown visual design
### Deferred Ideas (OUT OF SCOPE)
- Granular operator permissions (configurable by customer admin) — v2 RBAC enhancement
- SSO/SAML for enterprise tenants — future authentication method
- Activity log visible to customer admins (who did what in their tenant) — separate observability phase
</user_constraints>
---
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|-----------------|
| RBAC-01 | Platform admin role with full access to all tenants, agents, users, and platform settings | FastAPI `Depends(require_platform_admin)` dependency; JWT claim `role=platform_admin`; no RLS tenant scoping for platform_admin queries |
| RBAC-02 | Customer admin role scoped to a single tenant with full control over agents, channels, billing, API keys, and user management | `Depends(require_tenant_admin)` with tenant membership check; many-to-many `user_tenant_roles` join table; scoped to caller's tenant_id |
| RBAC-03 | Customer operator role scoped to a single tenant with read-only access to agents, conversations, and usage dashboards | `Depends(require_tenant_member)` dependency; HTTP verbs restricted (GET only) for operator paths; test-message endpoint operator-allowed explicitly |
| RBAC-04 | Customer admin can invite users by email — invitee receives activation link to set password | `portal_invitations` table with HMAC-signed token + 48h expiry; Python stdlib `smtplib`/`email.mime` for SMTP; bcrypt password set on accept |
| RBAC-05 | Portal navigation, pages, and UI elements adapt based on user role | Auth.js v5 JWT carries `role` + `tenant_ids`; Nav component filters by role from `useSession()`; proxy.ts redirects unauthorized paths to role home |
| RBAC-06 | API endpoints enforce role-based authorization — unauthorized actions return 403 Forbidden, not just hidden UI | FastAPI `HTTPException(status_code=403)` from role-checking dependencies on all portal router endpoints |
</phase_requirements>
---
## Summary
Phase 4 adds RBAC on top of an already working auth system (Auth.js v5 JWT + FastAPI bcrypt verify). The existing `PortalUser` model has a boolean `is_admin` flag that must be replaced with a proper role enum (`platform_admin`, `customer_admin`, `customer_operator`). Because a customer admin can belong to multiple tenants (agency use case), user-tenant association requires a join table (`user_tenant_roles`) rather than a foreign key on `portal_users`. The invitation system uses time-limited HMAC-signed tokens stored in a `portal_invitations` table and delivered via Python's built-in `smtplib` — no third-party dependency.
Authorization enforcement splits into two layers: the Next.js 16 `proxy.ts` handles optimistic role-based redirects (reading role from the JWT cookie, no DB round-trip), and FastAPI `Depends()` dependencies enforce the hard server-side rules, returning 403. The proxy layer is the correct place for silent redirects per the official Next.js 16 auth guide. FastAPI dependency injection is the correct place for 403 enforcement. It is an additive layer on top of PostgreSQL RLS, not a replacement for it.
The impersonation feature needs one new JWT claim (`impersonating_tenant_id`) plus an AuditEvent row on every impersonated action. The tenant switcher is driven from the client: it updates `active_tenant_id` in the JWT, and the token is re-issued without a full page reload or a new sign-in.
**Primary recommendation:** Migrate `portal_users.is_admin` to a `role` enum in a single Alembic migration. Add `user_tenant_roles` join table. Add `portal_invitations` table. Wire FastAPI `Depends()` guards. Then update Auth.js JWT callbacks and proxy.ts last.
---
## Standard Stack
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| SQLAlchemy 2.0 | already in use (>=2.0.36) | ORM for new RBAC tables | Already established in codebase |
| Alembic | already in use (>=1.14.0) | DB migration for role enum + join table | Already established in codebase |
| FastAPI | already in use (>=0.115.0) | `Depends()` for role-checking dependencies | Already established in codebase |
| bcrypt | already in use (>=4.0.0) | Password hashing for invite activation | Already established in codebase |
| Python stdlib: `smtplib`, `email.mime` | stdlib (3.12) | SMTP email sending for invite emails | No new dependency; locked decision to avoid third-party transactional email |
| Python stdlib: `hmac`, `hashlib`, `secrets` | stdlib (3.12) | HMAC-signed invite token generation | No new dependency; cryptographically safe |
| Auth.js v5 | ^5.0.0-beta.30 (already in use) | JWT callbacks for `role` + `tenant_ids` claims | Already established in codebase |
| Next.js 16 `proxy.ts` | 16.2.1 (already in use) | Role-based redirect in proxy layer | Official Next.js 16 pattern (confirmed in bundled docs) |
| `useSession` from next-auth/react | already in use | Read role/tenant from JWT in client components | Already established pattern |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| `cryptography` (Fernet) | already in use (>=42.0.0) | Alternative token signing approach | Not recommended here — HMAC+secrets is simpler for short-lived invite tokens; Fernet used for BYO key encryption |
| `pydantic[email]` | already in use (>=2.12.0) | Email format validation on invite request | Already in shared pyproject.toml |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| Python stdlib smtplib | `aiosmtplib` | Async SMTP, but adds a dependency. smtplib works fine when called from a Celery task (sync context). Use aiosmtplib only if sending directly from an async FastAPI route without Celery. |
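| HMAC token in URL | JWT invite token | A JWT would require a JWT library on the Python side (`python-jose` or `PyJWT`, neither currently a dependency) and adds little for a single-purpose link; HMAC + `secrets` is more transparent. Both are safe for 48h tokens. |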
| Join table `user_tenant_roles` | `tenant_ids: list` on `portal_users` | PostgreSQL array on the user row is simpler but cannot store per-tenant role without extra complexity. Join table is the correct relational approach. |
**Installation:**
No new Python packages required — all needed libraries are already in `packages/shared/pyproject.toml` or Python stdlib.
Portal: no new npm packages required.
---
## Architecture Patterns
### Recommended Project Structure
New files needed:
```
packages/
├── shared/
│   └── shared/
│       ├── models/
│       │   └── auth.py                  # Add role enum, UserTenantRole model, Invitation model
│       └── api/
│           ├── portal.py                # Add RBAC guards to all existing endpoints
│           ├── rbac.py                  # NEW: FastAPI Depends() guards (require_platform_admin, etc.)
│           └── invitations.py           # NEW: Invite CRUD + accept endpoints
migrations/
└── versions/
    └── 006_rbac_roles.py                # NEW: role enum + user_tenant_roles + portal_invitations
packages/portal/
├── lib/
│   ├── auth.ts                          # Update JWT callbacks: role + tenant_ids + active_tenant_id
│   └── auth-types.ts                    # NEW: TypeScript types for role, augmented session
├── proxy.ts                             # Update: role-based redirects
├── components/
│   ├── nav.tsx                          # Update: role-filtered nav items
│   ├── tenant-switcher.tsx              # NEW: dropdown for multi-tenant users
│   └── impersonation-banner.tsx         # NEW: visible banner when impersonating
└── app/(dashboard)/
    ├── users/                           # NEW: per-tenant user management page
    │   └── page.tsx
    ├── admin/                           # NEW: platform admin, global users, all tenants
    │   └── users/
    │       └── page.tsx
    └── invite/                          # NEW: public invite acceptance page
        └── [token]/
            └── page.tsx
```
### Pattern 1: FastAPI Role-Checking Dependency
**What:** A set of dependencies that read the `X-Portal-User-Id`, `X-Portal-User-Role`, and `X-Portal-Tenant-Id` headers injected by the Next.js proxy, then validate the caller's permission.
**When to use:** On every portal API endpoint that has role requirements.
The existing portal calls FastAPI with no auth headers — Phase 4 must add a mechanism to pass the authenticated user's role and tenant context from the JWT to FastAPI. Two established approaches:
**Option A (recommended): Next.js proxy forwards role headers**
The Next.js API routes (or Server Actions) extract the JWT session via `auth()` and add `X-Portal-User-Id`, `X-Portal-User-Role`, and `X-Portal-Tenant-Id` headers to requests forwarded to FastAPI. FastAPI reads these trusted headers (only accepts them from the internal network / trusted origin).
**Option B: FastAPI validates the Auth.js JWT directly**
FastAPI re-validates the Auth.js JWT using the shared `AUTH_SECRET`. This is more secure in theory but adds `python-jose` or `PyJWT` as a new dependency and couples FastAPI to Auth.js token format.
**Recommendation: Option A** — consistent with how the existing portal API proxy works, simpler, and the internal network boundary already provides the trust layer. This is the same pattern used by the existing billing/channel endpoints.
```python
# Source: FastAPI dependency injection pattern (established in codebase)
# packages/shared/shared/api/rbac.py
import uuid
from typing import Annotated

from fastapi import Depends, Header, HTTPException, status
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

# NOTE: the two import paths below are assumptions; use the codebase's actual modules.
from shared.db import get_session
from shared.models.auth import UserTenantRole


class PortalCaller:
    """Extracted caller context from trusted proxy headers."""

    def __init__(self, user_id: uuid.UUID, role: str, tenant_id: uuid.UUID | None = None):
        self.user_id = user_id
        self.role = role
        self.tenant_id = tenant_id  # None for platform_admin calls not scoped to a tenant


async def get_portal_caller(
    x_portal_user_id: Annotated[str, Header()],
    x_portal_user_role: Annotated[str, Header()],
    x_portal_tenant_id: Annotated[str | None, Header()] = None,
) -> PortalCaller:
    try:
        user_id = uuid.UUID(x_portal_user_id)
        # Parse the tenant ID inside the try block so a malformed value is a 401, not a 500
        tenant_id = uuid.UUID(x_portal_tenant_id) if x_portal_tenant_id else None
    except ValueError:
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid caller identity")
    return PortalCaller(user_id=user_id, role=x_portal_user_role, tenant_id=tenant_id)


async def require_platform_admin(caller: Annotated[PortalCaller, Depends(get_portal_caller)]) -> PortalCaller:
    if caller.role != "platform_admin":
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Platform admin required")
    return caller


async def require_tenant_admin(
    tenant_id: uuid.UUID,  # from path param
    caller: Annotated[PortalCaller, Depends(get_portal_caller)],
    session: AsyncSession = Depends(get_session),
) -> PortalCaller:
    if caller.role == "platform_admin":
        return caller  # platform_admin bypasses tenant check
    if caller.role != "customer_admin":
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Admin role required")
    # Verify caller has admin role in this specific tenant
    membership = await session.execute(
        select(UserTenantRole).where(
            UserTenantRole.user_id == caller.user_id,
            UserTenantRole.tenant_id == tenant_id,
            UserTenantRole.role == "customer_admin",
        )
    )
    if membership.scalar_one_or_none() is None:
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Not a member of this tenant")
    return caller
```
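RBAC-03 also names a `require_tenant_member` guard that the block above does not show. Its decision logic could be sketched as a pure function, independent of FastAPI. The function name `is_allowed` and its parameters are hypothetical; the GET-only rule and the test-message exception come from the requirements table.

```python
from typing import Optional

# Operators are read-only: the requirements table restricts them to GET.
OPERATOR_METHODS = {"GET"}


def is_allowed(
    caller_role: str,
    membership_role: Optional[str],
    method: str,
    is_test_message: bool = False,
) -> bool:
    """Decide whether a request against one tenant should be permitted.

    caller_role: role forwarded by the proxy (hypothetical parameter name).
    membership_role: the caller's role in the target tenant, as stored in
        user_tenant_roles, or None when the caller has no membership row.
    is_test_message: True only for the explicitly operator-allowed
        test-message endpoint.
    """
    if caller_role == "platform_admin":
        return True  # explicit bypass, checked first
    if membership_role is None:
        return False  # not a member of this tenant
    if membership_role == "customer_admin":
        return True
    if membership_role == "customer_operator":
        return method.upper() in OPERATOR_METHODS or is_test_message
    return False  # unknown role values are denied
```

A FastAPI dependency wrapping this function would raise `HTTPException(status_code=403)` whenever it returns `False`.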
### Pattern 2: Auth.js v5 JWT with Role + Tenant Claims
**What:** Extend the existing JWT callback to store `role`, `tenant_ids`, and `active_tenant_id`.
**When to use:** Once on login, and when tenant switcher changes active tenant.
```typescript
// Source: Auth.js v5 JWT callback pattern, extends existing lib/auth.ts
// The authorize() response from FastAPI /auth/verify now returns role + tenant_ids
async jwt({ token, user }) {
  if (user) {
    const u = user as AuthVerifyResponse;
    token.role = u.role;                               // "platform_admin" | "customer_admin" | "customer_operator"
    token.tenant_ids = u.tenant_ids;                   // string[]: all tenants this user belongs to
    token.active_tenant_id = u.tenant_ids[0] ?? null;  // default to first tenant
  }
  return token;
},
async session({ session, token }) {
  session.user.id = token.sub ?? "";
  session.user.role = token.role as string;
  session.user.tenant_ids = token.tenant_ids as string[];
  session.user.active_tenant_id = token.active_tenant_id as string | null;
  return session;
},
```
### Pattern 3: Next.js 16 Proxy Role-Based Redirect
**What:** Extend `proxy.ts` to redirect unauthorized paths based on JWT role claim.
**When to use:** For silent redirects when an operator navigates to a restricted page.
Per the official Next.js 16 docs bundled in this repo (`node_modules/next/dist/docs/01-app/02-guides/authentication.md`): proxy should do **optimistic checks only** — read role from the JWT cookie without DB queries. Secure enforcement is FastAPI's responsibility.
The `redirect` in proxy.ts uses `NextResponse.redirect`, which is already in use in `proxy.ts`.
```typescript
// Extend existing proxy.ts
const PLATFORM_ADMIN_ONLY = ["/admin", "/tenants"];
const CUSTOMER_ADMIN_ONLY = ["/billing", "/settings/api-keys", "/users"];
const OPERATOR_HOME = "/agents";
const CUSTOMER_ADMIN_HOME = "/dashboard";
const PLATFORM_ADMIN_HOME = "/dashboard";

// After session check, add role-based redirect:
const role = (session?.user as { role?: string })?.role;
const startsWithAny = (paths: string[]) =>
  paths.some((path) => pathname.startsWith(path));

if (role === "customer_operator") {
  // Operators are kept out of both admin-only sections
  if (startsWithAny([...PLATFORM_ADMIN_ONLY, ...CUSTOMER_ADMIN_ONLY])) {
    return NextResponse.redirect(new URL(OPERATOR_HOME, request.url));
  }
} else if (role === "customer_admin") {
  // Customer admins are kept out of platform-admin-only sections
  if (startsWithAny(PLATFORM_ADMIN_ONLY)) {
    return NextResponse.redirect(new URL(CUSTOMER_ADMIN_HOME, request.url));
  }
}
```
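The redirect decision itself is language-neutral, so it can be prototyped and unit-tested outside the portal. The sketch below is in Python purely for illustration; the path lists and home routes mirror the constants above, while `redirect_target`, the segment-boundary matching, and the operator-level fallback for unknown roles are assumptions.

```python
from typing import Optional

PLATFORM_ADMIN_ONLY = ("/admin", "/tenants")
CUSTOMER_ADMIN_ONLY = ("/billing", "/settings/api-keys", "/users")
ROLE_HOME = {
    "platform_admin": "/dashboard",
    "customer_admin": "/dashboard",
    "customer_operator": "/agents",
}


def _under(pathname: str, prefix: str) -> bool:
    """Match on a path-segment boundary so "/users-guide" is not treated as "/users"."""
    return pathname == prefix or pathname.startswith(prefix + "/")


def redirect_target(role: str, pathname: str) -> Optional[str]:
    """Return the path to redirect to, or None when the request may proceed."""
    if role == "platform_admin":
        return None  # platform admins may visit every section
    restricted = list(PLATFORM_ADMIN_ONLY)
    if role != "customer_admin":
        # Operators (and, as an assumption, unknown roles) lose customer-admin pages too
        restricted += CUSTOMER_ADMIN_ONLY
    if any(_under(pathname, prefix) for prefix in restricted):
        return ROLE_HOME.get(role, "/agents")
    return None
```

Matching on a segment boundary is stricter than a bare prefix check; whether the proxy needs that distinction depends on the portal's actual route names.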
### Pattern 4: Invite Token Generation and Validation
**What:** HMAC-SHA256 signed, URL-safe token with 48-hour expiry embedded in the invite URL.
**When to use:** Creating and accepting invite links.
```python
# Source: Python stdlib hmac + secrets (same approach used for WhatsApp HMAC in Phase 2)
import base64
import hashlib
import hmac
import time

# `settings` is the application's existing config object (not defined in this snippet)
INVITE_SECRET = settings.invite_secret  # From .env, 32+ random bytes
INVITE_TTL_SECONDS = 48 * 3600


def _sign(payload: str) -> str:
    """HMAC-SHA256 of the payload, hex-encoded."""
    return hmac.new(INVITE_SECRET.encode(), payload.encode(), hashlib.sha256).hexdigest()


def generate_invite_token(invitation_id: str) -> str:
    """Generate a URL-safe HMAC-signed token embedding invite ID + timestamp."""
    timestamp = str(int(time.time()))
    payload = f"{invitation_id}:{timestamp}"
    raw = f"{payload}:{_sign(payload)}"
    # Encode as base64url (padding stripped) for URL safety
    return base64.urlsafe_b64encode(raw.encode()).decode().rstrip("=")


def validate_invite_token(token: str) -> str:
    """Returns invitation_id if valid, raises ValueError if malformed, expired, or tampered."""
    # Restore the base64 padding stripped at generation time
    padded = token + "=" * (-len(token) % 4)
    raw = base64.urlsafe_b64decode(padded).decode()
    invitation_id, timestamp, provided_sig = raw.rsplit(":", 2)
    expected_sig = _sign(f"{invitation_id}:{timestamp}")
    # Constant-time comparison; compare bytes, because compare_digest raises
    # TypeError when given a str containing non-ASCII characters
    if not hmac.compare_digest(expected_sig.encode(), provided_sig.encode()):
        raise ValueError("Invalid token signature")
    if int(time.time()) - int(timestamp) > INVITE_TTL_SECONDS:
        raise ValueError("Invite token expired")
    return invitation_id
```
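The `portal_invitations` table stores a SHA-256 hash of the raw token rather than the token itself. A minimal sketch of how the stored fields could be derived when an invite is created or resent follows; the helper names are hypothetical, and only the `token_hash`, `status`, and `expires_at` fields and the 48-hour window come from this document.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

INVITE_TTL = timedelta(hours=48)


def hash_invite_token(token: str) -> str:
    """SHA-256 hex digest of the raw token, suitable for the token_hash column."""
    return hashlib.sha256(token.encode()).hexdigest()


def new_invitation_fields(token: str, now: datetime) -> dict:
    """Fields to persist for a fresh or resent invite (hypothetical helper)."""
    return {
        "token_hash": hash_invite_token(token),
        "status": "pending",
        "expires_at": now + INVITE_TTL,  # resending starts a new 48-hour window
    }


def matches_stored_hash(token: str, stored_hash: str) -> bool:
    """Compare a presented token against the stored hash in constant time."""
    return hmac.compare_digest(hash_invite_token(token), stored_hash)


fields = new_invitation_fields("example-token", datetime(2026, 3, 24, tzinfo=timezone.utc))
```

Storing only the hash means a leaked database row cannot be replayed as a working invite link.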
### Pattern 5: SMTP Email via Python stdlib
**What:** Send invite emails using Python's `smtplib` + `email.mime`. Called from a Celery task (sync context — consistent with established codebase pattern that all Celery tasks are `sync def`).
**When to use:** Sending invite emails.
```python
# Source: Python stdlib email + smtplib
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def send_invite_email(
    to_email: str,
    invitee_name: str,
    tenant_name: str,
    invite_url: str,
    smtp_host: str,
    smtp_port: int,
    smtp_username: str,
    smtp_password: str,
    from_email: str,
) -> None:
    """Sync function: call from Celery task, not async FastAPI handler."""
    msg = MIMEMultipart("alternative")
    msg["Subject"] = f"You've been invited to join {tenant_name} on Konstruct"
    msg["From"] = from_email
    msg["To"] = to_email

    text_body = f"""
Hi {invitee_name},

You've been invited to join {tenant_name} on Konstruct as an AI workforce administrator.

Accept your invitation and set up your account here:
{invite_url}

This link expires in 48 hours.

— The Konstruct Team
"""
    msg.attach(MIMEText(text_body, "plain"))

    with smtplib.SMTP(smtp_host, smtp_port) as server:
        server.starttls()  # upgrade the connection to TLS before authenticating
        server.login(smtp_username, smtp_password)
        server.sendmail(from_email, to_email, msg.as_string())
```
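Separating message construction from delivery makes the template testable without an SMTP server. The sketch below uses the stdlib `email.message.EmailMessage` class as an alternative to `MIMEMultipart`; the function name `build_invite_message` is hypothetical, and the addresses in the usage line are placeholders.

```python
from email.message import EmailMessage


def build_invite_message(
    to_email: str,
    invitee_name: str,
    tenant_name: str,
    invite_url: str,
    from_email: str,
) -> EmailMessage:
    """Build the invite email without sending it (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = f"You've been invited to join {tenant_name} on Konstruct"
    msg["From"] = from_email
    msg["To"] = to_email
    # set_content() creates a text/plain body
    msg.set_content(
        f"Hi {invitee_name},\n\n"
        f"You've been invited to join {tenant_name} on Konstruct.\n\n"
        f"Accept your invitation and set up your account here:\n{invite_url}\n\n"
        "This link expires in 48 hours.\n"
    )
    return msg


example = build_invite_message(
    "invitee@example.com",
    "Ana",
    "Acme",
    "https://portal.example.com/invite/TOKEN",
    "noreply@example.com",
)
```

An `EmailMessage` built this way can be handed to `smtplib.SMTP.send_message()` inside the same `with smtplib.SMTP(...)` block shown above.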
### Anti-Patterns to Avoid
- **Checking role only in the UI:** Nav hiding is cosmetic. Every API endpoint that mutates data must also check role via FastAPI `Depends()`. The decision text explicitly states "defense in depth, not just hidden UI."
- **Using RLS for RBAC enforcement:** RLS enforces tenant isolation (which tenant's data). RBAC enforces what the user can DO within a tenant. These are separate layers — RLS is additive protection, not a substitute for endpoint guards.
- **Relying on a single `portal_users.role` column for tenant permissions:** Customer admins can belong to multiple tenants with potentially different roles per tenant (admin in tenant A, operator in tenant B). The join table `user_tenant_roles` is required for per-tenant roles.
- **Database lookup in proxy.ts:** The official Next.js 16 docs explicitly warn: proxy should only read from cookies, not make DB calls. The proxy layer is for optimistic redirects only.
- **Skipping impersonation audit logging:** Impersonated actions must emit `AuditEvent` rows with `action_type='impersonation'` and the platform admin's user_id in `event_metadata`. This is a locked decision.
- **Async def for Celery email task:** The codebase has a hard constraint: all Celery tasks are `sync def` with `asyncio.run()`. The SMTP send function must follow this pattern.
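
The impersonation audit requirement can be made concrete with a small payload builder. This is a sketch only: `action_type='impersonation'` and the admin's user ID inside `event_metadata` come from the list above, while the function name, the `tenant_id`, `action`, and `created_at` keys, and the metadata key name are assumptions about the `AuditEvent` model.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_impersonation_audit(
    admin_user_id: uuid.UUID,
    tenant_id: uuid.UUID,
    action: str,
    now: Optional[datetime] = None,
) -> dict:
    """Assemble the fields for one audit row (hypothetical shape)."""
    return {
        "action_type": "impersonation",
        "tenant_id": str(tenant_id),  # assumed column: the impersonated tenant
        "event_metadata": {
            "platform_admin_user_id": str(admin_user_id),  # assumed key name
            "action": action,  # what the admin did while impersonating
        },
        "created_at": (now or datetime.now(timezone.utc)).isoformat(),
    }


event = build_impersonation_audit(
    uuid.UUID(int=1),
    uuid.UUID(int=2),
    "agent.update",
    now=datetime(2026, 3, 24, tzinfo=timezone.utc),
)
```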
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| HMAC token timing-safe comparison | Custom string compare | `hmac.compare_digest()` | Prevents timing attacks — already used in WhatsApp signature verification (Phase 2) |
| Password hashing | Custom hash | `bcrypt` (already in use) | bcrypt already used for all PortalUser passwords |
| Email format validation | Regex | `pydantic[email]` (already in use) | Already declared in shared pyproject.toml |
| JWT claims augmentation | Custom token issuer | Auth.js v5 JWT callbacks (already in use) | Cleanest extension point for existing JWT strategy |
| Role enum validation | Custom if/else | PostgreSQL `CHECK` constraint + Python `enum.Enum` | DB-level constraint catches bugs at persistence layer |
**Key insight:** No new dependencies needed. All building blocks (HMAC, bcrypt, smtplib, FastAPI Depends, SQLAlchemy enum, Auth.js JWT callbacks) are already present in the codebase.
---
## Common Pitfalls
### Pitfall 1: `platform_admin` Bypassing Tenant Scope Must Be Explicit
**What goes wrong:** A `require_tenant_admin` dependency that checks tenant membership will block platform admins from cross-tenant operations unless the code explicitly short-circuits for `role == "platform_admin"`.
**Why it happens:** The membership check looks up `user_tenant_roles` — platform admin may not have rows in that table for most tenants.
**How to avoid:** Every `require_tenant_*` dependency must have: `if caller.role == "platform_admin": return caller` as the first check.
**Warning signs:** Platform admin getting 403 on cross-tenant endpoints.
### Pitfall 2: Auth.js v5 TypeScript Type Augmentation Required
**What goes wrong:** TypeScript errors when accessing `session.user.role` because the default Auth.js `User` and `Session` types don't include `role` or `tenant_ids`.
**Why it happens:** Auth.js v5 uses module augmentation for type extensions, not direct type overriding.
**How to avoid:** Create `lib/auth-types.ts` that extends Auth.js types:
```typescript
// lib/auth-types.ts
declare module "next-auth" {
  interface User {
    role?: string;
    tenant_ids?: string[];
    active_tenant_id?: string | null;
  }
  interface Session {
    user: User & { id: string; role: string; tenant_ids: string[]; active_tenant_id: string | null };
  }
}
declare module "next-auth/jwt" {
  interface JWT {
    role?: string;
    tenant_ids?: string[];
    active_tenant_id?: string | null;
  }
}
```
**Warning signs:** TypeScript compilation errors on `session.user.role` in proxy.ts or nav components.
### Pitfall 3: Invitation Token Expiry Check Must Be at Accept Time, Not Just Display Time
**What goes wrong:** Checking only `invitation.expires_at < now()` in the UI still allows a race where a valid-looking token is submitted after expiry.
**Why it happens:** Frontend-only expiry check is not authoritative.
**How to avoid:** The FastAPI `/invitations/accept` endpoint must re-validate the token timestamp and check `portal_invitations.status == 'pending'` in an atomic DB operation. Mark invitation as `accepted` in the same transaction as creating the user account.
**Warning signs:** Accepted invites still show in pending list; double-activation possible if link clicked twice.
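The atomic accept can be expressed as a conditional `UPDATE` whose row count decides the outcome. The sketch below uses the stdlib `sqlite3` module only so that it runs standalone; production code would run the same statement against PostgreSQL through SQLAlchemy. The reduced schema, integer timestamps, and the `hashed_password` column name are assumptions.

```python
import sqlite3

# Reduced schema for illustration; the real tables have more columns.
SCHEMA = """
CREATE TABLE portal_invitations (
    id TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    expires_at INTEGER NOT NULL
);
CREATE TABLE portal_users (
    email TEXT PRIMARY KEY,
    hashed_password TEXT NOT NULL
);
"""


def accept_invitation(
    conn: sqlite3.Connection,
    invitation_id: str,
    now_ts: int,
    email: str,
    password_hash: str,
) -> bool:
    """Mark the invite accepted and create the user in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE portal_invitations SET status = 'accepted' "
            "WHERE id = ? AND status = 'pending' AND expires_at > ?",
            (invitation_id, now_ts),
        )
        if cur.rowcount != 1:
            return False  # already accepted, expired, or unknown
        conn.execute(
            "INSERT INTO portal_users (email, hashed_password) VALUES (?, ?)",
            (email, password_hash),
        )
    return True
```

Because the status and expiry checks live in the `WHERE` clause, a second click on the same link updates zero rows and creates no second account.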
### Pitfall 4: Celery Task for Email Must Not Use `async def`
**What goes wrong:** An `async def` Celery task that calls `smtplib` (sync) or tries to use `await` in the task body — Celery's worker does not run an event loop natively.
**Why it happens:** Developer instinct to make everything async in an async codebase.
**How to avoid:** Celery tasks are always `sync def`. If async DB access is needed inside the task, use `asyncio.run()` (established pattern from Phase 1 — all existing Celery tasks do this).
**Warning signs:** `RuntimeError: no running event loop` in Celery worker logs.
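The sync-task shape can be sketched with the stdlib alone. The Celery task decorator is omitted so the example runs without Celery installed, and both function names and the returned dictionary are hypothetical stand-ins.

```python
import asyncio


async def _load_invite_context(invitation_id: str) -> dict:
    """Stand-in for async DB access (hypothetical helper)."""
    await asyncio.sleep(0)  # represents an awaited query
    return {"invitation_id": invitation_id, "tenant_name": "Example Tenant"}


def send_invite_email_task(invitation_id: str) -> dict:
    """Plain sync function; in the codebase this would carry the Celery task decorator."""
    # asyncio.run() creates and closes an event loop for this one call
    context = asyncio.run(_load_invite_context(invitation_id))
    # The synchronous smtplib send would happen here, outside any event loop
    return context


result = send_invite_email_task("inv-1")
```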
### Pitfall 5: JWT Token Size Limit
**What goes wrong:** Adding `tenant_ids` (list of UUIDs) to the JWT makes the cookie exceed browser limits (~4KB) for users with many tenants.
**Why it happens:** JWT cookies are bounded by HTTP cookie size.
**How to avoid:** Keep the claims compact: `active_tenant_id` as a single UUID and `tenant_ids` as a plain array of UUID strings, never full tenant objects. Realistically, v1 users will have 1-3 tenants; this is a precaution, not an immediate crisis.
**Warning signs:** Auth.js session errors for users with >20 tenant memberships.
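A rough estimate shows how much room the `tenant_ids` claim takes. The sketch below measures the compact JSON array and its base64url encoding; it does not model the rest of the token or the overhead Auth.js adds when it encodes the session cookie, so treat the numbers as a lower bound. The function name is hypothetical.

```python
import base64
import json
import uuid


def tenant_claim_size(n_tenants: int) -> tuple[int, int]:
    """Return (raw JSON bytes, base64url bytes) for a tenant_ids claim."""
    tenant_ids = [str(uuid.uuid4()) for _ in range(n_tenants)]
    # Compact separators: each UUID costs 36 chars + 2 quotes + 1 comma
    raw = json.dumps(tenant_ids, separators=(",", ":")).encode()
    encoded = base64.urlsafe_b64encode(raw)
    return len(raw), len(encoded)


raw_size, encoded_size = tenant_claim_size(20)
```

With 20 memberships the claim alone is 781 bytes of JSON, or 1044 bytes once base64url-encoded, which is a meaningful share of a cookie capped at roughly 4KB.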
### Pitfall 6: `portal_users` Table Has No RLS — User Enumeration Risk
**What goes wrong:** The `/users` endpoint for global user management (platform admin only) queries `portal_users` without RLS. Without the `require_platform_admin` guard, any authenticated user could enumerate all users.
**Why it happens:** `portal_users` intentionally has no RLS (noted in the existing model comment: "RLS is NOT applied to this table"). Authorization is application-layer only.
**How to avoid:** Every endpoint that touches `portal_users` without a tenant filter MUST use `require_platform_admin`. Per-tenant user management endpoints use `require_tenant_admin` + filter by `user_tenant_roles.tenant_id`.
**Warning signs:** Customer admin able to see users from other tenants.
### Pitfall 7: `is_admin` → `role` Migration Must Handle Existing Data
**What goes wrong:** Alembic migration drops `is_admin` and adds `role` enum without migrating existing rows — existing platform admins lose access.
**Why it happens:** Schema-only migration without data backfill.
**How to avoid:** Migration must: (1) add `role` as a nullable column, (2) backfill every row, setting `role = 'platform_admin'` where `is_admin = true` and `'customer_admin'` otherwise, (3) add the NOT NULL and CHECK constraints, (4) then drop `is_admin`. Do all of this in a single migration; do not split it across multiple migrations.
**Warning signs:** Existing users cannot log in after migration.
---
## Code Examples
### Database Schema: New Tables
```python
# Source: SQLAlchemy 2.0 ORM pattern, established in packages/shared/shared/models/
import enum
import uuid
from datetime import datetime

from sqlalchemy import DateTime, ForeignKey, String, UniqueConstraint, func
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import Mapped, mapped_column

# `Base` is the declarative base already defined in the shared models package


class UserRole(str, enum.Enum):
    PLATFORM_ADMIN = "platform_admin"
    CUSTOMER_ADMIN = "customer_admin"
    CUSTOMER_OPERATOR = "customer_operator"


class UserTenantRole(Base):
    """
    Associates a portal user with a tenant and their role in that tenant.
    A user can have different roles in different tenants (agency use case).
    platform_admin users do not require rows here; they bypass tenant checks.
    """
    __tablename__ = "user_tenant_roles"
    __table_args__ = (
        UniqueConstraint("user_id", "tenant_id", name="uq_user_tenant"),
    )

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    user_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("portal_users.id", ondelete="CASCADE"), nullable=False, index=True)
    tenant_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("tenants.id", ondelete="CASCADE"), nullable=False, index=True)
    role: Mapped[str] = mapped_column(String(50), nullable=False)  # TEXT with CHECK constraint; avoids SQLAlchemy Enum DDL issues (per Phase 1 decision)
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False, server_default=func.now())


class PortalInvitation(Base):
    """
    Pending email invitations. Token is HMAC-signed and expires after 48 hours.
    Status: 'pending' | 'accepted' | 'expired'
    """
    __tablename__ = "portal_invitations"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    email: Mapped[str] = mapped_column(String(255), nullable=False, index=True)
    name: Mapped[str] = mapped_column(String(255), nullable=False)
    tenant_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("tenants.id", ondelete="CASCADE"), nullable=False, index=True)
    role: Mapped[str] = mapped_column(String(50), nullable=False)  # customer_admin | customer_operator
    invited_by: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("portal_users.id"), nullable=False)
    token_hash: Mapped[str] = mapped_column(String(255), nullable=False, unique=True)  # SHA-256 hash of raw token
    status: Mapped[str] = mapped_column(String(20), nullable=False, default="pending")
    expires_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False)
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False, server_default=func.now())
```
### `portal_users` Migration: `is_admin` → `role`
```python
# Source: Alembic migration pattern, established in migrations/versions/
import sqlalchemy as sa
from alembic import op


def upgrade() -> None:
    # 1. Add role column (nullable initially to allow backfill)
    op.add_column("portal_users", sa.Column("role", sa.String(50), nullable=True))

    # 2. Backfill: existing is_admin=True → platform_admin, others → customer_admin
    op.execute("""
        UPDATE portal_users
        SET role = CASE WHEN is_admin = TRUE THEN 'platform_admin' ELSE 'customer_admin' END
    """)

    # 3. Add NOT NULL constraint now that all rows have a value
    op.alter_column("portal_users", "role", nullable=False)

    # 4. Add CHECK constraint (TEXT enum pattern; avoids SQLAlchemy Enum DDL issues per Phase 1 decision)
    op.execute("""
        ALTER TABLE portal_users
        ADD CONSTRAINT ck_portal_users_role
        CHECK (role IN ('platform_admin', 'customer_admin', 'customer_operator'))
    """)

    # 5. Drop is_admin column
    op.drop_column("portal_users", "is_admin")
```
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `is_admin: bool` on PortalUser | `role: str` enum + `user_tenant_roles` join table | Phase 4 | Enables multi-tenant membership and typed roles |
| No API authorization | FastAPI `Depends(require_*)` guards on every endpoint | Phase 4 | All portal endpoints get 403 enforcement |
| No invite flow; direct registration via `/auth/register` | Invite-only user creation; `/auth/register` endpoint deprecated/removed | Phase 4 | All users created through auditable invite flow |
| `is_admin` in JWT | `role` + `tenant_ids` + `active_tenant_id` in JWT | Phase 4 | Proxy can redirect by role; tenant switcher uses active_tenant_id |
**Deprecated/outdated after Phase 4:**
- `portal_users.is_admin`: Replaced by `portal_users.role` + `user_tenant_roles`.
- `/api/portal/auth/register` endpoint: Replaced by invite-only flow. Should be removed or locked to platform_admin only with an immediate deprecation comment.
- `AuthVerifyResponse.is_admin` field: Replaced by `role` + `tenant_ids` + `active_tenant_id`.
---
## Open Questions
1. **Tenant switcher: re-issue JWT or use URL state?**
- What we know: The locked decision says "switch without logging out"; the summary above additionally targets no full page reload. Auth.js v5 JWT strategy means the token is a signed cookie.
- What's unclear: Auth.js v5 does not change token claims mid-session on its own, so the switch needs an explicit trigger. The `update()` function from `useSession()` can trigger the `jwt` callback, provided the callback is written to handle it.
- Recommendation: Use Auth.js v5 `update()` session method which triggers the `jwt` callback with `trigger: "update"` — pass `{ active_tenant_id: newTenantId }` as the update payload. This is the supported pattern for mid-session JWT updates in Auth.js v5.
2. **`/auth/register` endpoint — remove or gate?**
- What we know: All user creation goes through invites per locked decision. The existing `/auth/register` endpoint allows direct account creation.
- What's unclear: Whether there's a seeding/bootstrap use case for initial platform admin creation.
- Recommendation: Keep the endpoint but gate it behind `require_platform_admin` with a deprecation notice. Seed the initial platform admin via a one-time script or environment-variable bootstrap (not via the portal).
3. **SMTP configuration approach**
- What we know: SMTP direct is locked; configuration approach is discretionary.
- Recommendation: Store SMTP config in `.env` / `settings` (same pattern as all other secrets — `SMTP_HOST`, `SMTP_PORT`, `SMTP_USERNAME`, `SMTP_PASSWORD`, `SMTP_FROM_EMAIL`). No portal settings UI needed for v1.
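
The SMTP recommendation above could look roughly like the following. This is a minimal sketch, not the project's implementation: `SmtpSettings`, `build_invite_email`, and `send_email` are hypothetical names, the default port 587 and the use of STARTTLS are assumptions, and only the environment variable names come from the recommendation. Building the message is kept separate from sending it so that the message can be unit-tested without a mail server.

```python
import os
import smtplib
from dataclasses import dataclass
from email.message import EmailMessage


@dataclass(frozen=True)
class SmtpSettings:
    """Hypothetical settings holder; field names mirror the env vars above."""
    host: str
    port: int
    username: str
    password: str
    from_email: str

    @classmethod
    def from_env(cls, env=os.environ) -> "SmtpSettings":
        # Port 587 as a fallback is an assumption, not a project decision.
        return cls(
            host=env["SMTP_HOST"],
            port=int(env.get("SMTP_PORT", "587")),
            username=env["SMTP_USERNAME"],
            password=env["SMTP_PASSWORD"],
            from_email=env["SMTP_FROM_EMAIL"],
        )


def build_invite_email(settings: SmtpSettings, to_email: str, invite_url: str) -> EmailMessage:
    """Build the invite message without touching the network."""
    msg = EmailMessage()
    msg["Subject"] = "You have been invited"
    msg["From"] = settings.from_email
    msg["To"] = to_email
    msg.set_content(f"Activate your account within 48 hours:\n{invite_url}\n")
    return msg


def send_email(settings: SmtpSettings, msg: EmailMessage) -> None:
    """Send over SMTP with STARTTLS (assumed; adjust for the actual relay)."""
    with smtplib.SMTP(settings.host, settings.port, timeout=10) as smtp:
        smtp.starttls()
        smtp.login(settings.username, settings.password)
        smtp.send_message(msg)
```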
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | pytest 8.3+ with pytest-asyncio 0.25+ |
| Config file | `pyproject.toml` (`[tool.pytest.ini_options]`) |
| Quick run command | `pytest tests/unit -x` |
| Full suite command | `pytest tests/ -x` |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| RBAC-01 | Platform admin gets 200 on cross-tenant endpoints; non-admin gets 403 | unit | `pytest tests/unit/test_rbac_guards.py -x` | Wave 0 |
| RBAC-02 | Customer admin gets 200 on own-tenant endpoints; gets 403 on other tenants | unit | `pytest tests/unit/test_rbac_guards.py -x` | Wave 0 |
| RBAC-03 | Customer operator gets 403 on mutating endpoints; gets 200 on GET endpoints | unit | `pytest tests/unit/test_rbac_guards.py -x` | Wave 0 |
| RBAC-04 | Invite creation, token generation, token validation (TTL + HMAC), accept flow | unit | `pytest tests/unit/test_invitations.py -x` | Wave 0 |
| RBAC-04 | Full invite→accept integration: invite created, email triggered, user activated | integration | `pytest tests/integration/test_invite_flow.py -x` | Wave 0 |
| RBAC-05 | JWT contains role + tenant_ids after verify; active_tenant_id present | unit | `pytest tests/unit/test_portal_auth.py -x` | Wave 0 (extend existing test_portal_tenants.py pattern) |
| RBAC-06 | Every portal endpoint returns 403 without role headers; returns 200 with correct role | integration | `pytest tests/integration/test_portal_rbac.py -x` | Wave 0 |
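
The status codes expected for RBAC-01 through RBAC-03 can be summarized as one pure decision function, which is convenient to assert against before the FastAPI guards exist. This is only a sketch under assumptions: `Caller` and `expected_status` are hypothetical names, and the operator exception for sending test messages is deliberately not modeled here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Caller:
    """Hypothetical stand-in for the role claims carried in the JWT."""
    role: str
    tenant_ids: Tuple[str, ...] = ()


def expected_status(caller: Caller, tenant_id: str, mutating: bool) -> int:
    """Return the HTTP status the test map expects for a tenant-scoped request."""
    # RBAC-01: platform admins may reach every tenant.
    if caller.role == "platform_admin":
        return 200
    # RBAC-02: everyone else is confined to tenants they belong to.
    if tenant_id not in caller.tenant_ids:
        return 403
    if caller.role == "customer_admin":
        return 200
    # RBAC-03: operators are read-only within their tenant.
    if caller.role == "customer_operator":
        return 403 if mutating else 200
    # Unknown roles are denied by default.
    return 403
```

Denying unknown roles by default keeps the function safe if a new role value is added to the CHECK constraint before the guards are updated.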
### Sampling Rate
- **Per task commit:** `pytest tests/unit -x`
- **Per wave merge:** `pytest tests/ -x`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `tests/unit/test_rbac_guards.py` — unit tests for FastAPI `require_platform_admin`, `require_tenant_admin`, `require_tenant_member` dependencies
- [ ] `tests/unit/test_invitations.py` — unit tests for HMAC token generation, expiry validation, token tampering detection
- [ ] `tests/integration/test_invite_flow.py` — end-to-end invite creation → email mock → accept → login
- [ ] `tests/integration/test_portal_rbac.py` — covers RBAC-06: all portal endpoints tested with correct/incorrect role headers
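
The three behaviors `test_invitations.py` is meant to cover (token generation, expiry validation, tampering detection) can be illustrated with a stdlib-only sketch. The token layout (`payload.signature`, both base64url), the function names, and the choice of SHA-256 are assumptions made for illustration; only the 48-hour TTL and the use of HMAC come from this document. The `now` parameter exists so tests can control the clock instead of sleeping.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

INVITE_TTL_SECONDS = 48 * 60 * 60  # 48-hour window from the locked decisions


def make_invite_token(invite_id: str, secret: bytes, now: Optional[float] = None) -> str:
    """Sign "<invite_id>:<issued_at>" and return "payload.signature"."""
    issued_at = int(time.time() if now is None else now)
    payload = f"{invite_id}:{issued_at}".encode()
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def verify_invite_token(token: str, secret: bytes, now: Optional[float] = None) -> str:
    """Return the invite id, or raise ValueError if malformed, forged, or expired."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError as exc:  # also covers binascii.Error
        raise ValueError("malformed token") from exc

    # Check the signature before trusting anything inside the payload.
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")

    invite_id, issued_at = payload.decode().rsplit(":", 1)
    current = time.time() if now is None else now
    if current - int(issued_at) > INVITE_TTL_SECONDS:
        raise ValueError("expired")
    return invite_id
```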
---
## Sources
### Primary (HIGH confidence)
- Official Next.js 16 docs (bundled): `packages/portal/node_modules/next/dist/docs/01-app/02-guides/authentication.md` — proxy-layer auth pattern, optimistic check guidance, Data Access Layer recommendation
- Official Next.js 16 docs (bundled): `packages/portal/node_modules/next/dist/docs/01-app/02-guides/redirecting.md` — `NextResponse.redirect` in proxy.ts
- Codebase review: `packages/shared/shared/models/auth.py` — current PortalUser schema (is_admin flag)
- Codebase review: `packages/portal/lib/auth.ts` — Auth.js v5 JWT callbacks (existing pattern to extend)
- Codebase review: `packages/shared/shared/api/portal.py` — all existing endpoints needing guards
- Codebase review: `packages/portal/proxy.ts` — proxy.ts structure to extend
- Codebase review: `migrations/versions/001_initial_schema.py` — TEXT+CHECK pattern for enum columns (Phase 1 decision)
- Codebase review: `packages/shared/shared/models/audit.py` — AuditEvent model for impersonation logging
- Codebase review: `.planning/STATE.md` — critical architecture decisions from all prior phases
### Secondary (MEDIUM confidence)
- Python stdlib `smtplib` and `email.mime` documentation — part of the standard library, so there is no third-party version to pin
- Auth.js v5 `update()` session method — documented in Auth.js v5 beta docs; consistent with JWT callback `trigger: "update"` pattern
### Tertiary (LOW confidence)
- Auth.js v5 module augmentation TypeScript pattern — inferred from Auth.js v5 docs and TypeScript convention; confirmed functional in existing portal TypeScript setup
---
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH — all core dependencies already in codebase; no new libraries introduced
- Architecture patterns: HIGH — FastAPI Depends() and Auth.js JWT callbacks are established patterns; schema migration pattern confirmed from prior phases
- Pitfalls: HIGH — directly derived from prior phase decisions logged in STATE.md (TEXT+CHECK for enums, sync Celery tasks, portal_users has no RLS)
**Research date:** 2026-03-24
**Valid until:** 2026-05-24 (stable stack — all libraries at fixed versions in pyproject.toml)