Phase 4: RBAC - Research

Researched: 2026-03-24
Domain: Role-Based Access Control (FastAPI authorization middleware, Auth.js v5 JWT role claims, PostgreSQL schema migration, SMTP email via Python stdlib, Next.js 16 proxy-layer redirects)
Confidence: HIGH


<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

Role Definitions & Boundaries

  • Platform admin: Full access to all tenants, all agents, all users, platform settings. Uses the same portal with elevated access (no separate admin panel).
  • Customer admin: Full control over their tenant — agents (CRUD), channels, billing (self-service via Stripe), BYO API keys, user management (invite/remove users). Can manage multiple tenants (agency/reseller use case).
  • Customer operator: View agents, view conversations, view usage dashboards, send test messages to agents. Cannot create/edit/delete agents, no billing access, no API key management, no user management. Fixed role — granular permissions deferred to v2.
  • Operators can send test messages to agents — useful for QA without giving edit access.
  • Customer admins manage their own billing (subscribe, upgrade, cancel) — self-service, not admin-gated.
  • Customer admins manage their own BYO API keys — self-service.

Invitation & Onboarding Flow

  • Customer admin creates user in portal (name, email, role selection: admin or operator)
  • System sends invite email via SMTP direct (no third-party transactional email service)
  • Invite link valid for 48 hours — expired links show a clear message
  • Customer admin can resend expired invites with a new 48-hour window (resend button on pending invites list)
  • All user creation goes through the invite flow — even platform admins must use invites, no direct account creation with temporary passwords. Consistent and auditable.
  • Activation page: Claude's discretion (set password only recommended — minimal friction)

Portal Experience Per Role

  • Role-specific landing pages after login:
    • Platform admin → platform overview (all tenants, global stats)
    • Customer admin → tenant dashboard (their agents, usage summary)
    • Customer operator → agent list (read-only view of their tenant's agents)
  • Users with multiple tenants get a tenant switcher dropdown in the sidebar/header — switch without logging out
  • Restricted nav items are hidden (not disabled/grayed) — operators don't see Billing, API Keys, User Management in sidebar
  • Unauthorized URL access (e.g., operator navigates to /billing) → silent redirect to their home dashboard (no 403 error page)
  • API endpoints return 403 Forbidden for unauthorized actions — defense in depth, not just hidden UI

Platform Admin Capabilities

  • Impersonation: platform admin can "view as" a tenant — all impersonation actions logged in audit trail
  • Global user management page: see all users across all tenants, filter by tenant/role, manage invites
  • Platform admin sees the same portal as customers but with elevated access and a tenant picker (existing from Phase 1)

Claude's Discretion

  • Activation page design (set password only vs full profile setup)
  • Invite email template content and styling
  • SMTP configuration approach (env vars vs portal settings)
  • Impersonation UI pattern (banner at top, dropdown, etc.)
  • How role is stored in JWT (claim name, encoding)
  • Database schema for user-tenant association (join table vs embedded)
  • Tenant switcher dropdown visual design

Deferred Ideas (OUT OF SCOPE)

  • Granular operator permissions (configurable by customer admin) — v2 RBAC enhancement
  • SSO/SAML for enterprise tenants — future authentication method
  • Activity log visible to customer admins (who did what in their tenant) — separate observability phase

</user_constraints>

<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| RBAC-01 | Platform admin role with full access to all tenants, agents, users, and platform settings | FastAPI Depends(require_platform_admin) dependency; JWT claim role=platform_admin; no RLS tenant scoping for platform_admin queries |
| RBAC-02 | Customer admin role scoped to a single tenant with full control over agents, channels, billing, API keys, and user management | Depends(require_tenant_admin) with tenant membership check; many-to-many user_tenant_roles join table; scoped to caller's tenant_id |
| RBAC-03 | Customer operator role scoped to a single tenant with read-only access to agents, conversations, and usage dashboards | Depends(require_tenant_member) dependency; HTTP verbs restricted (GET only) for operator paths; test-message endpoint operator-allowed explicitly |
| RBAC-04 | Customer admin can invite users by email; invitee receives activation link to set password | portal_invitations table with HMAC-signed token + 48h expiry; Python stdlib smtplib/email.mime for SMTP; bcrypt password set on accept |
| RBAC-05 | Portal navigation, pages, and UI elements adapt based on user role | Auth.js v5 JWT carries role + tenant_ids; Nav component filters by role from useSession(); proxy.ts redirects unauthorized paths to role home |
| RBAC-06 | API endpoints enforce role-based authorization; unauthorized actions return 403 Forbidden, not just hidden UI | FastAPI HTTPException(status_code=403) from role-checking dependencies on all portal router endpoints |
</phase_requirements>

Summary

Phase 4 adds RBAC on top of an already working auth system (Auth.js v5 JWT + FastAPI bcrypt verify). The existing PortalUser model has a boolean is_admin flag that must be replaced with a proper role enum (platform_admin, customer_admin, customer_operator). Because a customer admin can belong to multiple tenants (agency use case), user-tenant association requires a join table (user_tenant_roles) rather than a foreign key on portal_users. The invitation system uses time-limited HMAC-signed tokens stored in a portal_invitations table and delivered via Python's built-in smtplib — no third-party dependency.

Authorization enforcement splits into two layers: the Next.js 16 proxy.ts handles optimistic role-based redirects (reading role from the JWT cookie, no DB round-trip), and FastAPI Depends() dependencies enforce the hard server-side rules, returning 403. The proxy layer is the correct place for silent redirects per the official Next.js 16 auth guide. FastAPI dependency injection is the correct place for 403 enforcement; it is an additive layer on top of PostgreSQL RLS, not a replacement for it.

The impersonation feature needs one new JWT claim (impersonating_tenant_id) plus an AuditEvent row on every impersonated action. The tenant switcher needs no new sign-in: update active_tenant_id in the JWT and re-issue the token without a full page reload.

Primary recommendation: Migrate portal_users.is_admin to a role enum in a single Alembic migration. Add user_tenant_roles join table. Add portal_invitations table. Wire FastAPI Depends() guards. Then update Auth.js JWT callbacks and proxy.ts last.


Standard Stack

Core

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| SQLAlchemy 2.0 | already in use (>=2.0.36) | ORM for new RBAC tables | Already established in codebase |
| Alembic | already in use (>=1.14.0) | DB migration for role enum + join table | Already established in codebase |
| FastAPI | already in use (>=0.115.0) | Depends() for role-checking dependencies | Already established in codebase |
| bcrypt | already in use (>=4.0.0) | Password hashing for invite activation | Already established in codebase |
| Python stdlib: smtplib, email.mime | stdlib (3.12) | SMTP email sending for invite emails | No new dependency; locked decision to avoid third-party transactional email |
| Python stdlib: hmac, hashlib, secrets | stdlib (3.12) | HMAC-signed invite token generation | No new dependency; cryptographically safe |
| Auth.js v5 | ^5.0.0-beta.30 (already in use) | JWT callbacks for role + tenant_ids claims | Already established in codebase |
| Next.js 16 proxy.ts | 16.2.1 (already in use) | Role-based redirect in proxy layer | Official Next.js 16 pattern (confirmed in bundled docs) |
| useSession from next-auth/react | already in use | Read role/tenant from JWT in client components | Already established pattern |

Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| cryptography (Fernet) | already in use (>=42.0.0) | Alternative token signing approach | Not recommended here: HMAC+secrets is simpler for short-lived invite tokens; Fernet used for BYO key encryption |
| pydantic[email] | already in use (>=2.12.0) | Email format validation on invite request | Already in shared pyproject.toml |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| Python stdlib smtplib | aiosmtplib | Async SMTP, but adds a dependency. smtplib works fine when called from a Celery task (sync context). Use aiosmtplib only if sending directly from an async FastAPI route without Celery. |
| HMAC token in URL | JWT invite token | JWT adds sub-second crypto overhead and a library; HMAC+secrets is more transparent. Both are safe for 48h tokens. |
| Join table user_tenant_roles | tenant_ids: list on portal_users | PostgreSQL array on the user row is simpler but cannot store per-tenant role without extra complexity. Join table is the correct relational approach. |

Installation: No new Python packages required — all needed libraries are already in packages/shared/pyproject.toml or Python stdlib.

Portal: no new npm packages required.


Architecture Patterns

New files needed:

packages/
├── shared/
│   └── shared/
│       ├── models/
│       │   └── auth.py               # Add role enum, UserTenantRole model, Invitation model
│       └── api/
│           ├── portal.py             # Add RBAC guards to all existing endpoints
│           ├── rbac.py               # NEW: FastAPI Depends() guards (require_platform_admin, etc.)
│           └── invitations.py        # NEW: Invite CRUD + accept endpoints
│
migrations/
│   └── versions/
│       └── 006_rbac_roles.py        # NEW: role enum + user_tenant_roles + portal_invitations
│
packages/portal/
├── lib/
│   ├── auth.ts                       # Update JWT callbacks: role + tenant_ids + active_tenant_id
│   └── auth-types.ts                 # NEW: TypeScript types for role, augmented session
├── proxy.ts                          # Update: role-based redirects
├── components/
│   ├── nav.tsx                       # Update: role-filtered nav items
│   ├── tenant-switcher.tsx           # NEW: dropdown for multi-tenant users
│   └── impersonation-banner.tsx      # NEW: visible banner when impersonating
└── app/(dashboard)/
    ├── users/                        # NEW: per-tenant user management page
    │   └── page.tsx
    ├── admin/                        # NEW: platform admin — global users, all tenants
    │   └── users/
    │       └── page.tsx
    └── invite/                       # NEW: public invite acceptance page
        └── [token]/
            └── page.tsx

Pattern 1: FastAPI Role-Checking Dependency

What: A dependency factory that reads the X-Portal-User-Id, X-Portal-User-Role, and X-Portal-Tenant-Id headers injected by the Next.js proxy, then validates the caller's permission.

When to use: On every portal API endpoint that has role requirements.

The existing portal calls FastAPI with no auth headers — Phase 4 must add a mechanism to pass the authenticated user's role and tenant context from the JWT to FastAPI. Two established approaches:

Option A (recommended): Next.js proxy forwards role headers

The Next.js API routes (or Server Actions) extract the JWT session via auth() and add X-Portal-User-Id, X-Portal-User-Role, and X-Portal-Tenant-Id headers to requests forwarded to FastAPI. FastAPI reads these trusted headers and must accept them only from the internal network / trusted origin.

Option B: FastAPI validates the Auth.js JWT directly

FastAPI re-validates the Auth.js JWT using the shared AUTH_SECRET. This is more secure in theory but adds python-jose or PyJWT as a new dependency and couples FastAPI to Auth.js token format.

Recommendation: Option A — consistent with how the existing portal API proxy works, simpler, and the internal network boundary already provides the trust layer. This is the same pattern used by the existing billing/channel endpoints.

# Source: FastAPI dependency injection pattern (established in codebase)
# packages/shared/shared/api/rbac.py

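import uuid
from typing import Annotated

from fastapi import Depends, Header, HTTPException, status
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

from shared.models.auth import UserTenantRole  # model added in this phase
# get_session is the codebase's existing async-session dependency;
# import it from wherever the other portal routers already do.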


class PortalCaller:
    """Extracted caller context from trusted proxy headers."""
    def __init__(self, user_id: uuid.UUID, role: str, tenant_id: uuid.UUID | None = None):
        self.user_id = user_id
        self.role = role
        self.tenant_id = tenant_id  # None for platform_admin calls not scoped to a tenant


async def get_portal_caller(
    x_portal_user_id: Annotated[str, Header()],
    x_portal_user_role: Annotated[str, Header()],
    x_portal_tenant_id: Annotated[str | None, Header()] = None,
) -> PortalCaller:
    try:
        user_id = uuid.UUID(x_portal_user_id)
        # Parse the tenant header inside the same try so a malformed value
        # yields 401 instead of an unhandled ValueError (500).
        tenant_id = uuid.UUID(x_portal_tenant_id) if x_portal_tenant_id else None
    except ValueError:
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid caller identity")
    return PortalCaller(user_id=user_id, role=x_portal_user_role, tenant_id=tenant_id)


async def require_platform_admin(caller: Annotated[PortalCaller, Depends(get_portal_caller)]) -> PortalCaller:
    if caller.role != "platform_admin":
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Platform admin required")
    return caller


async def require_tenant_admin(
    tenant_id: uuid.UUID,  # from path param
    caller: Annotated[PortalCaller, Depends(get_portal_caller)],
    session: AsyncSession = Depends(get_session),
) -> PortalCaller:
    if caller.role == "platform_admin":
        return caller  # platform_admin bypasses tenant check
    if caller.role != "customer_admin":
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Admin role required")
    # Verify caller has admin role in this specific tenant
    membership = await session.execute(
        select(UserTenantRole).where(
            UserTenantRole.user_id == caller.user_id,
            UserTenantRole.tenant_id == tenant_id,
            UserTenantRole.role == "customer_admin",
        )
    )
    if membership.scalar_one_or_none() is None:
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Not a member of this tenant")
    return caller
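
The guard ordering above, combined with the operator rules from RBAC-03, can be summarized as a pure decision function. This is a minimal sketch for reasoning and unit testing only: `is_allowed`, `READ_ONLY_METHODS`, and the `/test-message` path suffix are hypothetical names, not existing codebase identifiers.

```python
from typing import Optional

READ_ONLY_METHODS = {"GET", "HEAD", "OPTIONS"}
# Hypothetical path suffix for the operator-allowed test-message endpoint.
TEST_MESSAGE_SUFFIX = "/test-message"


def is_allowed(
    caller_role: str,
    membership_role: Optional[str],
    method: str,
    path: str,
) -> bool:
    """Decide whether a tenant-scoped request passes the member-level check.

    caller_role: role claim forwarded by the proxy.
    membership_role: the caller's role in the target tenant (from
        user_tenant_roles), or None when no membership row exists.
    """
    # Platform admins bypass tenant membership entirely; check this first.
    if caller_role == "platform_admin":
        return True
    # Everyone else must be a member of the target tenant.
    if membership_role is None:
        return False
    if membership_role == "customer_admin":
        return True
    if membership_role == "customer_operator":
        # Operators are read-only, except for sending test messages.
        if method.upper() in READ_ONLY_METHODS:
            return True
        return method.upper() == "POST" and path.endswith(TEST_MESSAGE_SUFFIX)
    return False
```

Admin-only resources such as billing or API keys would still sit behind require_tenant_admin; this function only models the member-level guard.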

Pattern 2: Auth.js v5 JWT with Role + Tenant Claims

What: Extend the existing JWT callback to store role, tenant_ids, and active_tenant_id.

When to use: Once on login, and when tenant switcher changes active tenant.

// Source: Auth.js v5 JWT callback pattern — extends existing lib/auth.ts
// The authorize() response from FastAPI /auth/verify now returns role + tenant_ids

async jwt({ token, user }) {
  if (user) {
    const u = user as AuthVerifyResponse;
    token.role = u.role;                     // "platform_admin" | "customer_admin" | "customer_operator"
    token.tenant_ids = u.tenant_ids;         // string[] — all tenants this user belongs to
    token.active_tenant_id = u.tenant_ids[0] ?? null;  // default to first tenant
  }
  return token;
},
async session({ session, token }) {
  session.user.id = token.sub ?? "";
  session.user.role = token.role as string;
  session.user.tenant_ids = token.tenant_ids as string[];
  session.user.active_tenant_id = token.active_tenant_id as string | null;
  return session;
},

Pattern 3: Next.js 16 Proxy Role-Based Redirect

What: Extend proxy.ts to redirect unauthorized paths based on JWT role claim.

When to use: For silent redirects when an operator navigates to a restricted page.

Per the official Next.js 16 docs bundled in this repo (node_modules/next/dist/docs/01-app/02-guides/authentication.md): proxy should do optimistic checks only — read role from the JWT cookie without DB queries. Secure enforcement is FastAPI's responsibility.

The redirect uses NextResponse.redirect, which proxy.ts already uses.

// Extend existing proxy.ts

const PLATFORM_ADMIN_ONLY = ["/admin", "/tenants"];
const CUSTOMER_ADMIN_ONLY = ["/billing", "/settings/api-keys", "/users"];
const OPERATOR_HOME = "/agents";
const CUSTOMER_ADMIN_HOME = "/dashboard";
const PLATFORM_ADMIN_HOME = "/dashboard";

// After session check, add role-based redirect:
const role = (session?.user as { role?: string })?.role;

if (role === "customer_operator") {
  const isRestricted = [...PLATFORM_ADMIN_ONLY, ...CUSTOMER_ADMIN_ONLY].some(
    (path) => pathname.startsWith(path)
  );
  if (isRestricted) {
    return NextResponse.redirect(new URL(OPERATOR_HOME, request.url));
  }
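}

// Customer admins are redirected away from platform-admin-only paths.
if (
  role === "customer_admin" &&
  PLATFORM_ADMIN_ONLY.some((path) => pathname.startsWith(path))
) {
  return NextResponse.redirect(new URL(CUSTOMER_ADMIN_HOME, request.url));
}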

Pattern 4: Invite Token Generation and Validation

What: HMAC-SHA256 signed, URL-safe token with 48-hour expiry embedded in the invite URL.

When to use: Creating and accepting invite links.

# Source: Python stdlib hmac + secrets (same approach used for WhatsApp HMAC in Phase 2)

import base64
import hashlib
import hmac
import time

INVITE_SECRET = settings.invite_secret  # From .env, 32+ random bytes
INVITE_TTL_SECONDS = 48 * 3600


def generate_invite_token(invitation_id: str) -> str:
    """Generate a URL-safe HMAC-signed token embedding invite ID + timestamp."""
    timestamp = str(int(time.time()))
    payload = f"{invitation_id}:{timestamp}"
    sig = hmac.new(
        INVITE_SECRET.encode(),
        payload.encode(),
        hashlib.sha256,
    ).hexdigest()
    # Encode as base64url (padding stripped) for URL safety
    raw = f"{payload}:{sig}"
    return base64.urlsafe_b64encode(raw.encode()).decode().rstrip("=")


def validate_invite_token(token: str) -> str:
    """Returns invitation_id if valid, raises ValueError if expired or tampered."""
    # Restore the base64 padding stripped at generation time
    padded = token + "=" * (-len(token) % 4)
    raw = base64.urlsafe_b64decode(padded).decode()
    invitation_id, timestamp, provided_sig = raw.rsplit(":", 2)

    # Recompute the signature over the received payload
    expected_payload = f"{invitation_id}:{timestamp}"
    expected_sig = hmac.new(
        INVITE_SECRET.encode(),
        expected_payload.encode(),
        hashlib.sha256,
    ).hexdigest()
    # Constant-time comparison
    if not hmac.compare_digest(expected_sig, provided_sig):
        raise ValueError("Invalid token signature")

    if int(time.time()) - int(timestamp) > INVITE_TTL_SECONDS:
        raise ValueError("Invite token expired")

    return invitation_id
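
To exercise the signing scheme without the application settings object, the same technique can be written with the secret and the clock passed in explicitly. This is a sketch under stated assumptions: `make_token`, `check_token`, and `token_hash` are hypothetical helper names, and the demo secret is a placeholder. `token_hash` illustrates one way to derive a value for the portal_invitations.token_hash column (SHA-256 of the raw token).

```python
import base64
import hashlib
import hmac
import time
import uuid

TTL_SECONDS = 48 * 3600


def _sign(secret: bytes, payload: str) -> str:
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()


def make_token(secret: bytes, invitation_id: str, now: int) -> str:
    """Build '<id>:<issued_at>:<sig>' and base64url-encode it without padding."""
    payload = f"{invitation_id}:{now}"
    raw = f"{payload}:{_sign(secret, payload)}"
    return base64.urlsafe_b64encode(raw.encode()).decode().rstrip("=")


def check_token(secret: bytes, token: str, now: int) -> str:
    """Return the invitation ID, or raise ValueError if tampered or expired."""
    padded = token + "=" * (-len(token) % 4)
    raw = base64.urlsafe_b64decode(padded).decode()
    invitation_id, issued_at, sig = raw.rsplit(":", 2)
    expected = _sign(secret, f"{invitation_id}:{issued_at}")
    if not hmac.compare_digest(expected, sig):
        raise ValueError("Invalid token signature")
    if now - int(issued_at) > TTL_SECONDS:
        raise ValueError("Invite token expired")
    return invitation_id


def token_hash(token: str) -> str:
    """SHA-256 hex digest of the raw token, for storage instead of the token."""
    return hashlib.sha256(token.encode()).hexdigest()


secret = b"demo-secret-not-for-production"  # placeholder value
inv_id = str(uuid.uuid4())
issued = int(time.time())
token = make_token(secret, inv_id, issued)
```

Passing `now` as a parameter keeps the expiry branch testable without patching time.time().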

Pattern 5: SMTP Email via Python stdlib

What: Send invite emails using Python's smtplib + email.mime, called from a Celery task. This is a sync context, consistent with the established codebase pattern that all Celery tasks are sync def.

When to use: Sending invite emails.

# Source: Python stdlib email + smtplib

import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

def send_invite_email(
    to_email: str,
    invitee_name: str,
    tenant_name: str,
    invite_url: str,
    smtp_host: str,
    smtp_port: int,
    smtp_username: str,
    smtp_password: str,
    from_email: str,
) -> None:
    """Sync function — call from Celery task, not async FastAPI handler."""
    msg = MIMEMultipart("alternative")
    msg["Subject"] = f"You've been invited to join {tenant_name} on Konstruct"
    msg["From"] = from_email
    msg["To"] = to_email

    text_body = f"""
Hi {invitee_name},

You've been invited to join {tenant_name} on Konstruct.

Accept your invitation and set up your account here:
{invite_url}

This link expires in 48 hours.

— The Konstruct Team
"""
    msg.attach(MIMEText(text_body, "plain"))

    with smtplib.SMTP(smtp_host, smtp_port) as server:
        server.starttls()
        server.login(smtp_username, smtp_password)
        server.sendmail(from_email, to_email, msg.as_string())
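
The function above creates a multipart/alternative container but attaches only a plain-text part. If an HTML part is wanted, one option is the stdlib `email.message.EmailMessage` API, sketched below. `build_invite_message` is a hypothetical helper and the HTML markup is illustrative, not the project's template.

```python
from email.message import EmailMessage
from html import escape


def build_invite_message(
    to_email: str,
    invitee_name: str,
    tenant_name: str,
    invite_url: str,
    from_email: str,
) -> EmailMessage:
    """Build a multipart/alternative invite with plain-text and HTML bodies."""
    msg = EmailMessage()
    msg["Subject"] = f"You've been invited to join {tenant_name} on Konstruct"
    msg["From"] = from_email
    msg["To"] = to_email
    # Plain-text body first; mail clients fall back to this part.
    msg.set_content(
        f"Hi {invitee_name},\n\n"
        f"You've been invited to join {tenant_name} on Konstruct.\n\n"
        f"Accept your invitation here:\n{invite_url}\n\n"
        "This link expires in 48 hours.\n"
    )
    # Adding an alternative converts the message to multipart/alternative.
    # User-supplied values are HTML-escaped before interpolation.
    msg.add_alternative(
        f"<p>Hi {escape(invitee_name)},</p>"
        f"<p>You've been invited to join {escape(tenant_name)} on Konstruct.</p>"
        f'<p><a href="{escape(invite_url)}">Accept your invitation</a></p>'
        "<p>This link expires in 48 hours.</p>",
        subtype="html",
    )
    return msg
```

A message built this way can be handed to `smtplib.SMTP.send_message(msg)` in place of `sendmail(...)` inside the same `with smtplib.SMTP(...)` block.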

Anti-Patterns to Avoid

  • Checking role only in the UI: Nav hiding is cosmetic. Every API endpoint that mutates data must also check role via FastAPI Depends(). The decision text explicitly states "defense in depth, not just hidden UI."
  • Using RLS for RBAC enforcement: RLS enforces tenant isolation (which tenant's data). RBAC enforces what the user can DO within a tenant. These are separate layers — RLS is additive protection, not a substitute for endpoint guards.
  • Storing role only as a single column on portal_users: Customer admins can belong to multiple tenants with potentially different roles per tenant (admin in tenant A, operator in tenant B). A single column cannot express that, so the join table user_tenant_roles is required.
  • Database lookup in proxy.ts: The official Next.js 16 docs explicitly warn: proxy should only read from cookies, not make DB calls. The proxy layer is for optimistic redirects only.
  • Skipping impersonation audit logging: Impersonated actions must emit AuditEvent rows with action_type='impersonation' and the platform admin's user_id in event_metadata. This is a locked decision.
  • Async def for Celery email task: The codebase has a hard constraint: all Celery tasks are sync def with asyncio.run(). The SMTP send function must follow this pattern.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| HMAC token timing-safe comparison | Custom string compare | hmac.compare_digest() | Prevents timing attacks; already used in WhatsApp signature verification (Phase 2) |
| Password hashing | Custom hash | bcrypt (already in use) | bcrypt already used for all PortalUser passwords |
| Email format validation | Regex | pydantic[email] (already in use) | Already declared in shared pyproject.toml |
| JWT claims augmentation | Custom token issuer | Auth.js v5 JWT callbacks (already in use) | Cleanest extension point for existing JWT strategy |
| Role enum validation | Custom if/else | PostgreSQL CHECK constraint + Python enum.Enum | DB-level constraint catches bugs at persistence layer |

Key insight: No new dependencies needed. All building blocks (HMAC, bcrypt, smtplib, FastAPI Depends, SQLAlchemy enum, Auth.js JWT callbacks) are already present in the codebase.


Common Pitfalls

Pitfall 1: platform_admin Bypassing Tenant Scope Must Be Explicit

What goes wrong: A require_tenant_admin dependency that checks tenant membership will block platform admins from cross-tenant operations unless the code explicitly short-circuits for role == "platform_admin".

Why it happens: The membership check looks up user_tenant_roles, and a platform admin may not have rows in that table for most tenants.

How to avoid: Every require_tenant_* dependency must have if caller.role == "platform_admin": return caller as the first check.

Warning signs: Platform admin getting 403 on cross-tenant endpoints.

Pitfall 2: Auth.js v5 TypeScript Type Augmentation Required

What goes wrong: TypeScript errors when accessing session.user.role because the default Auth.js User and Session types don't include role or tenant_ids.

Why it happens: Auth.js v5 uses module augmentation for type extensions, not direct type overriding.

How to avoid: Create lib/auth-types.ts that extends Auth.js types:

// lib/auth-types.ts
declare module "next-auth" {
  interface User {
    role?: string;
    tenant_ids?: string[];
    active_tenant_id?: string | null;
  }
  interface Session {
    user: User & { id: string; role: string; tenant_ids: string[]; active_tenant_id: string | null };
  }
}
declare module "next-auth/jwt" {
  interface JWT {
    role?: string;
    tenant_ids?: string[];
    active_tenant_id?: string | null;
  }
}

Warning signs: TypeScript compilation errors on session.user.role in proxy.ts or nav components.

Pitfall 3: Invitation Token Expiry Check Must Be at Accept Time, Not Just Display Time

What goes wrong: Checking only invitation.expires_at < now() in the UI still allows a race where a valid-looking token is submitted after expiry.

Why it happens: Frontend-only expiry check is not authoritative.

How to avoid: The FastAPI /invitations/accept endpoint must re-validate the token timestamp and check portal_invitations.status == 'pending' in an atomic DB operation. Mark invitation as accepted in the same transaction as creating the user account.

Warning signs: Accepted invites still show in pending list; double-activation possible if link clicked twice.
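
The atomic accept step can be illustrated with a conditional UPDATE whose row count decides the outcome. The sketch below uses stdlib sqlite3 as a stand-in for PostgreSQL so it runs anywhere; `accept_invitation` and the simplified three-column table are assumptions for illustration, not the project's schema or endpoint.

```python
import sqlite3


def accept_invitation(conn: sqlite3.Connection, invitation_id: str, now: int) -> bool:
    """Mark a pending, unexpired invitation as accepted.

    Returns False if the invitation was already accepted, has expired,
    or does not exist.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE portal_invitations SET status = 'accepted' "
            "WHERE id = ? AND status = 'pending' AND expires_at > ?",
            (invitation_id, now),
        )
        if cur.rowcount != 1:
            return False
        # In the real endpoint, the user row would be created here,
        # inside the same transaction.
        return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE portal_invitations ("
    "id TEXT PRIMARY KEY, status TEXT NOT NULL, expires_at INTEGER NOT NULL)"
)
conn.execute("INSERT INTO portal_invitations VALUES ('inv-1', 'pending', 1000)")
conn.execute("INSERT INTO portal_invitations VALUES ('inv-2', 'pending', 1000)")
conn.commit()

first = accept_invitation(conn, "inv-1", now=500)    # valid and pending
second = accept_invitation(conn, "inv-1", now=500)   # link clicked twice
expired = accept_invitation(conn, "inv-2", now=2000) # past expires_at
```

Because the status and expiry conditions live in the WHERE clause, a second click or a late submission updates zero rows and is rejected without a separate read-then-write race.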

Pitfall 4: Celery Task for Email Must Not Use async def

What goes wrong: An async def Celery task calls smtplib (sync) or uses await in the task body, but Celery's worker does not run an event loop natively.

Why it happens: Developer instinct to make everything async in an async codebase.

How to avoid: Celery tasks are always sync def. If async DB access is needed inside the task, use asyncio.run() (established pattern from Phase 1; all existing Celery tasks do this).

Warning signs: RuntimeError: no running event loop in Celery worker logs.
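
The sync-wrapper shape described above looks roughly like this. It is a sketch only: the Celery decorator is omitted so the example runs standalone, and `load_invitation` and `send_invite_task` are hypothetical names standing in for the real async DB helper and task.

```python
import asyncio


async def load_invitation(invitation_id: str) -> dict:
    """Stand-in for an async DB lookup (hypothetical helper)."""
    await asyncio.sleep(0)
    return {"id": invitation_id, "email": "invitee@example.com"}


def send_invite_task(invitation_id: str) -> str:
    """Sync entry point, shaped like a Celery task body.

    Async work is driven to completion with asyncio.run(); the blocking
    SMTP call then happens in plain sync code.
    """
    invitation = asyncio.run(load_invitation(invitation_id))
    # send_invite_email(...) would be called here (sync smtplib).
    return invitation["email"]
```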

Pitfall 5: JWT Token Size Limit

What goes wrong: Adding tenant_ids (list of UUIDs) to the JWT makes the cookie exceed browser limits (~4KB) for users with many tenants.

Why it happens: JWT cookies are bounded by HTTP cookie size.

How to avoid: Keep the claims minimal: store active_tenant_id as a single UUID and tenant_ids as a compact array of UUID strings, never full tenant objects. Realistically, v1 users will have 1-3 tenants; this is a precaution, not an immediate crisis.

Warning signs: Auth.js session errors for users with >20 tenant memberships.
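
A rough way to see how claim size grows with tenant count is to serialize the claims and measure the base64url length. This is only an estimate under stated assumptions: `estimate_jwt_payload_bytes` is a hypothetical helper, and it ignores the token header, signature or encryption overhead, and any other claims Auth.js adds, so real cookies will be larger.

```python
import base64
import json
import uuid


def estimate_jwt_payload_bytes(tenant_count: int) -> int:
    """Approximate base64url size of the RBAC claims for N tenant memberships."""
    tenant_ids = [str(uuid.uuid4()) for _ in range(tenant_count)]
    claims = {
        "sub": str(uuid.uuid4()),
        "role": "customer_admin",
        "tenant_ids": tenant_ids,
        "active_tenant_id": tenant_ids[0] if tenant_ids else None,
    }
    # Compact separators mirror how claims are typically serialized.
    raw = json.dumps(claims, separators=(",", ":")).encode()
    return len(base64.urlsafe_b64encode(raw))


for n in (1, 3, 20, 50):
    print(n, estimate_jwt_payload_bytes(n))
```

Each additional tenant adds one quoted 36-character UUID plus a separator before encoding, so growth is linear in the number of memberships.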

Pitfall 6: portal_users Table Has No RLS — User Enumeration Risk

What goes wrong: The /users endpoint for global user management (platform admin only) queries portal_users without RLS. Without the require_platform_admin guard, any authenticated user could enumerate all users.

Why it happens: portal_users intentionally has no RLS (noted in the existing model comment: "RLS is NOT applied to this table"). Authorization is application-layer only.

How to avoid: Every endpoint that touches portal_users without a tenant filter MUST use require_platform_admin. Per-tenant user management endpoints use require_tenant_admin + filter by user_tenant_roles.tenant_id.

Warning signs: Customer admin able to see users from other tenants.

Pitfall 7: is_admin → role Migration Must Handle Existing Data

What goes wrong: Alembic migration drops is_admin and adds role enum without migrating existing rows, so existing platform admins lose access.

Why it happens: Schema-only migration without data backfill.

How to avoid: Migration must: (1) add the role column as nullable, (2) backfill rows where is_admin = true to role = 'platform_admin' and all others to 'customer_admin', (3) set role NOT NULL and add the CHECK constraint, (4) then drop is_admin. Use a single migration step; do not split across multiple migrations.

Warning signs: Existing users cannot log in after migration.


Code Examples

Database Schema: New Tables

# Source: SQLAlchemy 2.0 ORM pattern — established in packages/shared/shared/models/

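import enum
import uuid
from datetime import datetime

from sqlalchemy import DateTime, ForeignKey, String, UniqueConstraint, func
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import Mapped, mapped_column

# Base is the project's existing declarative base, defined alongside the other models.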

class UserRole(str, enum.Enum):
    PLATFORM_ADMIN = "platform_admin"
    CUSTOMER_ADMIN = "customer_admin"
    CUSTOMER_OPERATOR = "customer_operator"


class UserTenantRole(Base):
    """
    Associates a portal user with a tenant and their role in that tenant.
    A user can have different roles in different tenants (agency use case).
    platform_admin users do not require rows here — they bypass tenant checks.
    """
    __tablename__ = "user_tenant_roles"
    __table_args__ = (
        UniqueConstraint("user_id", "tenant_id", name="uq_user_tenant"),
    )

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    user_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("portal_users.id", ondelete="CASCADE"), nullable=False, index=True)
    tenant_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("tenants.id", ondelete="CASCADE"), nullable=False, index=True)
    role: Mapped[str] = mapped_column(String(50), nullable=False)  # TEXT with CHECK constraint — avoids SQLAlchemy Enum DDL issues (per Phase 1 decision)
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False, server_default=func.now())


class PortalInvitation(Base):
    """
    Pending email invitations. Token is HMAC-signed and expires after 48 hours.
    Status: 'pending' | 'accepted' | 'expired'
    """
    __tablename__ = "portal_invitations"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    email: Mapped[str] = mapped_column(String(255), nullable=False, index=True)
    name: Mapped[str] = mapped_column(String(255), nullable=False)
    tenant_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("tenants.id", ondelete="CASCADE"), nullable=False, index=True)
    role: Mapped[str] = mapped_column(String(50), nullable=False)  # customer_admin | customer_operator
    invited_by: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("portal_users.id"), nullable=False)
    token_hash: Mapped[str] = mapped_column(String(255), nullable=False, unique=True)  # SHA-256 hash of raw token
    status: Mapped[str] = mapped_column(String(20), nullable=False, default="pending")
    expires_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False)
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False, server_default=func.now())

portal_users Migration: is_admin → role

# Source: Alembic migration pattern, established in migrations/versions/

import sqlalchemy as sa
from alembic import op


def upgrade() -> None:
    # 1. Add role column (nullable initially to allow backfill)
    op.add_column("portal_users", sa.Column("role", sa.String(50), nullable=True))

    # 2. Backfill: existing is_admin=True → platform_admin, others → customer_admin
    op.execute("""
        UPDATE portal_users
        SET role = CASE WHEN is_admin = TRUE THEN 'platform_admin' ELSE 'customer_admin' END
    """)

    # 3. Add NOT NULL constraint now that all rows have a value
    op.alter_column("portal_users", "role", nullable=False)

    # 4. Add CHECK constraint (TEXT enum pattern — avoids SQLAlchemy Enum DDL issues per Phase 1 decision)
    op.execute("""
        ALTER TABLE portal_users
        ADD CONSTRAINT ck_portal_users_role
        CHECK (role IN ('platform_admin', 'customer_admin', 'customer_operator'))
    """)

    # 5. Drop is_admin column
    op.drop_column("portal_users", "is_admin")
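
Before running the migration against real data, the backfill mapping in steps 1 and 2 can be sanity-checked in isolation. The sketch below replays those two steps on an in-memory SQLite table as a stand-in for PostgreSQL; the two-row fixture and the simplified table definition are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE portal_users ("
    "id INTEGER PRIMARY KEY, email TEXT NOT NULL, is_admin INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO portal_users (email, is_admin) VALUES (?, ?)",
    [("root@example.com", 1), ("owner@example.com", 0)],
)

# Step 1: add the role column as nullable so existing rows stay valid.
conn.execute("ALTER TABLE portal_users ADD COLUMN role TEXT")

# Step 2: backfill every row from the old boolean flag.
conn.execute(
    "UPDATE portal_users "
    "SET role = CASE WHEN is_admin = 1 THEN 'platform_admin' ELSE 'customer_admin' END"
)

# After the backfill no row may be left without a role; otherwise the
# NOT NULL step of the real migration would fail.
roles = dict(conn.execute("SELECT email, role FROM portal_users"))
missing = conn.execute(
    "SELECT COUNT(*) FROM portal_users WHERE role IS NULL"
).fetchone()[0]
```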

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| is_admin: bool on PortalUser | role: str enum + user_tenant_roles join table | Phase 4 | Enables multi-tenant membership and typed roles |
| No API authorization | FastAPI Depends(require_*) guards on every endpoint | Phase 4 | All portal endpoints get 403 enforcement |
| No invite flow; direct registration via /auth/register | Invite-only user creation; /auth/register endpoint deprecated/removed | Phase 4 | All users created through auditable invite flow |
| is_admin in JWT | role + tenant_ids + active_tenant_id in JWT | Phase 4 | Proxy can redirect by role; tenant switcher uses active_tenant_id |

Deprecated/outdated after Phase 4:

  • portal_users.is_admin: Replaced by portal_users.role + user_tenant_roles.
  • /api/portal/auth/register endpoint: Replaced by invite-only flow. Should be removed or locked to platform_admin only with an immediate deprecation comment.
  • AuthVerifyResponse.is_admin field: Replaced by role + tenant_ids + active_tenant_id.

Open Questions

  1. Tenant switcher: re-issue JWT or use URL state?

    • What we know: The locked decision says "switch without logging out", and the summary adds that the switch should not need a full page reload. Auth.js v5 JWT strategy means the token is a signed cookie.
    • What's unclear: Auth.js v5 does not natively support updating a token mid-session without a new sign-in. The update() function from useSession() can trigger a JWT refresh callback if implemented.
    • Recommendation: Use Auth.js v5 update() session method which triggers the jwt callback with trigger: "update" — pass { active_tenant_id: newTenantId } as the update payload. This is the supported pattern for mid-session JWT updates in Auth.js v5.
  2. /auth/register endpoint — remove or gate?

    • What we know: All user creation goes through invites per locked decision. The existing /auth/register endpoint allows direct account creation.
    • What's unclear: Whether there's a seeding/bootstrap use case for initial platform admin creation.
    • Recommendation: Keep the endpoint but gate it behind require_platform_admin with a deprecation notice. Seed the initial platform admin via a one-time script or environment variable bootstrap (not via the portal).
  3. SMTP configuration approach

    • What we know: SMTP direct is locked; configuration approach is discretionary.
    • Recommendation: Store SMTP config in .env / settings (same pattern as all other secrets — SMTP_HOST, SMTP_PORT, SMTP_USERNAME, SMTP_PASSWORD, SMTP_FROM_EMAIL). No portal settings UI needed for v1.
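
The SMTP recommendation above can be sketched with the standard library alone. This is a minimal illustration, not the project's settings module: the `SmtpSettings` class, the `build_invite_email` and `send_email` helpers, the port default of 587, the use of STARTTLS, and the message wording are all assumptions. Only the five environment variable names and the 48-hour expiry come from this document.

```python
import smtplib
from dataclasses import dataclass
from email.mime.text import MIMEText
from typing import Mapping


@dataclass(frozen=True)
class SmtpSettings:
    host: str
    port: int
    username: str
    password: str
    from_email: str

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "SmtpSettings":
        """Read SMTP config from an environment-like mapping (e.g. os.environ)."""
        return cls(
            host=env["SMTP_HOST"],
            port=int(env.get("SMTP_PORT", "587")),  # assumption: default to 587
            username=env["SMTP_USERNAME"],
            password=env["SMTP_PASSWORD"],
            from_email=env["SMTP_FROM_EMAIL"],
        )


def build_invite_email(settings: SmtpSettings, to_email: str, invite_url: str) -> MIMEText:
    """Build a plain-text invite message; the wording is a placeholder."""
    body = (
        "You have been invited to the portal.\n\n"
        f"Activate your account: {invite_url}\n\n"
        "This link expires in 48 hours."
    )
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = "You have been invited"
    msg["From"] = settings.from_email
    msg["To"] = to_email
    return msg


def send_email(settings: SmtpSettings, msg: MIMEText) -> None:
    """Send synchronously; assumes the server supports STARTTLS."""
    with smtplib.SMTP(settings.host, settings.port, timeout=10) as server:
        server.starttls()
        server.login(settings.username, settings.password)
        server.send_message(msg)
```

Keeping message construction separate from sending lets the unit tests assert on headers and body without opening a network connection.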

Validation Architecture

Test Framework

| Property | Value |
|----------|-------|
| Framework | pytest 8.3+ with pytest-asyncio 0.25+ |
| Config file | pyproject.toml ([tool.pytest.ini_options]) |
| Quick run command | pytest tests/unit -x |
| Full suite command | pytest tests/ -x |

Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|--------------|
| RBAC-01 | Platform admin gets 200 on cross-tenant endpoints; non-admin gets 403 | unit | pytest tests/unit/test_rbac_guards.py -x | Wave 0 |
| RBAC-02 | Customer admin gets 200 on own-tenant endpoints; gets 403 on other tenants | unit | pytest tests/unit/test_rbac_guards.py -x | Wave 0 |
| RBAC-03 | Customer operator gets 403 on mutating endpoints; gets 200 on GET endpoints | unit | pytest tests/unit/test_rbac_guards.py -x | Wave 0 |
| RBAC-04 | Invite creation, token generation, token validation (TTL + HMAC), accept flow | unit | pytest tests/unit/test_invitations.py -x | Wave 0 |
| RBAC-04 | Full invite→accept integration: invite created, email triggered, user activated | integration | pytest tests/integration/test_invite_flow.py -x | Wave 0 |
| RBAC-05 | JWT contains role + tenant_ids after verify; active_tenant_id present | unit | pytest tests/unit/test_portal_auth.py -x | Wave 0 (extend existing test_portal_tenants.py pattern) |
| RBAC-06 | Every portal endpoint returns 403 without role headers; returns 200 with correct role | integration | pytest tests/integration/test_portal_rbac.py -x | Wave 0 |
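
The RBAC-01 through RBAC-03 rows amount to a small decision matrix, which can be written down as a pure function plus a case table. The sketch below is a hypothetical model for reasoning about expected status codes, not the FastAPI guards themselves. The names `expected_status` and `CASES` are illustrative, and the sketch does not model the operator test-message exception from the locked decisions.

```python
from typing import Iterable

MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def expected_status(role: str, tenant_ids: Iterable[str], target_tenant: str, method: str) -> int:
    """Return the status code the role matrix predicts for a tenant-scoped request."""
    if role == "platform_admin":
        return 200  # RBAC-01: full cross-tenant access
    if target_tenant not in set(tenant_ids):
        return 403  # RBAC-02: no access outside own tenants
    if role == "customer_admin":
        return 200
    if role == "customer_operator":
        # RBAC-03: read-only within own tenants
        return 403 if method.upper() in MUTATING_METHODS else 200
    return 403  # unknown roles are denied


# (role, tenant_ids, target_tenant, method, expected)
CASES = [
    ("platform_admin", [], "t2", "DELETE", 200),
    ("customer_admin", ["t1"], "t1", "POST", 200),
    ("customer_admin", ["t1"], "t2", "GET", 403),
    ("customer_operator", ["t1"], "t1", "GET", 200),
    ("customer_operator", ["t1"], "t1", "PUT", 403),
    ("customer_operator", ["t1"], "t2", "GET", 403),
]
```

A case table like this maps naturally onto `pytest.mark.parametrize` in tests/unit/test_rbac_guards.py.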

Sampling Rate

  • Per task commit: pytest tests/unit -x
  • Per wave merge: pytest tests/ -x
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

  • tests/unit/test_rbac_guards.py — unit tests for FastAPI require_platform_admin, require_tenant_admin, require_tenant_member dependencies
  • tests/unit/test_invitations.py — unit tests for HMAC token generation, expiry validation, token tampering detection
  • tests/integration/test_invite_flow.py — end-to-end invite creation → email mock → accept → login
  • tests/integration/test_portal_rbac.py — covers RBAC-06: all portal endpoints tested with correct/incorrect role headers
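
The three behaviors listed for test_invitations.py (HMAC token generation, expiry validation, tampering detection) can be exercised against a token helper along these lines. This is a sketch under stated assumptions: the token layout `invite_id.issued_at.signature`, the SHA-256 digest, and the function names are hypothetical and may differ from the project's actual invitation code. Only the 48-hour window comes from the locked decisions.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

INVITE_TTL_SECONDS = 48 * 60 * 60  # 48-hour validity window


def _sign(secret: bytes, payload: bytes) -> str:
    """Return a URL-safe, unpadded HMAC-SHA256 signature of the payload."""
    digest = hmac.new(secret, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def make_invite_token(secret: bytes, invite_id: str, now: Optional[float] = None) -> str:
    """Build a token of the assumed form invite_id.issued_at.signature."""
    issued_at = int(time.time() if now is None else now)
    payload = f"{invite_id}.{issued_at}"
    return f"{payload}.{_sign(secret, payload.encode())}"


def verify_invite_token(secret: bytes, token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the invite ID if the token is authentic and unexpired, else None."""
    try:
        invite_id, issued_at_str, signature = token.rsplit(".", 2)
        issued_at = int(issued_at_str)
    except ValueError:
        return None  # malformed token
    expected = _sign(secret, f"{invite_id}.{issued_at_str}".encode())
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    current = time.time() if now is None else now
    if current - issued_at > INVITE_TTL_SECONDS:
        return None  # expired
    return invite_id
```

Passing `now` explicitly keeps the expiry tests deterministic without patching the clock.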

Sources

Primary (HIGH confidence)

  • Official Next.js 16 docs (bundled): packages/portal/node_modules/next/dist/docs/01-app/02-guides/authentication.md — proxy-layer auth pattern, optimistic check guidance, Data Access Layer recommendation
  • Official Next.js 16 docs (bundled): packages/portal/node_modules/next/dist/docs/01-app/02-guides/redirecting.md — NextResponse.redirect in proxy.ts
  • Codebase review: packages/shared/shared/models/auth.py — current PortalUser schema (is_admin flag)
  • Codebase review: packages/portal/lib/auth.ts — Auth.js v5 JWT callbacks (existing pattern to extend)
  • Codebase review: packages/shared/shared/api/portal.py — all existing endpoints needing guards
  • Codebase review: packages/portal/proxy.ts — proxy.ts structure to extend
  • Codebase review: migrations/versions/001_initial_schema.py — TEXT+CHECK pattern for enum columns (Phase 1 decision)
  • Codebase review: packages/shared/shared/models/audit.py — AuditEvent model for impersonation logging
  • Codebase review: .planning/STATE.md — critical architecture decisions from all prior phases

Secondary (MEDIUM confidence)

  • Python stdlib smtplib and email.mime documentation — no version dependency, stable since Python 3.x
  • Auth.js v5 update() session method — documented in Auth.js v5 beta docs; consistent with JWT callback trigger: "update" pattern

Tertiary (LOW confidence)

  • Auth.js v5 module augmentation TypeScript pattern — inferred from Auth.js v5 docs and TypeScript convention; confirmed functional in existing portal TypeScript setup

Metadata

Confidence breakdown:

  • Standard stack: HIGH — all core dependencies already in codebase; no new libraries introduced
  • Architecture patterns: HIGH — FastAPI Depends() and Auth.js JWT callbacks are established patterns; schema migration pattern confirmed from prior phases
  • Pitfalls: HIGH — directly derived from prior phase decisions logged in STATE.md (TEXT+CHECK for enums, sync Celery tasks, portal_users has no RLS)

Research date: 2026-03-24

Valid until: 2026-05-24 (stable stack — all libraries at fixed versions in pyproject.toml)