fix(02-agent-features): revise plans based on checker feedback

2026-03-23 14:32:20 -06:00
parent 7da5ffb92a
commit b2e86f1046
4 changed files with 319 additions and 68 deletions

@@ -97,45 +97,28 @@ Output: Tool registry + executor, 4 builtin tools, audit logger, DB migration, u
<tasks>
<task type="auto" tdd="true">
<name>Task 1: Audit logger, tool registry, executor, and built-in tools with tests</name>
<name>Task 1: Audit model, KB model, migration, and audit logger with tests</name>
<files>
packages/shared/shared/models/audit.py,
packages/shared/shared/models/kb.py,
packages/orchestrator/orchestrator/audit/__init__.py,
packages/orchestrator/orchestrator/audit/logger.py,
packages/orchestrator/orchestrator/tools/__init__.py,
packages/orchestrator/orchestrator/tools/registry.py,
packages/orchestrator/orchestrator/tools/executor.py,
packages/orchestrator/orchestrator/tools/builtins/__init__.py,
packages/orchestrator/orchestrator/tools/builtins/web_search.py,
packages/orchestrator/orchestrator/tools/builtins/kb_search.py,
packages/orchestrator/orchestrator/tools/builtins/http_request.py,
packages/orchestrator/orchestrator/tools/builtins/calendar_lookup.py,
migrations/versions/003_phase2_audit_kb.py,
tests/unit/test_tool_registry.py,
tests/unit/test_tool_executor.py,
tests/integration/test_audit.py
</files>
<behavior>
- ToolDefinition has name, description, parameters (JSON Schema), requires_confirmation, handler
- BUILTIN_TOOLS contains 4 tools: web_search, kb_search, http_request, calendar_lookup
- get_tools_for_agent filters BUILTIN_TOOLS by agent's configured tool list
- execute_tool validates args against tool's JSON schema before calling handler
- execute_tool with invalid args returns error string and logs the failure
- execute_tool with unknown tool name raises ValueError
- execute_tool with requires_confirmation=True returns a confirmation request instead of executing
- AuditEvent has id, tenant_id, agent_id, user_id, action_type, input_summary, output_summary, latency_ms, metadata (JSONB), created_at
- AuditLogger.log_tool_call writes a row to audit_events with action_type='tool_invocation'
- AuditLogger.log_llm_call writes a row with action_type='llm_call' including latency_ms
- AuditLogger.log_escalation writes a row with action_type='escalation'
- audit_events table rejects UPDATE and DELETE from konstruct_app role
- audit_events are tenant-scoped via RLS
- web_search tool calls Brave Search API and returns structured results
- kb_search tool queries pgvector knowledge base (conversation_embeddings or dedicated kb_chunks table)
- http_request tool makes outbound HTTP with timeout (30s), size cap (1MB), allowed methods (GET/POST/PUT/DELETE)
- calendar_lookup tool queries Google Calendar events.list for availability
- KBChunk model has id, tenant_id, document_id, content, embedding (Vector(384)), chunk_index, created_at
- Migration creates both audit_events and kb tables with appropriate indexes and RLS
</behavior>
<action>
1. Create `packages/shared/shared/models/audit.py`:
- AuditEvent: id (UUID PK), tenant_id (UUID NOT NULL), agent_id (UUID), user_id (TEXT), action_type (TEXT NOT NULL 'llm_call' | 'tool_invocation' | 'escalation'), input_summary (TEXT), output_summary (TEXT), latency_ms (INTEGER), metadata (JSONB, default={}), created_at (TIMESTAMPTZ, server_default=now())
- AuditEvent: id (UUID PK), tenant_id (UUID NOT NULL), agent_id (UUID), user_id (TEXT), action_type (TEXT NOT NULL -- 'llm_call' | 'tool_invocation' | 'escalation'), input_summary (TEXT), output_summary (TEXT), latency_ms (INTEGER), metadata (JSONB, default={}), created_at (TIMESTAMPTZ, server_default=now())
- RLS enabled + forced, same pattern as other tenant-scoped tables
2. Create `packages/shared/shared/models/kb.py`:
@@ -145,7 +128,7 @@ Output: Tool registry + executor, 4 builtin tools, audit logger, DB migration, u
3. Create Alembic migration `003_phase2_audit_kb.py`:
- audit_events table with all columns, index on (tenant_id, created_at DESC), RLS
- REVOKE UPDATE, DELETE ON audit_events FROM konstruct_app immutability enforced at DB level
- REVOKE UPDATE, DELETE ON audit_events FROM konstruct_app -- immutability enforced at DB level
- kb_documents and kb_chunks tables, HNSW index on kb_chunks embedding, RLS
- GRANT SELECT, INSERT on audit_events TO konstruct_app
- GRANT SELECT, INSERT, UPDATE, DELETE on kb_documents and kb_chunks TO konstruct_app
@@ -157,53 +140,94 @@ Output: Tool registry + executor, 4 builtin tools, audit logger, DB migration, u
- async log_escalation(tenant_id, agent_id, user_id, trigger_reason, metadata={})
- All methods write to audit_events table with RLS context set
5. Create `packages/orchestrator/orchestrator/tools/registry.py`:
- ToolDefinition Pydantic model: name, description, parameters (dict — JSON Schema), requires_confirmation (bool, default False), handler (Any, excluded from serialization)
5. Write integration tests (test_audit.py):
- Test that audit events are written to DB with correct fields
- Test that UPDATE/DELETE is rejected (expect error)
- Test RLS isolation between tenants
</action>
<verify>
<automated>cd /home/adelorenzo/repos/konstruct && python -m pytest tests/integration/test_audit.py -x -v</automated>
</verify>
<done>
- AuditEvent and KB ORM models exist with correct schema
- Audit events written to DB for LLM calls, tool invocations, and escalations
- audit_events immutability enforced (UPDATE/DELETE rejected at DB level)
- RLS isolates audit data per tenant
- Migration applies cleanly with both audit and KB tables
</done>
</task>
<task type="auto" tdd="true">
<name>Task 2: Tool registry, executor, and 4 built-in tools with tests</name>
<files>
packages/orchestrator/orchestrator/tools/__init__.py,
packages/orchestrator/orchestrator/tools/registry.py,
packages/orchestrator/orchestrator/tools/executor.py,
packages/orchestrator/orchestrator/tools/builtins/__init__.py,
packages/orchestrator/orchestrator/tools/builtins/web_search.py,
packages/orchestrator/orchestrator/tools/builtins/kb_search.py,
packages/orchestrator/orchestrator/tools/builtins/http_request.py,
packages/orchestrator/orchestrator/tools/builtins/calendar_lookup.py,
tests/unit/test_tool_registry.py,
tests/unit/test_tool_executor.py
</files>
<behavior>
- ToolDefinition has name, description, parameters (JSON Schema), requires_confirmation, handler
- BUILTIN_TOOLS contains 4 tools: web_search, kb_search, http_request, calendar_lookup
- get_tools_for_agent filters BUILTIN_TOOLS by agent's configured tool list
- execute_tool validates args against tool's JSON schema before calling handler
- execute_tool with invalid args returns error string and logs the failure
- execute_tool with unknown tool name raises ValueError
- execute_tool with requires_confirmation=True returns a confirmation request instead of executing
- web_search tool calls Brave Search API and returns structured results
- kb_search tool queries pgvector knowledge base (kb_chunks table)
- http_request tool makes outbound HTTP with timeout (30s), size cap (1MB), allowed methods (GET/POST/PUT/DELETE)
- calendar_lookup tool queries Google Calendar events.list for availability
</behavior>
<action>
1. Create `packages/orchestrator/orchestrator/tools/registry.py`:
- ToolDefinition Pydantic model: name, description, parameters (dict -- JSON Schema), requires_confirmation (bool, default False), handler (Any, excluded from serialization)
- BUILTIN_TOOLS: dict[str, ToolDefinition] with 4 tools
- get_tools_for_agent(agent: Agent) -> dict[str, ToolDefinition]: filters by agent.tools list
- to_litellm_format(tools: dict) -> list[dict]: converts to OpenAI function-calling schema for LiteLLM
6. Create `packages/orchestrator/orchestrator/tools/executor.py`:
2. Create `packages/orchestrator/orchestrator/tools/executor.py`:
- async execute_tool(tool_call: dict, registry: dict, tenant_id, agent_id, audit_logger) -> str
- Validates args via jsonschema.validate() BEFORE calling handler (LLM output is untrusted)
- If requires_confirmation is True, return a confirmation message string instead of executing
- Logs every invocation (success or failure) to audit trail
- Install jsonschema: `uv add jsonschema` in orchestrator package
7. Create 4 built-in tool handlers in `tools/builtins/`:
- web_search.py: async web_search(query: str) -> str. Uses Brave Search API via httpx. Env var: BRAVE_API_KEY. Returns top 3 results formatted as text. Install: `uv add brave-search` or use raw httpx to https://api.search.brave.com/res/v1/web/search
3. Create 4 built-in tool handlers in `tools/builtins/`:
- web_search.py: async web_search(query: str) -> str. Uses Brave Search API via httpx. Env var: BRAVE_API_KEY. Returns top 3 results formatted as text.
- kb_search.py: async kb_search(query: str, tenant_id: str, agent_id: str) -> str. Embeds query, searches kb_chunks via pgvector. Returns top 3 matching chunks as text.
- http_request.py: async http_request(url: str, method: str = "GET", body: str | None = None) -> str. Timeout 30s, response size cap 1MB, allowed methods GET/POST/PUT/DELETE. requires_confirmation=True.
- calendar_lookup.py: async calendar_lookup(date: str, calendar_id: str = "primary") -> str. Uses google-api-python-client events.list(). Requires GOOGLE_SERVICE_ACCOUNT_KEY env var or per-tenant OAuth. Returns formatted availability. requires_confirmation=False (read-only).
8. Write unit tests:
4. Write unit tests:
- test_tool_registry.py: test tool lookup, filtering by agent, LiteLLM format conversion
- test_tool_executor.py: test schema validation (valid args pass, invalid rejected), confirmation flow, unknown tool error, audit logging called
9. Write integration tests:
- test_audit.py: test that audit events are written to DB, test that UPDATE/DELETE is rejected (expect error), test RLS isolation between tenants
- test_tool_executor.py: test schema validation (valid args pass, invalid rejected), confirmation flow, unknown tool error, audit logging called (mock audit_logger)
</action>
<verify>
<automated>cd /home/adelorenzo/repos/konstruct && python -m pytest tests/unit/test_tool_registry.py tests/unit/test_tool_executor.py tests/integration/test_audit.py -x -v</automated>
<automated>cd /home/adelorenzo/repos/konstruct && python -m pytest tests/unit/test_tool_registry.py tests/unit/test_tool_executor.py -x -v</automated>
</verify>
<done>
- 4 built-in tools registered with JSON Schema definitions
- Tool executor validates args and rejects invalid input
- Confirmation-required tools return confirmation message instead of executing
- Audit events written to DB for every tool call and LLM call
- audit_events immutability enforced (UPDATE/DELETE rejected)
- RLS isolates audit data per tenant
- Tool registry converts to LiteLLM function-calling format
- All unit tests pass
</done>
</task>
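The registry behavior in Task 2 can be sketched as follows. This is a minimal illustration, not the planned implementation: it uses a stdlib dataclass in place of the Pydantic model, registers only two of the four tools, and `get_tools_for_agent` takes a plain list of tool names rather than an `Agent` object. The tool descriptions and schemas are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ToolDefinition:
    """Stand-in for the Pydantic ToolDefinition model described in the plan."""
    name: str
    description: str
    parameters: dict[str, Any]  # JSON Schema describing the tool arguments
    requires_confirmation: bool = False
    # The handler is never serialized; it is only called by the executor.
    handler: Optional[Callable[..., Any]] = field(default=None, repr=False)


# Only two of the four built-ins are shown; schemas here are illustrative.
BUILTIN_TOOLS: dict[str, ToolDefinition] = {
    "web_search": ToolDefinition(
        name="web_search",
        description="Search the web and return the top results.",
        parameters={
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    ),
    "http_request": ToolDefinition(
        name="http_request",
        description="Make an outbound HTTP request.",
        parameters={
            "type": "object",
            "properties": {
                "url": {"type": "string"},
                "method": {"type": "string", "enum": ["GET", "POST", "PUT", "DELETE"]},
            },
            "required": ["url"],
        },
        requires_confirmation=True,
    ),
}


def get_tools_for_agent(agent_tools: list[str]) -> dict[str, ToolDefinition]:
    """Keep only the built-in tools named in the agent's configured tool list."""
    return {name: tool for name, tool in BUILTIN_TOOLS.items() if name in agent_tools}


def to_litellm_format(tools: dict[str, ToolDefinition]) -> list[dict[str, Any]]:
    """Convert tool definitions to the OpenAI function-calling schema."""
    return [
        {
            "type": "function",
            "function": {
                "name": tool.name,
                "description": tool.description,
                "parameters": tool.parameters,
            },
        }
        for tool in tools.values()
    ]
```

Unknown names in the agent's tool list are silently dropped here; whether the real registry should warn or raise in that case is left open by the plan.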
<task type="auto">
<name>Task 2: Wire tool-call loop into agent runner and orchestrator pipeline</name>
<name>Task 3: Wire tool-call loop into agent runner and orchestrator pipeline</name>
<files>
packages/orchestrator/orchestrator/agents/runner.py,
packages/orchestrator/orchestrator/tasks.py
</files>
<action>
1. Update `runner.py` implement tool-call loop:
1. Update `runner.py` -- implement tool-call loop:
- After LLM response, check if response contains `tool_calls` array (LiteLLM returns this in OpenAI format)
- If tool_calls present: for each tool call, dispatch to execute_tool()
- If tool requires confirmation: stop the loop, return the confirmation message to the user, store pending action in Redis (pending_tool_confirm_key)
@@ -224,7 +248,7 @@ Output: Tool registry + executor, 4 builtin tools, audit logger, DB migration, u
- Forward to litellm.acompletion(tools=tools) when present
- Return tool_calls in response when LLM produces them
CRITICAL: The tool loop happens inside the Celery task (sync context with asyncio.run). Each iteration of the loop is an async function call within the same asyncio.run() block. Do NOT dispatch separate Celery tasks for tool execution it all happens in one task invocation.
CRITICAL: The tool loop happens inside the Celery task (sync context with asyncio.run). Each iteration of the loop is an async function call within the same asyncio.run() block. Do NOT dispatch separate Celery tasks for tool execution -- it all happens in one task invocation.
Seamless tool usage per user decision: The agent's system prompt should NOT include instructions like "announce when using tools." The tool results are injected as context and the LLM naturally incorporates them. The confirmation flow is the only user-visible tool interaction.
</action>

@@ -10,9 +10,6 @@ files_modified:
- packages/gateway/gateway/main.py
- packages/shared/shared/models/message.py
- packages/shared/shared/config.py
- packages/orchestrator/orchestrator/agents/runner.py
- packages/orchestrator/orchestrator/tasks.py
- migrations/versions/004_phase2_media.py
- tests/unit/test_whatsapp_verify.py
- tests/unit/test_whatsapp_normalize.py
- tests/unit/test_whatsapp_scoping.py
@@ -55,10 +52,12 @@ must_haves:
---
<objective>
Build the WhatsApp Business Cloud API adapter in the Channel Gateway: webhook verification, signature checking, message normalization to KonstructMessage, business-function scoping gate, media handling, and outbound message delivery. Extend the shared message model with typed media attachments.
Build the WhatsApp Business Cloud API adapter in the Channel Gateway: webhook verification, signature checking, message normalization to KonstructMessage, business-function scoping gate, media handling (download + MinIO storage), and outbound message delivery. Extend the shared message model with typed media attachments.
Purpose: Adds the second messaging channel, enabling SMBs to deploy their AI employee on WhatsApp the most common business communication channel globally.
Purpose: Adds the second messaging channel, enabling SMBs to deploy their AI employee on WhatsApp -- the most common business communication channel globally.
Output: WhatsApp adapter, media model extension, business-function scoping, passing tests.
Note: Outbound wiring in orchestrator tasks.py (channel-aware response routing) and multimodal LLM interpretation of media are handled in Plan 02-05 to avoid file conflicts with Plans 02-01 and 02-02 which also modify tasks.py.
</objective>
<execution_context>
@@ -146,7 +145,7 @@ From packages/shared/shared/config.py:
3. Create `normalize_whatsapp_event()` in normalize.py:
- Takes: parsed webhook JSON body (dict)
- Extracts: entry[0].changes[0].value this is the Meta Cloud API v20.0 structure
- Extracts: entry[0].changes[0].value -- this is the Meta Cloud API v20.0 structure
- Maps: messages[0].from -> sender.user_id, messages[0].text.body -> content.text
- For media messages (type=image/document): extract media_id, set MediaAttachment with media_type and a placeholder URL (actual download happens in the adapter)
- Sets channel='whatsapp', thread_id=sender_wa_id (WhatsApp conversations are per-phone-number, not threaded)
@@ -171,7 +170,7 @@ From packages/shared/shared/config.py:
</task>
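A rough sketch of the normalization step from Task 1, assuming the `entry[0].changes[0].value` webhook structure named in the plan. It returns a plain dict instead of a `KonstructMessage`, and the output key names and the `None` placeholder URL are assumptions made for illustration only.

```python
from typing import Any, Optional


def normalize_whatsapp_event(body: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Map a parsed WhatsApp webhook body to a message-like dict.

    Returns None for events without a `messages` key (status updates, read receipts).
    """
    try:
        value = body["entry"][0]["changes"][0]["value"]
    except (KeyError, IndexError, TypeError):
        return None

    messages = value.get("messages")
    if not messages:
        return None

    msg = messages[0]
    msg_type = msg.get("type")
    sender_wa_id = msg["from"]

    media = []
    text = ""
    if msg_type == "text":
        text = msg.get("text", {}).get("body", "")
    elif msg_type in ("image", "document"):
        payload = msg.get(msg_type) or {}
        text = payload.get("caption", "")
        media.append(
            {
                "media_type": msg_type,
                "media_id": payload.get("id"),
                "mime_type": payload.get("mime_type"),
                "filename": payload.get("filename"),
                "url": None,  # placeholder; the adapter fills this in after download
            }
        )

    return {
        "channel": "whatsapp",
        # WhatsApp conversations are per phone number, so the sender ID doubles as thread ID.
        "thread_id": sender_wa_id,
        "message_id": msg.get("id"),
        "phone_number_id": value.get("metadata", {}).get("phone_number_id"),
        "sender": {"user_id": sender_wa_id},
        "content": {"text": text, "media": media},
    }
```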
<task type="auto" tdd="true">
<name>Task 2: WhatsApp adapter with business-function scoping, media handling, and outbound delivery</name>
<name>Task 2: WhatsApp adapter with business-function scoping, media download/storage, and outbound delivery</name>
<files>
packages/gateway/gateway/channels/whatsapp.py,
packages/gateway/gateway/main.py,
@@ -190,12 +189,12 @@ From packages/shared/shared/config.py:
<action>
1. Create `packages/gateway/gateway/channels/whatsapp.py`:
- whatsapp_router = APIRouter()
- GET /whatsapp/webhook verification handshake: check hub.mode=="subscribe" and hub.verify_token matches settings, return hub.challenge as PlainTextResponse
- POST /whatsapp/webhook inbound message handler:
- GET /whatsapp/webhook -- verification handshake: check hub.mode=="subscribe" and hub.verify_token matches settings, return hub.challenge as PlainTextResponse
- POST /whatsapp/webhook -- inbound message handler:
a. Read raw body via request.body() BEFORE parsing
b. Verify HMAC-SHA256 signature (X-Hub-Signature-256 header)
c. Parse JSON from raw body
d. Skip non-message events (status updates, read receipts check for messages key)
d. Skip non-message events (status updates, read receipts -- check for messages key)
e. Normalize via normalize_whatsapp_event()
f. Resolve tenant via phone_number_id as workspace_id (same resolve_tenant function as Slack)
g. Check rate limit (reuse existing check_rate_limit)
@@ -203,32 +202,29 @@ From packages/shared/shared/config.py:
i. Business-function scoping check (see below)
j. If media: download from Meta API, upload to MinIO with key {tenant_id}/{agent_id}/{message_id}/{filename}, update MediaAttachment.storage_key and .url (presigned URL)
k. Dispatch handle_message.delay() with msg payload + extras (bot_token from channel_connections.config['access_token'], phone_number_id)
- Always return 200 OK to Meta (even on errors Meta retries on non-200)
- Always return 200 OK to Meta (even on errors -- Meta retries on non-200)
2. Business-function scoping (two-tier gate per user decision):
- Tier 1: is_clearly_off_topic(text, allowed_functions) simple keyword overlap check. If zero overlap with any allowed function keywords, return True. Agent's allowed_functions come from Agent model (add `allowed_functions: list[str] = []` field if not present, or use agent.tools as proxy).
- Tier 1: is_clearly_off_topic(text, allowed_functions) -- simple keyword overlap check. If zero overlap with any allowed function keywords, return True. Agent's allowed_functions come from Agent model (add `allowed_functions: list[str] = []` field if not present, or use agent.tools as proxy).
- If clearly off-topic: send canned redirect via send_whatsapp_message: "{agent.name} is here to help with {', '.join(allowed_functions)}. How can I assist you with one of those?"
- Tier 2: Borderline messages pass to the LLM. The scoping is enforced via the system prompt (which already contains the agent's role and persona). Add to system prompt builder: if channel == 'whatsapp', append "You only handle: {allowed_functions}. If a request is outside these areas, politely redirect the user."
3. Outbound message delivery:
3. Outbound message delivery (used by the adapter for direct responses like off-topic canned replies):
- async send_whatsapp_message(phone_number_id, access_token, recipient_wa_id, text) -> None
- POST to https://graph.facebook.com/v20.0/{phone_number_id}/messages with messaging_product="whatsapp", to=recipient_wa_id, type="text", text={"body": text}
- async send_whatsapp_media(phone_number_id, access_token, recipient_wa_id, media_url, media_type) for outbound media
4. Wire into orchestrator tasks.py:
- In handle_message task, after getting LLM response, check channel type
- If 'whatsapp': call send_whatsapp_message via httpx (same pattern as Slack chat.update — bot token from extras)
- If 'slack': existing chat.update flow
4. Install boto3 for MinIO: `uv add boto3` in gateway package. Use endpoint_url=settings.minio_endpoint for S3-compatible MinIO access.
5. Install boto3 for MinIO: `uv add boto3` in gateway package. Use endpoint_url=settings.minio_endpoint for S3-compatible MinIO access.
5. Register whatsapp_router in gateway main.py: `app.include_router(whatsapp_router)`
6. Register whatsapp_router in gateway main.py: `app.include_router(whatsapp_router)`
7. Write test_whatsapp_scoping.py:
6. Write test_whatsapp_scoping.py:
- Test is_clearly_off_topic with matching keywords -> False
- Test is_clearly_off_topic with zero overlap -> True
- Test canned redirect message format includes agent name and allowed functions
- Test borderline message passes through (not rejected by tier 1)
Note: The orchestrator-side wiring (channel-aware outbound routing in tasks.py) is deferred to Plan 02-05 to avoid file conflicts with Plans 02-01 and 02-02. The WhatsApp adapter can handle direct responses (off-topic canned replies, webhook verification) independently. LLM-generated responses routed back through WhatsApp will be wired in Plan 02-05.
</action>
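Steps a and b of the POST handler, and the GET handshake, can be sketched with the standard library alone. The function names and the `app_secret` and `expected_token` parameters are hypothetical; the sketch assumes the `X-Hub-Signature-256` header carries `sha256=` followed by the hex HMAC of the raw request body.

```python
import hashlib
import hmac
from typing import Optional


def verify_whatsapp_signature(
    raw_body: bytes, signature_header: Optional[str], app_secret: str
) -> bool:
    """Check the X-Hub-Signature-256 header against the raw (unparsed) request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    received = signature_header[len("sha256="):]
    expected = hmac.new(app_secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(received, expected)


def check_verification_handshake(
    mode: Optional[str], token: Optional[str], challenge: Optional[str], expected_token: str
) -> Optional[str]:
    """Return hub.challenge when the GET handshake is valid, otherwise None."""
    if mode == "subscribe" and token is not None and hmac.compare_digest(token, expected_token):
        return challenge
    return None
```

The signature must be computed over the exact bytes received, which is why the handler reads `request.body()` before any JSON parsing: re-serializing parsed JSON would generally not reproduce the same bytes.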
<verify>
<automated>cd /home/adelorenzo/repos/konstruct && python -m pytest tests/unit/test_whatsapp_verify.py tests/unit/test_whatsapp_normalize.py tests/unit/test_whatsapp_scoping.py -x -v</automated>
@@ -238,7 +234,7 @@ From packages/shared/shared/config.py:
- Signature verification on raw body bytes before JSON parsing
- Business-function scoping: tier 1 rejects clearly off-topic, tier 2 scopes via system prompt
- Media downloaded from Meta API and stored in MinIO with tenant-prefixed keys
- Outbound messages sent via Meta Cloud API
- Outbound text and media messages sent via Meta Cloud API (for adapter-direct responses)
- Gateway routes registered and running
- Canned redirect includes agent name and allowed topics
</done>
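The tier 1 gate could look roughly like this. The tokenization rule (lowercase alphanumeric words) and the decision to pass every message through when no functions are configured are assumptions; the plan only specifies "simple keyword overlap". `build_redirect` is a hypothetical helper name for the canned reply.

```python
import re

_WORD = re.compile(r"[a-z0-9]+")


def _keywords(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a deliberately naive tokenizer."""
    return set(_WORD.findall(text.lower()))


def is_clearly_off_topic(text: str, allowed_functions: list[str]) -> bool:
    """Tier 1 gate: True only when the message shares no keyword with any allowed function."""
    allowed: set[str] = set()
    for function_name in allowed_functions:
        allowed |= _keywords(function_name)
    if not allowed:
        # Assumption: with no scope configured, defer to the tier 2 (LLM) check.
        return False
    return _keywords(text).isdisjoint(allowed)


def build_redirect(agent_name: str, allowed_functions: list[str]) -> str:
    """Canned reply sent when tier 1 rejects a message."""
    return (
        f"{agent_name} is here to help with {', '.join(allowed_functions)}. "
        "How can I assist you with one of those?"
    )
```

A gate this literal will reject on-topic messages that use synonyms ("invoice" versus "billing"), so a real implementation would probably need a keyword list per function rather than the function names alone.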
@@ -256,7 +252,7 @@ From packages/shared/shared/config.py:
- WhatsApp messages are normalized to KonstructMessage and dispatched through the existing pipeline
- Webhook signature verification prevents unauthorized requests
- Business-function scoping enforces Meta 2026 policy (tier 1 keyword gate + tier 2 LLM scoping)
- Media attachments are downloaded, stored in MinIO, and available for multimodal LLM processing
- Media attachments are downloaded, stored in MinIO, and available for downstream processing
- Per-tenant phone number isolation via phone_number_id in channel_connections
</success_criteria>

@@ -0,0 +1,230 @@
---
phase: 02-agent-features
plan: 05
type: execute
wave: 3
depends_on: ["02-02", "02-03"]
files_modified:
- packages/orchestrator/orchestrator/tasks.py
- packages/orchestrator/orchestrator/agents/builder.py
- packages/orchestrator/orchestrator/agents/runner.py
- packages/gateway/gateway/channels/slack.py
- tests/unit/test_multimodal_messages.py
- tests/unit/test_slack_media.py
autonomous: true
requirements:
- CHAN-03
must_haves:
truths:
- "Agent can RECEIVE images and documents and interpret them via multimodal LLM"
- "Media attachments from WhatsApp are passed to the LLM as image_url content blocks"
- "Media attachments from Slack file_share events are downloaded, stored in MinIO, and passed to the LLM as image_url content blocks"
- "Orchestrator routes LLM responses back through the correct channel (Slack or WhatsApp)"
- "Non-vision models gracefully skip image content blocks instead of erroring"
- "Agent can SEND images and documents back to users on both Slack and WhatsApp"
artifacts:
- path: "packages/orchestrator/orchestrator/agents/builder.py"
provides: "build_messages_with_media() that injects image_url content blocks for media attachments"
exports: ["build_messages_with_media"]
- path: "packages/orchestrator/orchestrator/tasks.py"
provides: "Channel-aware outbound routing (Slack chat.update vs WhatsApp send_whatsapp_message)"
- path: "tests/unit/test_multimodal_messages.py"
provides: "Tests for image_url content block injection and vision model detection"
- path: "tests/unit/test_slack_media.py"
provides: "Tests for Slack file_share event extraction"
key_links:
- from: "packages/orchestrator/orchestrator/agents/builder.py"
to: "MinIO presigned URLs"
via: "generate_presigned_url for each MediaAttachment.storage_key"
pattern: "generate_presigned_url|image_url"
- from: "packages/orchestrator/orchestrator/tasks.py"
to: "gateway/channels/whatsapp.py send_whatsapp_message"
via: "httpx POST for WhatsApp outbound delivery"
pattern: "send_whatsapp_message|channel.*whatsapp"
- from: "packages/gateway/gateway/channels/slack.py"
to: "MinIO storage"
via: "Slack file download -> MinIO upload for file_share events"
pattern: "file_share|files\\.info"
---
<objective>
Wire cross-channel media support and multimodal LLM interpretation into the orchestrator pipeline. Add Slack file_share media extraction, channel-aware outbound routing (Slack vs WhatsApp), and image_url content block injection so the LLM can interpret images and documents sent by users.
Purpose: Completes the locked decision "Agent can RECEIVE images and documents and interpret them via multimodal LLM" and "Bidirectional media support across Slack and WhatsApp." Without this plan, media is stored in MinIO but never interpreted by the LLM, and WhatsApp responses are not routed back to users.
Output: Multimodal message building, Slack media handling, channel-aware outbound routing, passing tests.
</objective>
<execution_context>
@/home/adelorenzo/.claude/get-shit-done/workflows/execute-plan.md
@/home/adelorenzo/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-agent-features/02-CONTEXT.md
@.planning/phases/02-agent-features/02-RESEARCH.md
@.planning/phases/02-agent-features/02-02-SUMMARY.md
@.planning/phases/02-agent-features/02-03-SUMMARY.md
@packages/orchestrator/orchestrator/tasks.py
@packages/orchestrator/orchestrator/agents/builder.py
@packages/orchestrator/orchestrator/agents/runner.py
@packages/gateway/gateway/channels/slack.py
@packages/gateway/gateway/channels/whatsapp.py
@packages/shared/shared/models/message.py
<interfaces>
<!-- From Plan 02-03: MediaAttachment model and WhatsApp outbound functions -->
From packages/shared/shared/models/message.py:
- MediaType(StrEnum): IMAGE, DOCUMENT, AUDIO, VIDEO
- MediaAttachment(BaseModel): media_type, url, storage_key, mime_type, filename, size_bytes
- MessageContent.media: list[MediaAttachment]
From packages/gateway/gateway/channels/whatsapp.py:
- async send_whatsapp_message(phone_number_id, access_token, recipient_wa_id, text) -> None
- async send_whatsapp_media(phone_number_id, access_token, recipient_wa_id, media_url, media_type)
<!-- From Plan 02-02: Tool loop and runner -->
From packages/orchestrator/orchestrator/agents/runner.py:
- run_agent() with tool-call loop, accepts messages parameter
- AuditLogger passed through the loop
<!-- From Plan 02-01: Memory-enriched message building -->
From packages/orchestrator/orchestrator/agents/builder.py:
- build_messages_with_memory(agent, current_message, recent_messages, relevant_context) -> list[dict]
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: Slack file_share media extraction and channel-aware outbound routing</name>
<files>
packages/gateway/gateway/channels/slack.py,
packages/orchestrator/orchestrator/tasks.py,
tests/unit/test_slack_media.py
</files>
<behavior>
- Slack file_share events extract file URL, download via Slack API, upload to MinIO with tenant-prefixed key
- Slack file_share events populate MediaAttachment on the KonstructMessage before dispatching to orchestrator
- handle_message in tasks.py checks msg.channel to determine outbound delivery method
- channel=='slack': existing chat.update flow (unchanged)
- channel=='whatsapp': call send_whatsapp_message via httpx to gateway or directly via Meta API
- WhatsApp outbound includes phone_number_id and access_token from task extras
</behavior>
<action>
1. Update `packages/gateway/gateway/channels/slack.py`:
- In the existing Slack event handler, detect `file_share` subtype events
- For file_share events: extract file info via Slack `files.info` API call (uses bot_token)
- Download the file content via the file's `url_private_download` (with Authorization: Bearer bot_token)
- Upload to MinIO with key: {tenant_id}/{agent_id}/{message_id}/{filename}
- Create MediaAttachment with media_type (infer from mime_type: image/* -> IMAGE, application/pdf -> DOCUMENT, etc.), storage_key, mime_type, filename, size_bytes
- Attach to the KonstructMessage.content.media list before dispatching handle_message.delay()
- Install boto3 in gateway if not already: `uv add boto3` (may already be installed from Plan 02-03)
2. Update `packages/orchestrator/orchestrator/tasks.py` -- channel-aware outbound routing:
- After getting the LLM response text, check msg['channel'] (from the deserialized KonstructMessage)
- If channel == 'slack': use existing chat.update flow (no change)
- If channel == 'whatsapp': deliver via send_whatsapp_message, either by importing and calling the function directly or by making an httpx POST to the gateway's WhatsApp send endpoint. Use phone_number_id and access_token from the extras dict passed through Celery.
- Extract a helper function: async send_response(channel, text, extras) that dispatches to the correct channel's outbound method. This keeps the main handle_message clean.
- For media responses (when the agent generates/references files to send back): check if the LLM response contains file references, and use send_whatsapp_media or the Slack SDK's files_upload_v2 helper accordingly. For v1, text-only outbound is sufficient -- media outbound can be a follow-up.
3. Write test_slack_media.py:
- Test file_share event detection (file_share subtype identified)
- Test MediaAttachment creation from Slack file metadata (correct media_type, filename, mime_type)
- Test MinIO upload key format: {tenant_id}/{agent_id}/{message_id}/{filename}
- Mock httpx/boto3 calls to avoid real API hits
</action>
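The pure-logic parts of the Slack file_share handling (event detection, media type inference, and the MinIO key format) can be sketched without any network calls. The enum values, the fallback to DOCUMENT for unrecognized MIME types, and the filename sanitization are assumptions; the plan uses `StrEnum`, replaced here by `str, Enum` for wider Python compatibility.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Any


class MediaType(str, Enum):
    """Mirrors the MediaType enum from the shared message model (values assumed)."""
    IMAGE = "image"
    DOCUMENT = "document"
    AUDIO = "audio"
    VIDEO = "video"


def is_file_share(event: dict[str, Any]) -> bool:
    """Detect Slack message events that carry uploaded files."""
    return event.get("subtype") == "file_share" and bool(event.get("files"))


def infer_media_type(mime_type: str) -> MediaType:
    """Map a MIME type to a MediaType using its top-level type."""
    major = mime_type.split("/", 1)[0].strip().lower()
    if major == "image":
        return MediaType.IMAGE
    if major == "audio":
        return MediaType.AUDIO
    if major == "video":
        return MediaType.VIDEO
    # Assumption: application/pdf and anything unrecognized is treated as a document.
    return MediaType.DOCUMENT


def build_storage_key(tenant_id: str, agent_id: str, message_id: str, filename: str) -> str:
    """Build the tenant-prefixed object key: {tenant_id}/{agent_id}/{message_id}/{filename}."""
    # Keep only the final path component so a crafted filename cannot escape the prefix.
    safe_name = PurePosixPath(filename).name
    return f"{tenant_id}/{agent_id}/{message_id}/{safe_name}"
```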
<verify>
<automated>cd /home/adelorenzo/repos/konstruct && python -m pytest tests/unit/test_slack_media.py -x -v</automated>
</verify>
<done>
- Slack file_share events produce KonstructMessages with populated media list
- Files downloaded from Slack and stored in MinIO with tenant-isolated keys
- Orchestrator routes responses to correct channel (Slack chat.update vs WhatsApp API)
- WhatsApp outbound delivery wired into orchestrator pipeline
</done>
</task>
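One possible shape for the `send_response` helper is a dispatch table keyed by channel. The `senders` registry argument is an addition for testability and is not part of the plan's `send_response(channel, text, extras)` signature; `build_whatsapp_text_payload` is a hypothetical helper that assembles the JSON body described in Plan 02-03.

```python
import asyncio
from typing import Any, Awaitable, Callable

# A sender receives the response text and the extras dict passed through Celery.
Sender = Callable[[str, dict[str, Any]], Awaitable[None]]


def build_whatsapp_text_payload(recipient_wa_id: str, text: str) -> dict[str, Any]:
    """JSON body for a text message POSTed to /{phone_number_id}/messages."""
    return {
        "messaging_product": "whatsapp",
        "to": recipient_wa_id,
        "type": "text",
        "text": {"body": text},
    }


async def send_response(
    channel: str, text: str, extras: dict[str, Any], senders: dict[str, Sender]
) -> None:
    """Dispatch the LLM response to the outbound method registered for the channel."""
    try:
        sender = senders[channel]
    except KeyError:
        raise ValueError(f"Unsupported channel: {channel!r}") from None
    await sender(text, extras)


# Usage with recording stubs in place of the real Slack and WhatsApp senders.
sent: list[tuple[str, Any]] = []


async def _fake_slack(text: str, extras: dict[str, Any]) -> None:
    sent.append(("slack", text))


async def _fake_whatsapp(text: str, extras: dict[str, Any]) -> None:
    sent.append(("whatsapp", build_whatsapp_text_payload(extras["recipient_wa_id"], text)))


SENDERS: dict[str, Sender] = {"slack": _fake_slack, "whatsapp": _fake_whatsapp}
```

Keeping the channel lookup in one place means adding a third channel later touches the registry only, not `handle_message`.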
<task type="auto" tdd="true">
<name>Task 2: Multimodal LLM interpretation -- image_url content blocks for media attachments</name>
<files>
packages/orchestrator/orchestrator/agents/builder.py,
packages/orchestrator/orchestrator/agents/runner.py,
tests/unit/test_multimodal_messages.py
</files>
<behavior>
- build_messages_with_media detects MediaAttachment objects on the current message
- For IMAGE media: generates a MinIO presigned URL and injects an image_url content block into the user message
- For DOCUMENT media: generates a presigned URL and includes it as a text reference (PDFs cannot be image_url blocks)
- supports_vision(model_name) returns True for known vision models (claude-3*, gpt-4o*, gpt-4-vision*, gemini-pro-vision*, gemini-1.5*)
- When model does NOT support vision: image_url blocks are stripped and replaced with text "[Image attached: {filename}]"
- LLM messages array uses the multipart content format: {"role": "user", "content": [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "..."}}]}
- Presigned URLs have a 1-hour expiry
</behavior>
<action>
1. Update `packages/orchestrator/orchestrator/agents/builder.py`:
- Add supports_vision(model_name: str) -> bool function:
- Returns True if model_name matches known vision-capable patterns: "claude-3" (Claude 3.x models; later families with different naming need their own pattern), "gpt-4o", "gpt-4-vision", "gemini-pro-vision", "gemini-1.5"
- Check via LiteLLM's litellm.supports_vision(model) if available, otherwise use the pattern match above
- Add generate_presigned_url(storage_key: str, expiry: int = 3600) -> str function:
- Uses boto3 S3 client with MinIO endpoint to generate a presigned GET URL
- Expiry defaults to 1 hour
- Update build_messages_with_memory() (or create build_messages_with_media() wrapper):
- After assembling the messages array, check if current_message has media attachments
- For each MediaAttachment with media_type == IMAGE and a storage_key:
- Generate presigned URL
- If model supports vision: convert the user message content from a plain string to multipart format:
[{"type": "text", "text": original_text}, {"type": "image_url", "image_url": {"url": presigned_url, "detail": "auto"}}]
- If model does NOT support vision: append "[Image attached: {filename}]" to the text content instead
- For DOCUMENT attachments: append "[Document attached: {filename} - {presigned_url}]" to text content (documents are text-referenced, not image_url blocks)
- This follows the OpenAI multimodal message format, which LiteLLM translates for other providers such as Anthropic
2. Update `packages/orchestrator/orchestrator/agents/runner.py`:
- Ensure the messages array with multipart content blocks is passed through to the LLM call without modification
- The tool-call loop must preserve multipart content format when re-calling the LLM
- No changes needed if runner already passes messages directly to llm-pool -- just verify
3. Write test_multimodal_messages.py:
- Test: message with IMAGE MediaAttachment + vision model produces image_url content block
- Test: message with IMAGE MediaAttachment + non-vision model produces text fallback "[Image attached: ...]"
- Test: message with DOCUMENT MediaAttachment produces text reference with presigned URL
- Test: message with no media produces standard text-only content (no regression)
- Test: supports_vision returns True for "claude-3-sonnet", "gpt-4o", False for "gpt-3.5-turbo"
- Test: presigned URL has correct format and expiry (mock boto3)
</action>
<verify>
<automated>cd /home/adelorenzo/repos/konstruct && python -m pytest tests/unit/test_multimodal_messages.py -x -v</automated>
</verify>
<done>
- Image attachments from both Slack and WhatsApp are passed to the LLM as image_url content blocks
- Vision-capable models receive image_url blocks; non-vision models get text fallback
- Document attachments are text-referenced with presigned URLs
- Presigned URLs generated from MinIO with 1-hour expiry
- No regression for text-only messages
</done>
</task>
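The content-building rules from Task 2 can be sketched as below. This is an illustration under assumptions: attachments are plain dicts rather than `MediaAttachment` objects, `build_user_content` is a hypothetical helper covering only the user-message content, and `presign` is an injected callable standing in for `generate_presigned_url` so the example runs without boto3 or MinIO. The pattern-match fallback is shown; the `litellm.supports_vision` check is omitted.

```python
from typing import Any, Callable, Union

VISION_MODEL_PATTERNS = ("claude-3", "gpt-4o", "gpt-4-vision", "gemini-pro-vision", "gemini-1.5")

Content = Union[str, list[dict[str, Any]]]


def supports_vision(model_name: str) -> bool:
    """Pattern-match fallback for detecting vision-capable models."""
    name = model_name.lower()
    return any(pattern in name for pattern in VISION_MODEL_PATTERNS)


def build_user_content(
    text: str,
    attachments: list[dict[str, Any]],
    model_name: str,
    presign: Callable[[str], str],
) -> Content:
    """Return plain text, or multipart blocks when images go to a vision-capable model."""
    stored = [a for a in attachments if a.get("storage_key")]
    images = [a for a in stored if a["media_type"] == "image"]
    documents = [a for a in stored if a["media_type"] == "document"]

    # Documents are always referenced in text, never sent as image_url blocks.
    for doc in documents:
        text += f"\n[Document attached: {doc['filename']} - {presign(doc['storage_key'])}]"

    if not images:
        return text

    if not supports_vision(model_name):
        # Non-vision models get a text note instead of an image block.
        for img in images:
            text += f"\n[Image attached: {img['filename']}]"
        return text

    blocks: list[dict[str, Any]] = [{"type": "text", "text": text}]
    for img in images:
        blocks.append(
            {
                "type": "image_url",
                "image_url": {"url": presign(img["storage_key"]), "detail": "auto"},
            }
        )
    return blocks
```

Returning a plain string when there are no images keeps text-only messages byte-for-byte identical to the existing format, which is what the no-regression test checks.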
</tasks>
<verification>
- All Phase 1 + Phase 2 plans 01-04 tests still pass: `pytest tests/ -x`
- Media tests pass: `pytest tests/unit/test_slack_media.py tests/unit/test_multimodal_messages.py -x`
- End-to-end: Slack file_share -> MinIO storage -> image_url in LLM prompt (verified via test mocks)
- End-to-end: WhatsApp image -> MinIO storage -> image_url in LLM prompt (verified via test mocks)
</verification>
<success_criteria>
- Agent interprets images sent via Slack and WhatsApp using multimodal LLM capabilities
- Slack file_share events are extracted, stored in MinIO, and passed to the orchestrator
- Orchestrator routes responses to the correct channel (Slack or WhatsApp)
- Non-vision models gracefully handle media with text fallback
- Bidirectional media support: receive and interpret on both channels
</success_criteria>
<output>
After completion, create `.planning/phases/02-agent-features/02-05-SUMMARY.md`
</output>