| phase | plan | type | wave | depends_on | files_modified | autonomous | requirements | must_haves |
|---|---|---|---|---|---|---|---|---|
| 02-agent-features | 05 | execute | 3 | | | true | | |
Purpose: Completes the locked decision "Agent can RECEIVE images and documents and interpret them via multimodal LLM" and "Bidirectional media support across Slack and WhatsApp." Without this plan, media is stored in MinIO but never interpreted by the LLM, and WhatsApp responses are not routed back to users.
Output: Multimodal message building, Slack media handling, channel-aware outbound routing, passing tests.
<execution_context> @/home/adelorenzo/.claude/get-shit-done/workflows/execute-plan.md @/home/adelorenzo/.claude/get-shit-done/templates/summary.md </execution_context>
@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/02-agent-features/02-CONTEXT.md @.planning/phases/02-agent-features/02-RESEARCH.md @.planning/phases/02-agent-features/02-02-SUMMARY.md @.planning/phases/02-agent-features/02-03-SUMMARY.md @packages/orchestrator/orchestrator/tasks.py @packages/orchestrator/orchestrator/agents/builder.py @packages/orchestrator/orchestrator/agents/runner.py @packages/gateway/gateway/channels/slack.py @packages/gateway/gateway/channels/whatsapp.py @packages/shared/shared/models/message.py
From packages/shared/shared/models/message.py:
- MediaType(StrEnum): IMAGE, DOCUMENT, AUDIO, VIDEO
- MediaAttachment(BaseModel): media_type, url, storage_key, mime_type, filename, size_bytes
- MessageContent.media: list[MediaAttachment]
From packages/gateway/gateway/channels/whatsapp.py:
- async send_whatsapp_message(phone_number_id, access_token, recipient_wa_id, text) -> None
- async send_whatsapp_media(phone_number_id, access_token, recipient_wa_id, media_url, media_type)
From packages/orchestrator/orchestrator/agents/runner.py:
- run_agent() with tool-call loop, accepts messages parameter
- AuditLogger passed through the loop
From packages/orchestrator/orchestrator/agents/builder.py:
- build_messages_with_memory(agent, current_message, recent_messages, relevant_context) -> list[dict]
2. Update `packages/orchestrator/orchestrator/tasks.py` -- channel-aware outbound routing:
- After getting the LLM response text, check msg['channel'] (from the deserialized KonstructMessage)
- If channel == 'slack': use existing chat.update flow (no change)
- If channel == 'whatsapp': deliver the text via send_whatsapp_message. Either import and call the WhatsApp send function directly, or make an httpx POST to the gateway's WhatsApp send endpoint. Use phone_number_id and access_token from the extras dict passed through Celery.
- Extract a helper function: async send_response(channel, text, extras) that dispatches to the correct channel's outbound method. This keeps the main handle_message clean.
- For media responses (when the agent generates/references files to send back): check if the LLM response contains file references, and use send_whatsapp_media or Slack files.upload_v2 accordingly. For v1, text-only outbound is sufficient -- media outbound can be a follow-up.
3. Write test_slack_media.py:
- Test file_share event detection (file_share subtype identified)
- Test MediaAttachment creation from Slack file metadata (correct media_type, filename, mime_type)
- Test MinIO upload key format: {tenant_id}/{agent_id}/{message_id}/{filename}
- Mock httpx/boto3 calls to avoid real API hits
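The send_response helper described in step 2 can be sketched as a small dispatch table. This is a minimal illustration, not the project's implementation: the `SENDERS` registry, the `OUTBOX` list, and the `channel_id` / `recipient_wa_id` keys in `extras` are assumptions made so the example runs without network access. The stand-in senders record calls where real code would call Slack chat.update or send_whatsapp_message.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical sender signature: (text, extras) -> awaitable.
Sender = Callable[[str, dict[str, Any]], Awaitable[None]]

# Stand-in outbound log so the sketch runs without hitting Slack or WhatsApp.
OUTBOX: list[tuple[str, str, str]] = []


async def _send_slack(text: str, extras: dict[str, Any]) -> None:
    # Real code would run the existing chat.update flow here.
    OUTBOX.append(("slack", extras["channel_id"], text))


async def _send_whatsapp(text: str, extras: dict[str, Any]) -> None:
    # Real code would await send_whatsapp_message(phone_number_id,
    # access_token, recipient_wa_id, text) using values from extras.
    OUTBOX.append(("whatsapp", extras["recipient_wa_id"], text))


# Assumed registry mapping channel names to outbound senders.
SENDERS: dict[str, Sender] = {"slack": _send_slack, "whatsapp": _send_whatsapp}


async def send_response(channel: str, text: str, extras: dict[str, Any]) -> None:
    """Dispatch the LLM response text to the channel's outbound sender."""
    sender = SENDERS.get(channel)
    if sender is None:
        raise ValueError(f"unsupported channel: {channel!r}")
    await sender(text, extras)


if __name__ == "__main__":
    asyncio.run(send_response("whatsapp", "hello", {"recipient_wa_id": "15550001"}))
    print(OUTBOX)
```

Keeping the dispatch in a table means adding a channel is one new entry, and an unknown channel fails loudly instead of silently dropping the response.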
cd /home/adelorenzo/repos/konstruct && python -m pytest tests/unit/test_slack_media.py -x -v
- Slack file_share events produce KonstructMessages with populated media list
- Files downloaded from Slack and stored in MinIO with tenant-isolated keys
- Orchestrator routes responses to correct channel (Slack chat.update vs WhatsApp API)
- WhatsApp outbound delivery wired into orchestrator pipeline
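The tenant-isolated key format and the media_type mapping that the Slack tests check could look roughly like the sketch below. The helper names `media_type_from_mime` and `build_storage_key`, the lowercase enum values, and the filename sanitising are assumptions for illustration. `MediaType` here is a local stand-in for the model in packages/shared/shared/models/message.py.

```python
from enum import Enum
from pathlib import PurePosixPath


class MediaType(str, Enum):
    """Local stand-in for shared.models.message.MediaType (values assumed)."""

    IMAGE = "image"
    DOCUMENT = "document"
    AUDIO = "audio"
    VIDEO = "video"


def media_type_from_mime(mime_type: str) -> MediaType:
    """Map a MIME type to a MediaType; anything unrecognised is a DOCUMENT."""
    top_level = mime_type.split("/", 1)[0].strip().lower()
    by_top_level = {
        "image": MediaType.IMAGE,
        "audio": MediaType.AUDIO,
        "video": MediaType.VIDEO,
    }
    return by_top_level.get(top_level, MediaType.DOCUMENT)


def build_storage_key(tenant_id: str, agent_id: str, message_id: str, filename: str) -> str:
    """Build the {tenant_id}/{agent_id}/{message_id}/{filename} object key."""
    # Keep only the final path component so a crafted filename cannot add
    # extra segments to the key (an assumed precaution, not from the plan).
    safe_name = PurePosixPath(filename.replace("\\", "/")).name
    if safe_name in ("", ".", ".."):
        safe_name = "unnamed"
    return f"{tenant_id}/{agent_id}/{message_id}/{safe_name}"


if __name__ == "__main__":
    print(build_storage_key("tenant-1", "agent-7", "msg-42", "diagram.png"))
    print(media_type_from_mime("application/pdf"))
```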
Task 2: Multimodal LLM interpretation -- image_url content blocks for media attachments
packages/orchestrator/orchestrator/agents/builder.py,
packages/orchestrator/orchestrator/agents/runner.py,
tests/unit/test_multimodal_messages.py
- build_messages_with_media detects MediaAttachment objects on the current message
- For IMAGE media: generates a MinIO presigned URL and injects an image_url content block into the user message
- For DOCUMENT media: generates a presigned URL and includes it as a text reference (PDFs cannot be image_url blocks)
- supports_vision(model_name) returns True for known vision models (claude-3*, gpt-4o*, gpt-4-vision*, gemini-pro-vision*, gemini-1.5*)
- When model does NOT support vision: image_url blocks are stripped and replaced with text "[Image attached: {filename}]"
- LLM messages array uses the multipart content format: {"role": "user", "content": [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "..."}}]}
- Presigned URLs have a 1-hour expiry
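The pattern-match fallback for the vision check above might be sketched as follows. The prefix tuple comes from the model patterns named in this plan. Stripping a provider prefix such as `openai/` is an assumption about how model names arrive, and real code would first try LiteLLM's own check where available.

```python
# Prefixes taken from the model patterns listed in this plan.
VISION_MODEL_PREFIXES = (
    "claude-3",
    "gpt-4o",
    "gpt-4-vision",
    "gemini-pro-vision",
    "gemini-1.5",
)


def supports_vision(model_name: str) -> bool:
    """Return True when the model name matches a known vision-capable prefix."""
    # Assumption: names may carry a provider prefix like "openai/gpt-4o",
    # so only the part after the last slash is matched.
    base_name = model_name.rsplit("/", 1)[-1].strip().lower()
    return base_name.startswith(VISION_MODEL_PREFIXES)


if __name__ == "__main__":
    for name in ("claude-3-sonnet", "openai/gpt-4o", "gpt-3.5-turbo"):
        print(name, supports_vision(name))
```

A prefix list like this goes stale as new models ship, which is one reason to treat it as a fallback and not the primary check.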
1. Update `packages/orchestrator/orchestrator/agents/builder.py`:
- Add supports_vision(model_name: str) -> bool function:
  - Returns True if model_name matches known vision-capable patterns: "claude-3" (all Claude 3+ models), "gpt-4o", "gpt-4-vision", "gemini-pro-vision", "gemini-1.5"
  - Check via LiteLLM's litellm.supports_vision(model) if available, otherwise use the pattern match above
- Add generate_presigned_url(storage_key: str, expiry: int = 3600) -> str function:
  - Uses boto3 S3 client with MinIO endpoint to generate a presigned GET URL
  - Expiry defaults to 1 hour
- Update build_messages_with_memory() (or create build_messages_with_media() wrapper):
  - After assembling the messages array, check if current_message has media attachments
  - For each MediaAttachment with media_type == IMAGE and a storage_key:
    - Generate presigned URL
    - If model supports vision: convert the user message content from a plain string to multipart format:
      [{"type": "text", "text": original_text}, {"type": "image_url", "image_url": {"url": presigned_url, "detail": "auto"}}]
    - If model does NOT support vision: append "[Image attached: {filename}]" to the text content instead
  - For DOCUMENT attachments: append "[Document attached: {filename} - {presigned_url}]" to text content (documents are text-referenced, not image_url blocks)
  - This follows the OpenAI-style multimodal message format, which LiteLLM translates for other providers such as Anthropic
2. Update `packages/orchestrator/orchestrator/agents/runner.py`:
- Ensure the messages array with multipart content blocks is passed through to the LLM call without modification
- The tool-call loop must preserve multipart content format when re-calling the LLM
- No changes needed if runner already passes messages directly to llm-pool -- just verify
3. Write test_multimodal_messages.py:
- Test: message with IMAGE MediaAttachment + vision model produces image_url content block
- Test: message with IMAGE MediaAttachment + non-vision model produces text fallback "[Image attached: ...]"
- Test: message with DOCUMENT MediaAttachment produces text reference with presigned URL
- Test: message with no media produces standard text-only content (no regression)
- Test: supports_vision returns True for "claude-3-sonnet", "gpt-4o", False for "gpt-3.5-turbo"
- Test: presigned URL has correct format and expiry (mock boto3)
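The presigned-URL helper and its mocked test could be sketched as below. The plan's signature is generate_presigned_url(storage_key, expiry=3600). The keyword-only `client` and `bucket` parameters, the bucket name `media`, and the example URL are assumptions added so the boto3 client can be swapped for a mock. The client call itself is boto3's standard `generate_presigned_url` on an S3 client.

```python
from unittest.mock import MagicMock


def generate_presigned_url(
    storage_key: str,
    expiry: int = 3600,
    *,
    client,
    bucket: str,
) -> str:
    """Return a presigned GET URL for storage_key, valid for expiry seconds.

    client is expected to be a boto3 S3 client; for MinIO it would be created
    with boto3.client("s3", endpoint_url=...) pointing at the MinIO endpoint.
    """
    return client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": storage_key},
        ExpiresIn=expiry,
    )


def test_presigned_url_defaults_to_one_hour() -> None:
    # The mock stands in for the S3 client, so no MinIO instance is needed.
    fake_s3 = MagicMock()
    fake_s3.generate_presigned_url.return_value = (
        "https://minio.example/media/t1/a1/m1/cat.png?X-Amz-Expires=3600"
    )

    url = generate_presigned_url("t1/a1/m1/cat.png", client=fake_s3, bucket="media")

    assert url.startswith("https://")
    fake_s3.generate_presigned_url.assert_called_once_with(
        "get_object",
        Params={"Bucket": "media", "Key": "t1/a1/m1/cat.png"},
        ExpiresIn=3600,
    )


if __name__ == "__main__":
    test_presigned_url_defaults_to_one_hour()
```

Passing the client in, instead of building it inside the function, keeps the test free of patching and lets the caller reuse one client across attachments.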
cd /home/adelorenzo/repos/konstruct && python -m pytest tests/unit/test_multimodal_messages.py -x -v
- Image attachments from both Slack and WhatsApp are passed to the LLM as image_url content blocks
- Vision-capable models receive image_url blocks; non-vision models get text fallback
- Document attachments are text-referenced with presigned URLs
- Presigned URLs generated from MinIO with 1-hour expiry
- No regression for text-only messages
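Taken together, the outcomes above amount to rewriting only the current user message. The sketch below is one hedged way to do that: `apply_media` is a hypothetical name, attachments are plain dicts with assumed keys (`media_type`, `filename`, `url`) in place of MediaAttachment objects, and the current user message is assumed to be last in the array.

```python
import copy
from typing import Any


def apply_media(
    messages: list[dict[str, Any]],
    attachments: list[dict[str, str]],
    vision: bool,
) -> list[dict[str, Any]]:
    """Return messages with media folded into the final user message."""
    if not attachments:
        # Text-only path: the array is returned untouched (no regression).
        return messages

    updated = copy.deepcopy(messages)  # never mutate the caller's array
    user_message = updated[-1]  # assumption: current user message is last
    text = user_message["content"]
    image_blocks: list[dict[str, Any]] = []

    for attachment in attachments:
        kind = attachment["media_type"]
        if kind == "document":
            # Documents are text-referenced, never image_url blocks.
            text += f"\n[Document attached: {attachment['filename']} - {attachment['url']}]"
        elif kind == "image" and vision:
            image_blocks.append(
                {"type": "image_url", "image_url": {"url": attachment["url"], "detail": "auto"}}
            )
        elif kind == "image":
            text += f"\n[Image attached: {attachment['filename']}]"

    if image_blocks:
        user_message["content"] = [{"type": "text", "text": text}, *image_blocks]
    else:
        user_message["content"] = text
    return updated


if __name__ == "__main__":
    base = [{"role": "user", "content": "What is in this picture?"}]
    image = {"media_type": "image", "filename": "cat.png", "url": "https://example.invalid/cat.png"}
    print(apply_media(base, [image], vision=True))
```

Content stays a plain string unless at least one image block is produced, so non-vision models and text-only messages see the same shape they did before.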
- All Phase 1 + Phase 2 plans 01-04 tests still pass: `pytest tests/ -x`
- Media tests pass: `pytest tests/unit/test_slack_media.py tests/unit/test_multimodal_messages.py -x`
- End-to-end: Slack file_share -> MinIO storage -> image_url in LLM prompt (verified via test mocks)
- End-to-end: WhatsApp image -> MinIO storage -> image_url in LLM prompt (verified via test mocks)
<success_criteria>
- Agent interprets images sent via Slack and WhatsApp using multimodal LLM capabilities
- Slack file_share events are extracted, stored in MinIO, and passed to the orchestrator
- Orchestrator routes responses to the correct channel (Slack or WhatsApp)
- Non-vision models gracefully handle media with text fallback
- Bidirectional media support: receive and interpret on both channels </success_criteria>