konstruct/.planning/phases/02-agent-features/02-05-PLAN.md at f9ce3d650ff89a776b41257159de81b26cb680c5


Adolfo Delorenzo b2e86f1046 fix(02-agent-features): revise plans based on checker feedback

2026-03-23 14:32:20 -06:00


---
phase: 02-agent-features
plan:
type: execute
wave:
depends_on:
  - "02-02"
  - "02-03"
files_modified:
  - packages/orchestrator/orchestrator/tasks.py
  - packages/orchestrator/orchestrator/agents/builder.py
  - packages/orchestrator/orchestrator/agents/runner.py
  - packages/gateway/gateway/channels/slack.py
  - tests/unit/test_multimodal_messages.py
  - tests/unit/test_slack_media.py
autonomous: true
requirements:
  - CHAN-03
must_haves:
  truths:
    - "Agent can RECEIVE images and documents and interpret them via multimodal LLM"
    - "Media attachments from WhatsApp are passed to the LLM as image_url content blocks"
    - "Media attachments from Slack file_share events are downloaded, stored in MinIO, and passed to the LLM as image_url content blocks"
    - "Orchestrator routes LLM responses back through the correct channel (Slack or WhatsApp)"
    - "Non-vision models gracefully skip image content blocks instead of erroring"
    - "Agent can SEND images and documents back to users on both Slack and WhatsApp"
  artifacts:
    - path: "packages/orchestrator/orchestrator/agents/builder.py"
      provides: "build_messages_with_media() that injects image_url content blocks for media attachments"
      exports: ["build_messages_with_media"]
    - path: "packages/orchestrator/orchestrator/tasks.py"
      provides: "Channel-aware outbound routing (Slack chat.update vs WhatsApp send_whatsapp_message)"
    - path: "tests/unit/test_multimodal_messages.py"
      provides: "Tests for image_url content block injection and vision model detection"
    - path: "tests/unit/test_slack_media.py"
      provides: "Tests for Slack file_share event extraction"
  key_links:
    - from: "packages/orchestrator/orchestrator/agents/builder.py"
      to: "MinIO presigned URLs"
      via: "generate_presigned_url for each MediaAttachment.storage_key"
      pattern: 'generate_presigned_url\|image_url'
    - from: "packages/orchestrator/orchestrator/tasks.py"
      to: "gateway/channels/whatsapp.py send_whatsapp_message"
      via: "httpx POST for WhatsApp outbound delivery"
      pattern: 'send_whatsapp_message\|channel.*whatsapp'
    - from: "packages/gateway/gateway/channels/slack.py"
      to: "MinIO storage"
      via: "Slack file download -> MinIO upload for file_share events"
      pattern: 'file_share\|files.info'
---

Wire cross-channel media support and multimodal LLM interpretation into the orchestrator pipeline. Add Slack file_share media extraction, channel-aware outbound routing (Slack vs WhatsApp), and image_url content block injection so the LLM can interpret images and documents sent by users.

Purpose: Completes the locked decisions "Agent can RECEIVE images and documents and interpret them via multimodal LLM" and "Bidirectional media support across Slack and WhatsApp." Without this plan, media is stored in MinIO but never interpreted by the LLM, and WhatsApp responses are not routed back to users.

Output: Multimodal message building, Slack media handling, channel-aware outbound routing, and passing tests.

<execution_context> @/home/adelorenzo/.claude/get-shit-done/workflows/execute-plan.md @/home/adelorenzo/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/02-agent-features/02-CONTEXT.md @.planning/phases/02-agent-features/02-RESEARCH.md @.planning/phases/02-agent-features/02-02-SUMMARY.md @.planning/phases/02-agent-features/02-03-SUMMARY.md

@packages/orchestrator/orchestrator/tasks.py @packages/orchestrator/orchestrator/agents/builder.py @packages/orchestrator/orchestrator/agents/runner.py @packages/gateway/gateway/channels/slack.py @packages/gateway/gateway/channels/whatsapp.py @packages/shared/shared/models/message.py

From packages/shared/shared/models/message.py:

MediaType(StrEnum): IMAGE, DOCUMENT, AUDIO, VIDEO
MediaAttachment(BaseModel): media_type, url, storage_key, mime_type, filename, size_bytes
MessageContent.media: list[MediaAttachment]

From packages/gateway/gateway/channels/whatsapp.py:

async send_whatsapp_message(phone_number_id, access_token, recipient_wa_id, text) -> None
async send_whatsapp_media(phone_number_id, access_token, recipient_wa_id, media_url, media_type)

From packages/orchestrator/orchestrator/agents/runner.py:

run_agent() with tool-call loop, accepts messages parameter
AuditLogger passed through the loop

From packages/orchestrator/orchestrator/agents/builder.py:

build_messages_with_memory(agent, current_message, recent_messages, relevant_context) -> list[dict]

Task 1: Slack file_share media extraction and channel-aware outbound routing

Files: packages/gateway/gateway/channels/slack.py, packages/orchestrator/orchestrator/tasks.py, tests/unit/test_slack_media.py

Behavior:

- Slack file_share events extract file URL, download via Slack API, upload to MinIO with tenant-prefixed key
- Slack file_share events populate MediaAttachment on the KonstructMessage before dispatching to orchestrator
- handle_message in tasks.py checks msg.channel to determine outbound delivery method
- channel=='slack': existing chat.update flow (unchanged)
- channel=='whatsapp': call send_whatsapp_message via httpx to gateway or directly via Meta API
- WhatsApp outbound includes phone_number_id and access_token from task extras

Action:

1. Update `packages/gateway/gateway/channels/slack.py`:
   - In the existing Slack event handler, detect `file_share` subtype events
   - For file_share events: extract file info via Slack `files.info` API call (uses bot_token)
   - Download the file content via the file's `url_private_download` (with Authorization: Bearer bot_token)
   - Upload to MinIO with key: {tenant_id}/{agent_id}/{message_id}/{filename}
   - Create MediaAttachment with media_type (infer from mime_type: image/* -> IMAGE, application/pdf -> DOCUMENT, etc.), storage_key, mime_type, filename, size_bytes
   - Attach to the KonstructMessage.content.media list before dispatching handle_message.delay()
   - Install boto3 in gateway if not already: `uv add boto3` (may already be installed from Plan 02-03)

2. Update `packages/orchestrator/orchestrator/tasks.py` -- channel-aware outbound routing:
   - After getting the LLM response text, check msg['channel'] (from the deserialized KonstructMessage)
   - If channel == 'slack': use existing chat.update flow (no change)
   - If channel == 'whatsapp': POST to send_whatsapp_message. Import and call the WhatsApp send function, or make an httpx POST to the gateway's WhatsApp send endpoint. Use phone_number_id and access_token from extras dict passed through Celery.
   - Extract a helper function: async send_response(channel, text, extras) that dispatches to the correct channel's outbound method. This keeps the main handle_message clean.
   - For media responses (when the agent generates/references files to send back): check if the LLM response contains file references, and use send_whatsapp_media or the Slack SDK's files_upload_v2 accordingly. For v1, text-only outbound is sufficient -- media outbound can be a follow-up.

3. Write test_slack_media.py:
   - Test file_share event detection (file_share subtype identified)
   - Test MediaAttachment creation from Slack file metadata (correct media_type, filename, mime_type)
   - Test MinIO upload key format: {tenant_id}/{agent_id}/{message_id}/{filename}
   - Mock httpx/boto3 calls to avoid real API hits
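
The media-type inference, storage-key format, and channel dispatch described in the steps above can be sketched roughly as follows. This is a minimal, stdlib-only illustration under stated assumptions, not the project's implementation: `infer_media_type`, `build_storage_key`, and the injected `senders` mapping are hypothetical names, the enum values are placeholders for the real `MediaType` in `packages/shared`, and falling back to DOCUMENT for unrecognized MIME types is an assumption the plan does not spell out.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable, Dict


class MediaType(str, Enum):
    # Stand-in for shared.models.message.MediaType; the string values are assumed.
    IMAGE = "image"
    DOCUMENT = "document"
    AUDIO = "audio"
    VIDEO = "video"


def infer_media_type(mime_type: str) -> MediaType:
    """Map a MIME type to a MediaType; unknown types fall back to DOCUMENT (assumption)."""
    major = mime_type.split("/", 1)[0].lower()
    by_major = {
        "image": MediaType.IMAGE,
        "audio": MediaType.AUDIO,
        "video": MediaType.VIDEO,
    }
    return by_major.get(major, MediaType.DOCUMENT)


def build_storage_key(tenant_id: str, agent_id: str, message_id: str, filename: str) -> str:
    """Build the tenant-prefixed key {tenant_id}/{agent_id}/{message_id}/{filename}."""
    # Keep only the final path component so a crafted filename cannot leave the prefix.
    safe_name = filename.replace("\\", "/").rsplit("/", 1)[-1]
    return f"{tenant_id}/{agent_id}/{message_id}/{safe_name}"


# A sender takes the response text plus the extras dict passed through Celery.
Sender = Callable[[str, dict], Awaitable[None]]


async def send_response(channel: str, text: str, extras: dict, senders: Dict[str, Sender]) -> None:
    """Dispatch outbound text to the sender registered for the message's channel."""
    try:
        sender = senders[channel]
    except KeyError:
        raise ValueError(f"unsupported channel: {channel!r}") from None
    await sender(text, extras)
```

Passing the senders in as a mapping (rather than importing the Slack and WhatsApp clients inside `send_response`) is one way to keep `handle_message` small and to let the unit tests substitute fakes for both channels without patching httpx.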

Verify: `cd /home/adelorenzo/repos/konstruct && python -m pytest tests/unit/test_slack_media.py -x -v`

Done:

- Slack file_share events produce KonstructMessages with populated media list
- Files downloaded from Slack and stored in MinIO with tenant-isolated keys
- Orchestrator routes responses to correct channel (Slack chat.update vs WhatsApp API)
- WhatsApp outbound delivery wired into orchestrator pipeline

Task 2: Multimodal LLM interpretation -- image_url content blocks for media attachments

Files: packages/orchestrator/orchestrator/agents/builder.py, packages/orchestrator/orchestrator/agents/runner.py, tests/unit/test_multimodal_messages.py

Behavior:

- build_messages_with_media detects MediaAttachment objects on the current message
- For IMAGE media: generates a MinIO presigned URL and injects an image_url content block into the user message
- For DOCUMENT media: generates a presigned URL and includes it as a text reference (PDFs cannot be image_url blocks)
- supports_vision(model_name) returns True for known vision models (claude-3*, gpt-4o*, gpt-4-vision*, gemini-pro-vision*)
- When model does NOT support vision: image_url blocks are stripped and replaced with text "[Image attached: {filename}]"
- LLM messages array uses the multipart content format: {"role": "user", "content": [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "..."}}]}
- Presigned URLs have a 1-hour expiry

Action:

1. Update `packages/orchestrator/orchestrator/agents/builder.py`:
   - Add supports_vision(model_name: str) -> bool function:
     - Returns True if model_name matches known vision-capable patterns: "claude-3" (all Claude 3+ models), "gpt-4o", "gpt-4-vision", "gemini-pro-vision", "gemini-1.5"
     - Check via LiteLLM's litellm.supports_vision(model) if available, otherwise use the pattern match above
   - Add generate_presigned_url(storage_key: str, expiry: int = 3600) -> str function:
     - Uses boto3 S3 client with MinIO endpoint to generate a presigned GET URL
     - Expiry defaults to 1 hour
   - Update build_messages_with_memory() (or create build_messages_with_media() wrapper):
     - After assembling the messages array, check if current_message has media attachments
     - For each MediaAttachment with media_type == IMAGE and a storage_key:
       - Generate presigned URL
       - If model supports vision: convert the user message content from a plain string to multipart format: [{"type": "text", "text": original_text}, {"type": "image_url", "image_url": {"url": presigned_url, "detail": "auto"}}]
       - If model does NOT support vision: append "[Image attached: {filename}]" to the text content instead
     - For DOCUMENT attachments: append "[Document attached: {filename} - {presigned_url}]" to text content (documents are text-referenced, not image_url blocks)
     - This follows the OpenAI/Anthropic multimodal message format that LiteLLM normalizes across providers

2. Update `packages/orchestrator/orchestrator/agents/runner.py`:
   - Ensure the messages array with multipart content blocks is passed through to the LLM call without modification
   - The tool-call loop must preserve multipart content format when re-calling the LLM
   - No changes needed if runner already passes messages directly to llm-pool -- just verify

3. Write test_multimodal_messages.py:
   - Test: message with IMAGE MediaAttachment + vision model produces image_url content block
   - Test: message with IMAGE MediaAttachment + non-vision model produces text fallback "[Image attached: ...]"
   - Test: message with DOCUMENT MediaAttachment produces text reference with presigned URL
   - Test: message with no media produces standard text-only content (no regression)
   - Test: supports_vision returns True for "claude-3-sonnet", "gpt-4o", False for "gpt-3.5-turbo"
   - Test: presigned URL has correct format and expiry (mock boto3)
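
The vision check and content-block injection from step 1 can be sketched as below. This is an illustrative, stdlib-only version under stated assumptions: `build_user_content` and the injected `presign` callable are hypothetical (the plan's `generate_presigned_url` would be passed in, or mocked in tests), attachments are plain dicts standing in for `MediaAttachment`, only the substring fallback of `supports_vision` is shown, and skipping attachments without a `storage_key` and joining notes with newlines are assumptions.

```python
from typing import Callable, List, Union

# Substring patterns taken from the plan. The plan prefers LiteLLM's own
# capability lookup when available; only the pattern fallback is sketched here.
VISION_PATTERNS = ("claude-3", "gpt-4o", "gpt-4-vision", "gemini-pro-vision", "gemini-1.5")


def supports_vision(model_name: str) -> bool:
    """Return True if the model name contains a known vision-capable pattern."""
    name = model_name.lower()
    return any(pattern in name for pattern in VISION_PATTERNS)


def build_user_content(
    text: str,
    media: List[dict],
    model_name: str,
    presign: Callable[[str], str],
) -> Union[str, List[dict]]:
    """Build the `content` field of the user message for the current turn."""
    notes: List[str] = []
    image_blocks: List[dict] = []
    vision = supports_vision(model_name)

    for item in media:
        if not item.get("storage_key"):
            continue  # nothing stored in MinIO, so nothing to reference (assumption)
        if item["media_type"] == "image":
            if vision:
                url = presign(item["storage_key"])
                image_blocks.append(
                    {"type": "image_url", "image_url": {"url": url, "detail": "auto"}}
                )
            else:
                # Non-vision models get a text placeholder instead of an image block.
                notes.append(f"[Image attached: {item['filename']}]")
        elif item["media_type"] == "document":
            # Documents are always text-referenced, never sent as image_url blocks.
            url = presign(item["storage_key"])
            notes.append(f"[Document attached: {item['filename']} - {url}]")

    full_text = "\n".join([text, *notes]) if notes else text
    if not image_blocks:
        return full_text  # text-only messages keep the plain string format
    return [{"type": "text", "text": full_text}, *image_blocks]
```

Returning a plain string whenever no image block is produced keeps text-only turns byte-for-byte identical to the existing format, which is what the "no regression" test relies on.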

Verify: `cd /home/adelorenzo/repos/konstruct && python -m pytest tests/unit/test_multimodal_messages.py -x -v`

Done:

- Media attachments from both Slack and WhatsApp are passed to the LLM as image_url content blocks
- Vision-capable models receive image_url blocks; non-vision models get text fallback
- Document attachments are text-referenced with presigned URLs
- Presigned URLs generated from MinIO with 1-hour expiry
- No regression for text-only messages

Verification:

- All Phase 1 + Phase 2 plans 01-04 tests still pass: `pytest tests/ -x`
- Media tests pass: `pytest tests/unit/test_slack_media.py tests/unit/test_multimodal_messages.py -x`
- End-to-end: Slack file_share -> MinIO storage -> image_url in LLM prompt (verified via test mocks)
- End-to-end: WhatsApp image -> MinIO storage -> image_url in LLM prompt (verified via test mocks)

<success_criteria>

- Agent interprets images sent via Slack and WhatsApp using multimodal LLM capabilities
- Slack file_share events are extracted, stored in MinIO, and passed to the orchestrator
- Orchestrator routes responses to the correct channel (Slack or WhatsApp)
- Non-vision models gracefully handle media with text fallback
- Bidirectional media support: receive and interpret on both channels

</success_criteria>

After completion, create `.planning/phases/02-agent-features/02-05-SUMMARY.md`
