Phase 2: Agent Features Verification Report
Phase Goal: The AI employee maintains conversation memory, can execute tools, handles WhatsApp messages, and escalates to humans when rules trigger, making it a capable product rather than a demo.

Verified: 2026-03-24T01:18:24Z

Status: human_needed

Re-verification: Yes, after gap closure (Plan 02-06)
Goal Achievement
Observable Truths (from ROADMAP.md Success Criteria)
| # | Truth | Status | Evidence |
|---|---|---|---|
| 1 | Agent remembers context from earlier in the same conversation (30+ turns without degradation) | VERIFIED | Redis sliding window (RPUSH/LTRIM, 20-msg default) in short_term.py; pgvector HNSW retrieval in long_term.py; both wired in _process_message via get_recent_messages + retrieve_relevant + build_messages_with_memory. No regression. |
| 2 | A user can send a WhatsApp message to the AI employee and receive a reply (per-tenant phone isolation + Meta 2026 scoping) | VERIFIED | Inbound pipeline complete. Outbound routing now wired: _send_response called at lines 355, 395, 438, 556 in _process_message; _update_slack_placeholder only called inside _send_response (line 722). handle_message pops phone_number_id and bot_token before model_validate (lines 223-224); wa_id extracted from msg.sender.user_id and injected into extras dict (lines 234-244). |
| 3 | Agent can invoke a registered tool and incorporate the result into its response | VERIFIED | Tool registry with 4 built-ins, execute_tool with JSON Schema validation, multi-turn loop (max 5 iterations) all wired in runner.py; _process_message builds tool_registry and passes to run_agent. No regression. |
| 4 | When escalation rule triggers, conversation and full context are handed off to human with no information lost | VERIFIED | check_escalation_rules and escalate_to_human imported at module level (line 71); pre-check at lines 386-396 (Redis escalation_status_key check before LLM call); post-check at lines 502-528 (check_escalation_rules called after run_agent, escalate_to_human called when rule matches and escalation_assignee is set). _build_conversation_metadata helper provides billing-keyword metadata. |
| 5 | Every LLM call, tool invocation, and handoff event is recorded in an immutable audit trail queryable by tenant | VERIFIED | AuditLogger initialized at line 375; log_llm_call per LLM iteration in runner.py; log_tool_call in execute_tool; log_escalation called inside escalate_to_human. audit_events table has REVOKE UPDATE, DELETE. No regression. |
Score: 5/5 truths verified
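
The sliding-window behaviour cited as evidence for truth 1 (RPUSH followed by LTRIM, 20-message default) can be illustrated with a minimal sketch. This is not the code in short_term.py: the function signatures, the key name, and the `InMemoryRedis` test double are assumptions made so the example runs without a Redis server. Only the `rpush`, `ltrim`, and `lrange` method names mirror real redis-py client calls.

```python
import json
from collections import defaultdict


class InMemoryRedis:
    """Hypothetical test double for the three Redis list commands used below."""

    def __init__(self):
        self._lists = defaultdict(list)

    def _bounds(self, key, start, end):
        # Redis list indices are inclusive and may be negative (-1 is the last item).
        n = len(self._lists[key])
        lo = max(n + start, 0) if start < 0 else start
        hi = max(n + end, -1) if end < 0 else end
        return lo, hi + 1

    def rpush(self, key, value):
        self._lists[key].append(value)
        return len(self._lists[key])

    def ltrim(self, key, start, end):
        lo, hi = self._bounds(key, start, end)
        self._lists[key] = self._lists[key][lo:hi]
        return True

    def lrange(self, key, start, end):
        lo, hi = self._bounds(key, start, end)
        return list(self._lists[key][lo:hi])


def append_message(client, key, message, window=20):
    """Append one message, then trim the list to the newest `window` entries."""
    client.rpush(key, json.dumps(message))
    client.ltrim(key, -window, -1)


def get_recent_messages(client, key):
    """Return the retained window, oldest first."""
    return [json.loads(raw) for raw in client.lrange(key, 0, -1)]
```

With a 20-message window, older turns fall out of the list, which is why the report pairs this store with pgvector retrieval for longer conversations.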
Required Artifacts
| Artifact | Expected | Status | Details |
|---|---|---|---|
| `packages/orchestrator/orchestrator/memory/short_term.py` | Redis sliding window | VERIFIED | No change from initial verification |
| `packages/orchestrator/orchestrator/memory/long_term.py` | pgvector HNSW retrieval | VERIFIED | No change from initial verification |
| `packages/orchestrator/orchestrator/tools/registry.py` | `ToolDefinition` + `BUILTIN_TOOLS` | VERIFIED | No change from initial verification |
| `packages/orchestrator/orchestrator/tools/executor.py` | Schema-validated tool execution | VERIFIED | No change from initial verification |
| `packages/orchestrator/orchestrator/audit/logger.py` | Immutable audit event writer | VERIFIED | No change from initial verification |
| `packages/orchestrator/orchestrator/escalation/handler.py` | Escalation rule evaluation + DM delivery | VERIFIED (was ORPHANED) | Now called from `_process_message` pre-check and post-check |
| `packages/orchestrator/orchestrator/agents/builder.py` | `build_system_prompt` with WhatsApp tier-2 scoping | VERIFIED (was MISSING) | `channel` parameter added to `build_system_prompt`, `build_messages_with_memory`, `build_messages_with_media`; scoping appended at line 187 |
| `packages/orchestrator/orchestrator/tasks.py` | Escalation wiring + channel-aware outbound routing | VERIFIED (was BROKEN) | `check_escalation_rules` + `escalate_to_human` at module-level import and called in pipeline; `_send_response` used at all delivery points; `handle_message` pops WhatsApp extras |
| `packages/gateway/gateway/channels/whatsapp.py` | WhatsApp webhook handler + outbound | VERIFIED | No change from initial verification |
| `tests/unit/test_pipeline_wiring.py` | 26 tests covering all three gap fixes | VERIFIED | File exists (773 lines), 26 test functions confirmed |
Key Link Verification
| From | To | Via | Status | Details |
|---|---|---|---|---|
| `tasks.py` | `memory/short_term.py` | `get_recent_messages` + `append_message` in `_process_message` | WIRED | Lines 451, 566, 567. No regression. |
| `agents/builder.py` | `memory/long_term.py` | `retrieve_relevant` + `build_messages_with_memory` | WIRED | Lines 462, 477. No regression. |
| `tasks.py` | `embed_and_store` Celery task | `embed_and_store.delay()` after response | WIRED | Line 576. No regression. |
| `agents/runner.py` | `tools/executor.py` | Tool-call loop | WIRED | No change. |
| `tasks.py` | `audit/logger.py` | `AuditLogger` passed to `run_agent` | WIRED | Line 375. No regression. |
| `tasks.py` | `escalation/handler.py` | `check_escalation_rules` + `escalate_to_human` in `_process_message` | WIRED (was NOT WIRED) | Module-level import line 71; pre-check lines 386-396; post-check lines 504-528 |
| `tasks.py` | `_send_response` | Called at all response delivery points in `_process_message` | WIRED (was NOT WIRED) | Lines 355, 395, 438, 556. `_update_slack_placeholder` only inside `_send_response` (line 722). |
| `agents/builder.py` | `Agent.tool_assignments` | `build_system_prompt(agent, channel="whatsapp")` appends scoping | WIRED (was MISSING) | Line 183-190: iterates `tool_assignments`, appends "You only handle" clause |
| `tasks.py` | `build_messages_with_memory` | Passes `str(msg.channel)` as `channel` parameter | WIRED (was MISSING) | Line 482 |
| `whatsapp.py` | `normalize.py` | `normalize_whatsapp_event` called after HMAC verification | WIRED | No change. |
| `whatsapp.py` | `handle_message.delay` | Dispatched after normalization with extras | WIRED | No change. |
Requirements Coverage
| Requirement | Source Plan | Description | Status | Evidence |
|---|---|---|---|---|
| AGNT-02 | 02-01 | Agent maintains conversational memory within sessions (sliding window) | SATISFIED | Redis sliding window fully wired; no regression |
| AGNT-03 | 02-01 | Agent retrieves relevant past context via vector search | SATISFIED | pgvector retrieval wired; no regression |
| AGNT-04 | 02-02 | Agent can invoke registered tools | SATISFIED | 4 built-in tools, multi-turn loop wired; no regression |
| AGNT-05 | 02-04, 02-06 | Agent escalates to human when configured rules trigger | SATISFIED | Pre-check + post-check now wired; escalate_to_human called when rule matches and assignee configured |
| AGNT-06 | 02-02, 02-06 | Every agent action logged in audit trail | SATISFIED | LLM calls, tool calls, and escalation events all logged; audit_events immutable |
| CHAN-03 | 02-03, 02-05, 02-06 | User can interact via WhatsApp Business Cloud API | SATISFIED | Inbound fully wired; outbound now routes via _send_response → send_whatsapp_message; handle_message pops WhatsApp extras correctly |
| CHAN-04 | 02-03, 02-06 | WhatsApp adapter enforces business-function scoping per Meta 2026 policy | SATISFIED | Tier-1 (keyword gate + canned reply in gateway) verified previously; Tier-2 (system prompt scoping) now implemented in builder.py line 182-190 |
All 7 required requirements satisfied. No orphaned requirements.
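
For AGNT-04, the evidence describes a multi-turn tool loop capped at 5 iterations with schema-validated execution. The sketch below shows that general shape only. The message and registry dictionaries, the scripted `llm` callable, and `validate_args` (a deliberately simplified stand-in for real JSON Schema validation) are all assumptions for illustration, not the contents of runner.py or executor.py.

```python
MAX_TOOL_ITERATIONS = 5  # the cap stated in the report

_JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_args(schema, args):
    """Simplified stand-in for JSON Schema validation: required keys and scalar types only."""
    for key in schema.get("required", []):
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, value in args.items():
        expected = schema.get("properties", {}).get(key, {}).get("type")
        if expected in _JSON_TYPES and not isinstance(value, _JSON_TYPES[expected]):
            raise ValueError(f"argument {key!r} should be of type {expected}")


def execute_tool(registry, name, args):
    """Validate the arguments against the tool's schema, then call its handler."""
    tool = registry[name]
    validate_args(tool["parameters"], args)
    return tool["handler"](**args)


def run_agent(llm, registry, messages, max_iterations=MAX_TOOL_ITERATIONS):
    """Call the model until it answers without requesting a tool, or the cap is hit."""
    for _ in range(max_iterations):
        reply = llm(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]
        result = execute_tool(registry, call["name"], call["arguments"])
        # Feed the tool result back so the next model call can incorporate it.
        messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    return "I could not finish this request within the tool-call limit."
```

The iteration cap bounds cost and latency if the model keeps requesting tools; validation before dispatch keeps malformed arguments from reaching a handler.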
Anti-Patterns Found
| File | Line | Pattern | Severity | Impact |
|---|---|---|---|---|
| `packages/orchestrator/orchestrator/tasks.py` | 677 | `_execute_pending_tool` returns stub: "Full tool execution will be implemented in Phase 3 with per-tenant OAuth." | Warning | Confirmed tool execution after user approval is deferred to Phase 3; this is an acknowledged deviation, not a regression |
No new anti-patterns introduced by Plan 02-06.
Human Verification Required
1. WhatsApp End-to-End Delivery
Test: Configure a WhatsApp-connected tenant, send a WhatsApp message, wait for LLM response.
Expected: The AI employee's reply appears in the WhatsApp conversation thread.
Why human: Requires real Meta Cloud API credentials, a registered phone_number_id, and live webhook traffic. Static analysis confirms the outbound path is now wired (_send_response calls send_whatsapp_message with correct parameters), but delivery cannot be verified without live infrastructure.
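
The channel-aware routing that static analysis confirmed can be pictured roughly as follows. This is a sketch under stated assumptions: the real `_send_response` signature is not shown in this report, so the `senders` mapping, the positional arguments, and the `placeholder_ts` key are hypothetical. Only the extras names `phone_number_id`, `wa_id`, and `bot_token` come from the report.

```python
def send_response(channel, text, extras, senders):
    """Dispatch a reply to the sender registered for the message's channel.

    `senders` maps a channel name to a callable; in the real pipeline these
    would be functions such as send_whatsapp_message (assumed shape).
    """
    if channel == "whatsapp":
        senders["whatsapp"](extras["phone_number_id"], extras["wa_id"], text)
    elif channel == "slack":
        # `placeholder_ts` is a hypothetical key identifying the message to update.
        senders["slack"](extras["bot_token"], extras["placeholder_ts"], text)
    else:
        raise ValueError(f"unsupported channel: {channel}")
```

Routing every delivery point through one function is what removes the Slack-only assumption from the pipeline: a new channel needs one new branch rather than edits at each call site.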
2. Escalation DM Delivery
Test: Configure an agent with escalation_assignee (Slack user ID) and a billing escalation rule. Send multiple messages containing billing keywords (e.g., "billing", "invoice", "refund") to trigger the rule.
Expected: The configured Slack user receives a DM with the full conversation transcript. Subsequent messages receive the assistant-mode reply without LLM processing.
Why human: Requires a live Slack workspace, valid bot token, valid escalation_assignee user ID, and triggering the keyword threshold. The pre-check and post-check wiring is verified in code and unit tests, but end-to-end delivery requires the Slack API.
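
The pre-check and post-check flow being exercised here can be sketched in a few lines. Everything below is illustrative rather than the code in tasks.py or handler.py: the rule shape (`field` plus `threshold`), the `state` dict standing in for the Redis escalation status key, the canned reply text, and the function signatures are assumptions. The keyword list reuses the examples given in the test description.

```python
BILLING_KEYWORDS = ("billing", "invoice", "refund")  # examples from the test description
ESCALATED_REPLY = "A team member is handling this conversation."  # hypothetical canned text


def build_conversation_metadata(history):
    """Count how many messages mention at least one billing keyword."""
    hits = sum(any(k in m.lower() for k in BILLING_KEYWORDS) for m in history)
    return {"billing_keyword_count": hits}


def check_escalation_rules(rules, metadata):
    """Return the first rule whose threshold is met, or None."""
    for rule in rules:
        if metadata.get(rule["field"], 0) >= rule["threshold"]:
            return rule
    return None


def process_message(text, state, agent, run_agent, escalate_to_human):
    # Pre-check: once escalated, reply without calling the LLM at all.
    if state.get("escalation_status") == "escalated":
        return ESCALATED_REPLY

    state["history"].append(text)
    reply = run_agent(text)

    # Post-check: evaluate rules after the agent has answered.
    metadata = build_conversation_metadata(state["history"])
    rule = check_escalation_rules(agent["escalation_rules"], metadata)
    if rule is not None and agent.get("escalation_assignee"):
        # Hand over the full transcript so no context is lost.
        escalate_to_human(agent["escalation_assignee"], list(state["history"]))
        state["escalation_status"] = "escalated"
    return reply
```

In this sketch the message that trips the rule still receives its LLM reply, and only subsequent messages get the canned response, which matches the expected behaviour described above.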
3. WhatsApp Business-Function Scoping (Tier 2) — Behavioural Compliance
Test: Configure an agent with tool_assignments = ["customer support", "billing inquiries"]. Send a borderline off-topic message via WhatsApp (e.g., "Can you help me write a poem?").
Expected: The LLM system prompt contains "You only handle: customer support, billing inquiries" and the response redirects the user to allowed topics.
Why human: The system prompt injection is statically verified (build_system_prompt appends the clause at line 187 when channel == "whatsapp" and tool_assignments is non-empty). LLM behavioural compliance with that constraint requires a live inference call.
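
The statically verified part of this check, the prompt injection itself, can be approximated as below. The `Agent` dataclass and its `system_prompt` field are assumptions for illustration; the report confirms only that `build_system_prompt` takes a `channel` parameter, reads `tool_assignments`, and appends a "You only handle" clause for WhatsApp.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical minimal agent record; the real model has more fields."""
    system_prompt: str
    tool_assignments: list = field(default_factory=list)


def build_system_prompt(agent, channel=None):
    """Append the business-function scope only for WhatsApp agents with assignments."""
    prompt = agent.system_prompt
    if channel == "whatsapp" and agent.tool_assignments:
        prompt += "\n\nYou only handle: " + ", ".join(agent.tool_assignments) + "."
    return prompt
```

A unit test can assert the clause is present for `channel="whatsapp"` and absent otherwise; whether the model then actually redirects an off-topic request is the part that still needs a live inference call.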
Gaps Summary
All three gaps from the initial verification are confirmed closed in the actual codebase:
- Escalation wiring (AGNT-05): `check_escalation_rules` and `escalate_to_human` are imported at module level (line 71) and called from `_process_message`. Pre-check gates already-escalated conversations at lines 386-396. Post-check evaluates rules after `run_agent` at lines 504-528. `_build_conversation_metadata` provides billing-keyword metadata.
- WhatsApp outbound routing (CHAN-03): `_send_response` is called at all four response delivery points (lines 355, 395, 438, 556). No direct `_update_slack_placeholder` calls remain in `_process_message`; the only call is inside `_send_response` itself (line 722). `handle_message` pops `phone_number_id` and `bot_token` before `model_validate` and injects `wa_id` into the extras dict.
- Tier-2 WhatsApp system prompt scoping (CHAN-04): `build_system_prompt` accepts a `channel` parameter and appends the business-function constraint at line 182-190. `build_messages_with_memory` and `build_messages_with_media` pass `channel` through. `_process_message` passes `str(msg.channel)` at line 482.
No regressions detected in previously-verified truths (memory pipeline, tool execution, audit logging).
Remaining open items are behavioural and require human verification with live infrastructure.
Verified: 2026-03-24T01:18:24Z

Verifier: Claude (gsd-verifier)

Re-verification: Yes, after Plan 02-06 gap closure