2025

Lydia Agentic AI Assistant and Event Delivery Platform

Built Lydia, an agentic AI assistant for fleet-telemetry analytics, and designed the durable event-delivery platform behind its multi-turn tool-calling workflows.

TypeScript, MySQL, ClickHouse, Kafka/MSK, Redis Streams, Socket.io, DuckDB, EC2, OpenTelemetry

Agent Workflow: Tool-calling SQL
Analytics Store: ClickHouse
Durable Delivery: Outbox + MSK

Figure: Durable Agent Delivery Path. A sanitized view of the control plane, event path, cancel interrupt, and observability surfaces behind Lydia.

Problem

Fleet telemetry questions often required engineers to inspect schemas, write SQL, run analysis, and turn the result into a chart or report. Lydia was built to make that workflow conversational without losing the operational controls needed around long-running agent work.

The first version used a direct agent execution path that worked for a prototype but carried production risks: browser sessions were tightly coupled to long-running work, restarts could strand state, cancel behavior was informal, and there was no durable job lifecycle to inspect or replay.

The production requirement was larger than the inference loop. The platform needed tenant-aware job ownership, idempotent submission, reconnect and resume behavior, formal terminal states, backpressure, and enough observability to explain stuck or slow jobs.

Architecture Decision

I built Lydia as a multi-turn, tool-calling agent for fleet-telemetry analytics and proposed moving its execution path from direct calls to a durable job-delivery model:

  • A gateway accepts authenticated requests, validates ownership, and writes job state to MySQL.
  • The agent inspects schema context, generates SQL, executes analytical queries over ClickHouse, and produces charts or reports from the result.
  • A transactional outbox records the handoff event in the same database transaction, avoiding a database-plus-broker dual-write gap.
  • A polling publisher delivers accepted jobs to Kafka/MSK for durable processing by the agent backend.
  • Redis Streams carry replayable live events back to connected frontend sessions.
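  • Redis Pub/Sub is used only as a fast interrupt path for cooperative cancel. Pub/Sub does not persist messages, so the cancel request itself is recorded durably in the job state.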
  • ClickHouse handles analytical SQL over fleet telemetry, while richer turn history, tool traces, and audit detail stay outside the hot control plane.
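
The outbox handoff in the list above can be sketched in a few lines. This is a minimal in-memory illustration, not Lydia's implementation: `InMemoryDb`, `transaction`, `submitJob`, the row shapes, and the `agent.jobs` topic name are all hypothetical stand-ins for the MySQL tables and transaction. The only point it makes is that the job row and the outbox row commit or roll back together, so there is never a job without a handoff event.

```typescript
// Hypothetical row shapes; the real MySQL schema is not shown in this write-up.
type JobRow = { jobId: string; tenantId: string; status: "received" | "queued" };
type OutboxRow = { eventId: number; jobId: string; topic: string; published: boolean };
type Tx = { jobs: JobRow[]; outbox: OutboxRow[] };

// In-memory stand-in for a database with all-or-nothing transactions.
class InMemoryDb {
  jobs: JobRow[] = [];
  outbox: OutboxRow[] = [];

  // Runs `work` against a scratch copy and commits only if it does not throw.
  transaction(work: (tx: Tx) => void): void {
    const tx: Tx = {
      jobs: this.jobs.map((r) => ({ ...r })),
      outbox: this.outbox.map((r) => ({ ...r })),
    };
    work(tx); // a throw here leaves this.jobs and this.outbox untouched
    this.jobs = tx.jobs;
    this.outbox = tx.outbox;
  }
}

// Writes the job and its handoff event in one transaction.
// `failMidway` simulates a crash between the two writes.
function submitJob(db: InMemoryDb, jobId: string, tenantId: string, failMidway = false): void {
  db.transaction((tx) => {
    tx.jobs.push({ jobId, tenantId, status: "received" });
    if (failMidway) throw new Error("simulated crash between the two writes");
    tx.outbox.push({ eventId: tx.outbox.length + 1, jobId, topic: "agent.jobs", published: false });
  });
}

const db = new InMemoryDb();
submitJob(db, "job-1", "tenant-a");
try {
  submitJob(db, "job-2", "tenant-a", true);
} catch {
  // Rolled back: neither the job row nor the outbox row for job-2 exists.
}
```

With a separate database write followed by a broker publish, the simulated crash would leave a job row that no worker ever hears about. Here the failed submission leaves no trace, and the client can safely retry.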

The design deliberately kept the v1 runtime on EC2 and kept inference on-prem, because the immediate problem was not worker scale. The core gap was reliable job ownership and recovery semantics around long-running agent work.

Job Lifecycle

The control plane makes the lifecycle explicit:

received -> queued -> processing -> final
                              \-> error
                              \-> cancelled
                              \-> timed_out
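
One way to enforce the lifecycle above is a transition table that rejects any move not drawn in the diagram. This is a sketch under one stated assumption: terminal states are reachable only from `processing`, as the diagram shows. Whether the real control plane also allows, for example, cancelling a `queued` job is not stated here. The names `TRANSITIONS`, `isTerminal`, and `transition` are illustrative.

```typescript
type JobState =
  | "received" | "queued" | "processing"
  | "final" | "error" | "cancelled" | "timed_out";

// Allowed transitions, read directly off the lifecycle diagram.
// Assumption: terminal states are only reachable from "processing".
const TRANSITIONS: Record<JobState, readonly JobState[]> = {
  received: ["queued"],
  queued: ["processing"],
  processing: ["final", "error", "cancelled", "timed_out"],
  final: [],
  error: [],
  cancelled: [],
  timed_out: [],
};

// A state is terminal when nothing can follow it.
function isTerminal(state: JobState): boolean {
  return TRANSITIONS[state].length === 0;
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: JobState, to: JobState): JobState {
  if (TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table as data means a late event, such as a `final` result arriving after `cancelled` was written, fails loudly instead of silently overwriting a terminal state.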

Frontend submit, reconnect, follow-up, and cancel flows all resolve through the durable job record rather than a single live socket. That lets multiple frontend sessions observe the same job and lets the gateway recover state after process restarts.

Cancel And Reconnect

Cancel is cooperative rather than preemptive. The gateway records the durable cancel request, emits a user-visible event, and sends a fast interrupt to the active agent backend. The agent checks for cancellation at safe boundaries (between stream chunks, around tool calls, at turn boundaries, and on heartbeats), then writes the terminal cancelled state.
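
The cooperative part can be illustrated with a loop that looks at a cancel flag only between steps. This is a simplified, synchronous sketch: `CancelToken`, `Step`, and `runAgent` are hypothetical names, and in the design described here the flag would be set by the interrupt path and backed by the durable cancel request.

```typescript
// Hypothetical cancel token; something outside the loop flips `requested`.
type CancelToken = { requested: boolean };

type Step = { kind: "stream_chunk" | "tool_call" | "turn_boundary"; run: () => void };

type Outcome = { state: "final" | "cancelled"; completedSteps: number };

// Runs steps in order and checks for cancellation only between steps.
// A step that has started is always allowed to finish.
function runAgent(steps: Step[], cancel: CancelToken): Outcome {
  let completed = 0;
  for (const step of steps) {
    if (cancel.requested) {
      return { state: "cancelled", completedSteps: completed };
    }
    step.run();
    completed += 1;
  }
  return { state: "final", completedSteps: completed };
}

const token: CancelToken = { requested: false };
const log: string[] = [];
const outcome = runAgent(
  [
    { kind: "stream_chunk", run: () => { log.push("chunk"); } },
    // The cancel request arrives while this tool call is running.
    { kind: "tool_call", run: () => { log.push("tool"); token.requested = true; } },
    { kind: "turn_boundary", run: () => { log.push("next turn"); } },
  ],
  token,
);
```

The tool call that was in flight completes, and the next turn never starts. That is the tradeoff of cooperative cancel: completion latency is bounded by the longest step between checkpoints, which is why cancel acknowledgement and cancel completion are worth measuring separately.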

Reconnect uses the last observed event id to replay recent stream events. If the short-lived event stream has expired, the frontend can still recover the durable job status and final summary from the control plane.
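
The reconnect decision can be sketched as a pure function over whatever the stream still retains. This sketch makes a simplifying assumption that event ids are contiguous integers, which is not how Redis Stream ids work; `StreamEvent`, `ReplayResult`, and `resume` are hypothetical names chosen for illustration.

```typescript
type StreamEvent = { id: number; jobId: string; payload: string };

type ReplayResult =
  | { kind: "replay"; events: StreamEvent[] }
  | { kind: "snapshot"; reason: string };

// `retained` is whatever the short-lived stream still holds, oldest first.
// Assumption for this sketch: ids are contiguous integers.
function resume(retained: StreamEvent[], lastSeenId: number): ReplayResult {
  const oldest = retained[0]?.id;
  // If events between lastSeenId and the oldest retained event were trimmed,
  // a replay would have a gap, so fall back to the durable job record.
  if (oldest === undefined || oldest > lastSeenId + 1) {
    return { kind: "snapshot", reason: "stream expired; read job status from control plane" };
  }
  return { kind: "replay", events: retained.filter((e) => e.id > lastSeenId) };
}

const retained: StreamEvent[] = [
  { id: 5, jobId: "job-1", payload: "tool_call started" },
  { id: 6, jobId: "job-1", payload: "tool_call finished" },
  { id: 7, jobId: "job-1", payload: "chunk" },
];

const shortGap = resume(retained, 5); // replays events 6 and 7
const longGap = resume(retained, 2);  // events 3 and 4 are gone: snapshot
```

Returning an explicit `snapshot` result, instead of a partial replay, keeps the frontend from rendering a transcript with silent holes in it.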

SLO Thinking

The key architecture decision was to separate delivery latency from model execution time. GPU inference, tools, and retrieval can be slow for legitimate reasons; the platform still needs to prove that accepted jobs move through handoff, queueing, event delivery, and terminal-state updates predictably.

The provisional delivery SLOs focused on measurements such as time to queued, event-to-browser latency, cancel acknowledgement, cancel completion after checkpoint, outbox publish lag, broker consumer lag, and terminal-state completion ratio.
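
Two of these measurements can be computed directly from job timestamps, which shows why they are independent of model execution time. The sketch below is illustrative only: the `JobTiming` fields, the nearest-rank percentile, and the sample numbers are assumptions, not Lydia's actual metrics pipeline or data.

```typescript
// Hypothetical per-job timestamps, in milliseconds.
type JobTiming = {
  receivedAtMs: number;
  queuedAtMs?: number;   // absent if the job never reached "queued"
  terminalAtMs?: number; // absent if the job never reached a terminal state
};

// Nearest-rank percentile; returns NaN for an empty sample.
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? NaN;
}

// Time to queued: how long the handoff takes, regardless of inference time.
function timeToQueuedP95(jobs: JobTiming[]): number {
  const samples: number[] = [];
  for (const j of jobs) {
    if (j.queuedAtMs !== undefined) samples.push(j.queuedAtMs - j.receivedAtMs);
  }
  return percentile(samples, 95);
}

// Share of accepted jobs that reached any terminal state.
function terminalCompletionRatio(jobs: JobTiming[]): number {
  if (jobs.length === 0) return 1; // no jobs, nothing missed
  const done = jobs.filter((j) => j.terminalAtMs !== undefined).length;
  return done / jobs.length;
}

// Made-up sample values for illustration.
const sample: JobTiming[] = [
  { receivedAtMs: 0, queuedAtMs: 40, terminalAtMs: 9000 },
  { receivedAtMs: 0, queuedAtMs: 60, terminalAtMs: 12000 },
  { receivedAtMs: 0, queuedAtMs: 250 }, // still running or stuck
  { receivedAtMs: 0, queuedAtMs: 55, terminalAtMs: 30000 },
];

const p95 = timeToQueuedP95(sample);           // 250
const ratio = terminalCompletionRatio(sample); // 0.75
```

A job that takes 30 seconds to finish does not hurt either number, but a job that never reaches a terminal state does.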

Tradeoffs

This design adds more infrastructure than a direct request/response path. The tradeoff is intentional: a broker and control plane are not necessary for a single happy-path prototype, but they become justified when the product needs durable handoff, replay, backpressure, reconnect, auditability, and restart recovery.

For v1, a polling outbox publisher was the pragmatic choice over a managed CDC connector. It keeps cost low, gives the application direct control over retry and status semantics, and is enough for a single app-owned outbox stream.
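
A single polling pass might look like the sketch below. It is an assumption-laden illustration, not the production publisher: `OutboxRow`, `Send`, `publishPending`, and the `maxAttempts` default are hypothetical, and `send` stands in for a real broker producer call. Note that marking a row only after a successful send gives at-least-once delivery, so consumers need to tolerate duplicates.

```typescript
type OutboxRow = {
  eventId: number;
  jobId: string;
  status: "pending" | "published" | "failed";
  attempts: number;
};

// Stand-in for a broker producer call; it throws on failure.
type Send = (row: OutboxRow) => void;

// One polling pass: publish pending rows in eventId order.
// Returns how many rows were published in this pass.
function publishPending(outbox: OutboxRow[], send: Send, maxAttempts = 3): number {
  const pending = outbox
    .filter((r) => r.status === "pending")
    .sort((a, b) => a.eventId - b.eventId);
  let published = 0;
  for (const row of pending) {
    try {
      send(row);
      row.status = "published"; // marked only after the broker accepted it
      published += 1;
    } catch {
      row.attempts += 1;
      if (row.attempts >= maxAttempts) row.status = "failed"; // surfaced for inspection
      break; // stop this pass; the next poll starts from the oldest pending row
    }
  }
  return published;
}

const rows: OutboxRow[] = [
  { eventId: 1, jobId: "job-1", status: "pending", attempts: 0 },
  { eventId: 2, jobId: "job-2", status: "pending", attempts: 0 },
];
let brokerDown = true;
const delivered: string[] = [];
const send: Send = (row) => {
  if (row.jobId === "job-2" && brokerDown) throw new Error("broker unavailable");
  delivered.push(row.jobId);
};

const firstPass = publishPending(rows, send);  // publishes job-1; job-2 fails once
brokerDown = false;
const secondPass = publishPending(rows, send); // retries and publishes job-2
```

Because retry counts and row status live in the application's own table, the publisher decides when a row is retried, parked as `failed`, or reported as publish lag. That is the "direct control over retry and status semantics" a managed CDC connector would not provide as directly.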