2024 - 2025

Self-Hosted LLM Inference for Lydia

Deployed vLLM and llama.cpp on on-prem GPU infrastructure behind EC2, NGINX, and Route 53, replacing hosted inference and reducing Lydia serving cost.

vLLMllama.cppNGINXRoute 53EC2PythonTypeScript

Internal Users

7 engineers

Daily Query Volume

400+

Infra Cost Reduction

$2K/month

Inference Cost Control

Normalized internal usage after moving Lydia inference to self-hosted GPU infrastructure.

S1S2S3S4S5S6

Engineer adoption7 engineers

Daily query volume400+

Infra savings$2K/mo

Self-hosted vLLM and llama.cpp serving reduced hosted GPU spend while keeping Lydia usable for daily engineering workflows.

Goal

Move Lydia's inference path off hosted GPU infrastructure while keeping latency and operational access practical for internal engineering workflows.

Implementation

I deployed a self-hosted inference engine with vLLM and llama.cpp on an on-prem GPU server behind the existing AWS edge path.

Put EC2, NGINX, and Route 53 in front of the on-prem GPU server for controlled access.
Supported both high-throughput serving and lighter local model execution paths with vLLM and llama.cpp.
Kept the inference runtime separate from Lydia's job-delivery control plane so delivery SLOs and model runtime could be measured independently.

Impact

The stack replaced RunPod, reduced serving cost by about $2K/month, and supported Lydia usage by 7 engineers at 400+ queries/day.

Future Extensions

Add feedback loops for answer-quality scoring.
Integrate chart generation for trend-heavy questions.
Support deeper comparative analytics between driver cohorts.