2024 - 2025
Self-Hosted LLM Inference for Lydia
Deployed vLLM and llama.cpp on on-prem GPU infrastructure behind EC2, NGINX, and Route 53, replacing hosted inference and reducing Lydia serving cost.
Internal Users
7 engineers
Daily Query Volume
400+
Infra Cost Reduction
$2K/month
Inference Cost Control
Normalized internal usage after moving Lydia inference to self-hosted GPU infrastructure.
Self-hosted vLLM and llama.cpp serving reduced hosted GPU spend while keeping Lydia usable for daily engineering workflows.
Goal
Move Lydia's inference path off hosted GPU infrastructure while keeping latency and operational access practical for internal engineering workflows.
Implementation
I deployed a self-hosted inference engine with vLLM and llama.cpp on an on-prem GPU server behind the existing AWS edge path.
- Put EC2, NGINX, and Route 53 in front of the on-prem GPU server for controlled access.
- Supported both high-throughput serving and lighter local model execution paths with vLLM and llama.cpp.
- Kept the inference runtime separate from Lydia's job-delivery control plane so delivery SLOs and model runtime could be measured independently.
Impact
The stack replaced RunPod, reduced serving cost by about $2K/month, and supported Lydia usage by 7 engineers at 400+ queries/day.
Future Extensions
- Add feedback loops for answer-quality scoring.
- Integrate chart generation for trend-heavy questions.
- Support deeper comparative analytics between driver cohorts.