Relay
Protocol infrastructure for local AI models
The agent should never know it's talking to a local model.
Problem
Local AI inference servers expose APIs that differ from the OpenAI and Anthropic standards most agent frameworks expect. Developers who want to run agents against local models face broken streaming, missing tool calls, and unpredictable response formats. This fragmentation makes local-first AI infrastructure harder to adopt.
Solution
Relay sits between agent frameworks and local inference servers as a protocol translation layer. It accepts standard OpenAI/Anthropic-shaped requests and translates them into the native formats expected by servers like llama.cpp, Ollama, and other local backends. Streaming SSE chunks, tool-call deltas, and structured output modes are mapped with high fidelity so the agent never knows it is talking to a local model.
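The core translation step can be sketched as a pure function from an OpenAI-shaped request to a backend's native shape. The sketch below targets Ollama's /api/chat format; the interface names and the exact options mapping are illustrative, not Relay's actual API.

```typescript
// Sketch: translating an OpenAI-style chat request into the shape
// Ollama's /api/chat endpoint expects. The OpenAI-side fields follow the
// public Chat Completions schema; the options mapping is an assumption.

interface OpenAIChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  max_tokens?: number;
  stream?: boolean;
}

interface OllamaChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  stream: boolean;
  options?: { temperature?: number; num_predict?: number };
}

function toOllamaChat(req: OpenAIChatRequest): OllamaChatRequest {
  return {
    model: req.model,
    messages: req.messages, // role/content shapes line up directly
    stream: req.stream ?? false,
    options: {
      temperature: req.temperature,
      num_predict: req.max_tokens, // Ollama's name for the token cap
    },
  };
}
```

Keeping translation as a pure function per backend is what makes the adapters independently testable.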
What I Built
I designed Relay as a modular Node.js gateway with separate translation layers per backend. It supports SSE streaming passthrough, function-call serialization that matches both OpenAI tool_call and Anthropic tool_use conventions, and a unified request pipeline that normalises errors and timeouts across servers. The architecture prioritises low latency and minimal overhead so local inference speed is preserved. The gateway includes SDK smoke tests, deployment docs, and a clean configuration surface for developers who want to switch between local and cloud models without changing their agent code.
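The SSE streaming passthrough described above amounts to re-framing each native stream chunk as an OpenAI-style delta event. A minimal sketch, assuming Ollama's newline-delimited JSON streaming format on the input side (the SSE envelope fields shown are illustrative):

```typescript
// Sketch: re-emitting one Ollama streaming chunk (newline-delimited JSON)
// as an OpenAI-style SSE "chat.completion.chunk" event.

interface OllamaChunk {
  message?: { role: string; content: string };
  done: boolean;
}

function toOpenAISSE(line: string, id: string, model: string): string {
  const chunk: OllamaChunk = JSON.parse(line);
  if (chunk.done) return "data: [DONE]\n\n"; // OpenAI's stream terminator
  const event = {
    id,
    object: "chat.completion.chunk",
    model,
    choices: [
      {
        index: 0,
        delta: { content: chunk.message?.content ?? "" },
        finish_reason: null,
      },
    ],
  };
  return `data: ${JSON.stringify(event)}\n\n`;
}
```

Because each chunk maps one-to-one, the gateway can forward events as they arrive without buffering the whole response, which is how the latency budget is preserved.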
Technical & Product Focus
Protocol translation between OpenAI/Anthropic schemas and local inference backends.
Streaming SSE relay with buffered tool-call accumulation.
Modular backend adapters.
Node.js, TypeScript, Express-style middleware pipeline.
SDK smoke tests and deployment documentation.
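Buffered tool-call accumulation is needed because OpenAI-style streams deliver a tool call in pieces: the function name and the JSON arguments arrive as string fragments across many chunks, keyed by index. A sketch of the accumulator (type names are illustrative):

```typescript
// Sketch: accumulating streamed OpenAI tool_call deltas into complete calls.
// The `arguments` field arrives as JSON string fragments that must be
// concatenated before the JSON can be parsed.

interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface ToolCall {
  id: string;
  name: string;
  arguments: string;
}

class ToolCallBuffer {
  private calls = new Map<number, ToolCall>();

  add(delta: ToolCallDelta): void {
    const call =
      this.calls.get(delta.index) ?? { id: "", name: "", arguments: "" };
    if (delta.id) call.id = delta.id;
    if (delta.function?.name) call.name += delta.function.name;
    if (delta.function?.arguments) call.arguments += delta.function.arguments;
    this.calls.set(delta.index, call);
  }

  finish(): ToolCall[] {
    // Sort by index so call order matches the stream
    return [...this.calls.entries()]
      .sort(([a], [b]) => a - b)
      .map(([, call]) => call);
  }
}
```

The same buffer can then be serialized back out in either the OpenAI tool_call or Anthropic tool_use shape.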
Next Steps
Open-source release with initial Ollama and llama.cpp backend support. Plugin-style backend registration so the community can contribute adapters for new local servers. Lightweight dashboard for observability into translation latency and error rates. Multi-model routing with fallback chains.
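Plugin-style backend registration could look like the sketch below: adapters conform to a small interface and register themselves by name. The interface and function names are hypothetical, not Relay's published API.

```typescript
// Sketch: a minimal plugin registry for community-contributed backend
// adapters. Names (BackendAdapter, registerBackend, getBackend) are
// illustrative assumptions.

interface BackendAdapter {
  name: string;
  translateRequest(req: unknown): unknown; // OpenAI-shaped -> native
  translateChunk(chunk: string): string; // native stream -> SSE delta
}

const registry = new Map<string, BackendAdapter>();

function registerBackend(adapter: BackendAdapter): void {
  if (registry.has(adapter.name)) {
    throw new Error(`backend already registered: ${adapter.name}`);
  }
  registry.set(adapter.name, adapter);
}

function getBackend(name: string): BackendAdapter {
  const adapter = registry.get(name);
  if (!adapter) throw new Error(`unknown backend: ${name}`);
  return adapter;
}
```

Registering by name keeps routing (including the fallback chains mentioned above) a simple lookup, and lets third-party adapters ship as ordinary npm packages.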