Relay
Protocol infrastructure for local AI models
The agent should never know it's talking to a local model.
Problem
Local AI inference servers expose APIs that differ from the OpenAI and Anthropic standards most agent frameworks expect. Developers who want to run agents against local models face broken streaming, missing tool calls, and unpredictable response formats. This fragmentation makes local-first AI infrastructure harder to adopt.
Solution
Relay sits between agent frameworks and local inference servers as a protocol translation layer. It accepts standard OpenAI/Anthropic-shaped requests and translates them into the native formats expected by servers like llama.cpp, Ollama, and other local backends. Streaming SSE chunks, tool-call deltas, and structured output modes are mapped with high fidelity so the agent never knows it is talking to a local model.
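The core translation step can be sketched as a pure function from an OpenAI-shaped request to a backend's native shape. The sketch below targets Ollama's /api/chat format; the interface names and the exact options mapping are illustrative, not Relay's actual API.

```typescript
// Sketch: translating an OpenAI-style chat request into the shape
// Ollama's /api/chat endpoint expects. The OpenAI-side fields follow the
// public Chat Completions schema; the options mapping is an assumption.

interface OpenAIChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  max_tokens?: number;
  stream?: boolean;
}

interface OllamaChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  stream: boolean;
  options?: { temperature?: number; num_predict?: number };
}

function toOllamaChat(req: OpenAIChatRequest): OllamaChatRequest {
  return {
    model: req.model,
    messages: req.messages, // role/content shapes line up directly
    stream: req.stream ?? false,
    options: {
      temperature: req.temperature,
      num_predict: req.max_tokens, // Ollama's name for the token cap
    },
  };
}
```

Keeping translation as a pure function per backend is what makes the adapters independently testable.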
What I Built
I designed Relay as a modular Node.js gateway with separate translation layers per backend. It supports SSE streaming passthrough, function-call serialization that matches both OpenAI tool_call and Anthropic tool_use conventions, and a unified request pipeline that normalises errors and timeouts across servers. The architecture prioritises low latency and minimal overhead so local inference speed is preserved. The gateway includes SDK smoke tests, deployment docs, and a clean configuration surface for developers who want to switch between local and cloud models without changing their agent code.
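The SSE streaming passthrough described above amounts to re-framing each native stream chunk as an OpenAI-style delta event. A minimal sketch, assuming Ollama's newline-delimited JSON streaming format on the input side (the SSE envelope fields shown are illustrative):

```typescript
// Sketch: re-emitting one Ollama streaming chunk (newline-delimited JSON)
// as an OpenAI-style SSE "chat.completion.chunk" event.

interface OllamaChunk {
  message?: { role: string; content: string };
  done: boolean;
}

function toOpenAISSE(line: string, id: string, model: string): string {
  const chunk: OllamaChunk = JSON.parse(line);
  if (chunk.done) return "data: [DONE]\n\n"; // OpenAI's stream terminator
  const event = {
    id,
    object: "chat.completion.chunk",
    model,
    choices: [
      {
        index: 0,
        delta: { content: chunk.message?.content ?? "" },
        finish_reason: null,
      },
    ],
  };
  return `data: ${JSON.stringify(event)}\n\n`;
}
```

Because each chunk maps one-to-one, the gateway can forward events as they arrive without buffering the whole response, which is how the latency budget is preserved.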
Technical & Product Focus
Protocol translation between OpenAI/Anthropic schemas and local inference backends.
Streaming SSE relay with buffered tool-call accumulation.
Modular backend adapters.
Node.js, TypeScript, Express-style middleware pipeline.
SDK smoke tests and deployment documentation.
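Buffered tool-call accumulation is needed because OpenAI-style streams deliver a tool call in pieces: the function name and the JSON arguments arrive as string fragments across many chunks, keyed by index. A sketch of the accumulator (type names are illustrative):

```typescript
// Sketch: accumulating streamed OpenAI tool_call deltas into complete calls.
// The `arguments` field arrives as JSON string fragments that must be
// concatenated before the JSON can be parsed.

interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface ToolCall {
  id: string;
  name: string;
  arguments: string;
}

class ToolCallBuffer {
  private calls = new Map<number, ToolCall>();

  add(delta: ToolCallDelta): void {
    const call =
      this.calls.get(delta.index) ?? { id: "", name: "", arguments: "" };
    if (delta.id) call.id = delta.id;
    if (delta.function?.name) call.name += delta.function.name;
    if (delta.function?.arguments) call.arguments += delta.function.arguments;
    this.calls.set(delta.index, call);
  }

  finish(): ToolCall[] {
    // Sort by index so call order matches the stream
    return [...this.calls.entries()]
      .sort(([a], [b]) => a - b)
      .map(([, call]) => call);
  }
}
```

The same buffer can then be serialized back out in either the OpenAI tool_call or Anthropic tool_use shape.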
Next Steps
Open-source release with initial Ollama and llama.cpp backend support. Plugin-style backend registration so the community can contribute adapters for new local servers. Lightweight dashboard for observability into translation latency and error rates. Multi-model routing with fallback chains.
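Plugin-style backend registration could look like the sketch below: adapters conform to a small interface and register themselves by name. The interface and function names are hypothetical, not Relay's published API.

```typescript
// Sketch: a minimal plugin registry for community-contributed backend
// adapters. Names (BackendAdapter, registerBackend, getBackend) are
// illustrative assumptions.

interface BackendAdapter {
  name: string;
  translateRequest(req: unknown): unknown; // OpenAI-shaped -> native
  translateChunk(chunk: string): string; // native stream -> SSE delta
}

const registry = new Map<string, BackendAdapter>();

function registerBackend(adapter: BackendAdapter): void {
  if (registry.has(adapter.name)) {
    throw new Error(`backend already registered: ${adapter.name}`);
  }
  registry.set(adapter.name, adapter);
}

function getBackend(name: string): BackendAdapter {
  const adapter = registry.get(name);
  if (!adapter) throw new Error(`unknown backend: ${name}`);
  return adapter;
}
```

Registering by name keeps routing (including the fallback chains mentioned above) a simple lookup, and lets third-party adapters ship as ordinary npm packages.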