Documentation

Everything you need to integrate Kestrel and start saving.

Contents

Quickstart

Get started in 3 steps. No SDK to install, no code migration.

1. Create your API key

Sign in to the dashboard with GitHub or Google. Go to the API Keys tab, enter your provider API keys (OpenAI, Anthropic, etc.), and generate a Kestrel key.

2. Change your base URL

Replace your provider's base URL with Kestrel's. Your existing code stays the same.

```python
# Before
client = OpenAI(api_key="sk-proj-...")

# After
client = OpenAI(
    base_url="https://api.usekestrel.io/v1",
    api_key="ks-your-key",
)
```

3. Send requests as normal

Every request is analyzed and routed to the cheapest model that can handle it. Simple prompts go to economy models, complex ones stay on premium.

```python
response = client.chat.completions.create(
    model="gpt-4o",  # Kestrel may route to a cheaper model
    messages=[{"role": "user", "content": "Hello"}],
)
# Simple prompt → routed to gpt-4o-mini (90% cheaper)
# Complex prompt → stays on gpt-4o (full quality)
```

That's it. View your savings in real time on the dashboard.

How Routing Works

Every request goes through a 3-stage pipeline in under 2ms:

  1. Analyze — Extract features from the request: message length, keywords, tools, system prompt complexity, conversation depth
  2. Classify — Score the request across 5 dimensions (reasoning, output complexity, domain specificity, instruction nuance, error tolerance) and assign a tier: Economy, Standard, or Premium
  3. Route — Select the cheapest available model in the assigned tier from your configured providers

The semantic cache adds a Stage 0: if a similar request was recently answered, the cached response is returned instantly with zero provider cost.
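
The pipeline above can be sketched roughly as follows. This is an illustrative sketch only: the feature names, scoring thresholds, and model table are assumptions for the example, not Kestrel's actual classifier.

```python
# Illustrative sketch of the Stage 0-3 pipeline. All thresholds, feature
# names, and the model table are assumptions, not Kestrel's real internals.

CACHE = {}  # Stage 0: keyed by a normalized form of the prompt

TIER_MODELS = {  # cheapest-first per tier (illustrative)
    "economy": ["gpt-4o-mini"],
    "standard": ["gpt-4o"],
    "premium": ["o1"],
}

def analyze(messages):
    """Stage 1: extract simple features from the request."""
    text = " ".join(m["content"] for m in messages)
    return {
        "length": len(text),
        "depth": len(messages),
        "has_code": "```" in text or "def " in text,
    }

def classify(features):
    """Stage 2: score the request and map the total to a tier."""
    score = 0
    score += features["length"] > 500   # long prompts score higher
    score += features["depth"] > 4      # deep conversations score higher
    score += 2 * features["has_code"]   # code-heavy requests score higher
    if score == 0:
        return "economy"
    return "standard" if score <= 2 else "premium"

def route(messages):
    """Stage 0 + Stage 3: check the cache, then pick the cheapest model."""
    key = " ".join(m["content"] for m in messages).strip().lower()
    if key in CACHE:                    # Stage 0: cache hit, zero cost
        return "cache", CACHE[key]
    tier = classify(analyze(messages))  # Stages 1-2
    return tier, TIER_MODELS[tier][0]   # Stage 3: cheapest in the tier

def record(messages, response):
    """After the provider responds, populate the Stage 0 cache."""
    key = " ".join(m["content"] for m in messages).strip().lower()
    CACHE[key] = response
```

A real classifier would also weigh tools, system-prompt complexity, and the other dimensions listed above; the structure, not the scoring, is the point here.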

Tier examples

SDK Examples

Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.usekestrel.io/v1",
    api_key="ks-your-key",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```

JavaScript/TypeScript (OpenAI SDK)

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.usekestrel.io/v1",
  apiKey: "ks-your-key",
});

const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain quantum computing" }],
});
console.log(response.choices[0].message.content);
```

cURL

```shell
curl https://api.usekestrel.io/v1/chat/completions \
  -H "Authorization: Bearer ks-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

LangChain

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    base_url="https://api.usekestrel.io/v1",
    api_key="ks-your-key",
)

response = llm.invoke("Explain quantum computing")
print(response.content)
```

API Reference

Kestrel is fully OpenAI API-compatible. Any SDK or tool that works with OpenAI works with Kestrel.

Base URL

https://api.usekestrel.io/v1

Authentication

Pass your Kestrel API key in the Authorization header:

Authorization: Bearer ks-your-key

Endpoints

POST /v1/chat/completions
Send a chat completion request. Supports streaming.
GET /health
Health check. Returns {"status": "ok"}.
GET /api/dashboard/savings
Usage and savings summary for your API key.
GET /api/dashboard/routing
Routing tier distribution (economy/standard/premium/cache).
GET /api/dashboard/analytics
Per-model usage analytics.
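
Since /v1/chat/completions speaks the standard OpenAI protocol, streaming works through the usual `stream=True` flag in the OpenAI SDK; nothing Kestrel-specific is required. A sketch, using the same base URL and key as the examples above:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.usekestrel.io/v1",
    api_key="ks-your-key",
)

# stream=True yields chunks as the provider generates tokens;
# each chunk carries an incremental delta of the message content.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```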

Request format

Standard OpenAI chat completions format. All fields are supported:

```json
{
  "model": "gpt-4o",
  "messages": [{"role": "user", "content": "..."}],
  "temperature": 0.7,
  "max_tokens": 1000,
  "stream": false,
  "tools": [...],
  "response_format": {"type": "json_object"}
}
```

Dashboard

The dashboard shows real-time analytics:

Supported Providers

Kestrel routes across all major LLM providers. Add as many as you want when creating your API key:

The more providers you add, the more routing options Kestrel has to find savings.

Security

FAQ

Will my responses be different?

For simple requests routed to cheaper models, responses may differ slightly in style but not in correctness. Complex requests stay on premium models, so their quality is unchanged. You can always set a tier floor to prevent routing below a certain level.

What if a cheaper model gives a bad response?

The routing classifier is conservative — it only downgrades when confident the cheaper model can handle it. Over time, the system learns from outcome signals and improves its routing decisions.

Does Kestrel add latency?

The routing classification takes less than 2ms. Semantic cache hits are near-instant. Total added latency is negligible compared to LLM inference time.

Can I force a specific model?

Yes. If you request a model that Kestrel recognizes as economy-tier (like gpt-4o-mini), it won't route up to a more expensive model. The system only routes cheaper, never more expensive.

How does billing work?

You pay 15% of the savings Kestrel generates. If your baseline cost would have been $1,000 and Kestrel reduces it to $400, you pay 15% of the $600 saved, or $90. Your total cost is $490 instead of $1,000. If savings are $0, you pay $0.
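
The arithmetic is easy to check. A minimal sketch of the fee model as described above (the function name is ours, for illustration):

```python
def kestrel_bill(baseline_cost, actual_cost, fee_rate=0.15):
    """Total paid = actual provider cost + 15% of the savings.

    No savings means no fee.
    """
    savings = max(baseline_cost - actual_cost, 0)
    fee = fee_rate * savings
    return actual_cost + fee

# FAQ example: $1,000 baseline reduced to $400
# → savings $600, fee $90, total $490
```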

Is Kestrel open source?

The core routing engine is open source at github.com/andber6/kestrel. The managed service (billing, caching, dashboard) is the commercial product.