Documentation

Everything you need to integrate Kestrel and start saving.

Contents

Quickstart

Get started in 3 steps. No SDK to install, no code migration.

1. Create your API key

Sign in to the dashboard with GitHub or Google. Go to the API Keys tab, enter your provider API keys (OpenAI, Anthropic, etc.), and generate a Kestrel key.

2. Change your base URL

Replace your provider's base URL with Kestrel's. Your existing code stays the same.

```python
# Before
client = OpenAI(api_key="sk-proj-...")

# After
client = OpenAI(
    base_url="https://api.usekestrel.io/v1",
    api_key="ks-your-key",
)
```

3. Send requests as normal

Every request is analyzed and routed to the cheapest model that can handle it. Simple prompts go to economy models, complex ones stay on premium.

```python
response = client.chat.completions.create(
    model="gpt-4o",  # Kestrel may route to a cheaper model
    messages=[{"role": "user", "content": "Hello"}],
)
# Simple prompt → routed to gpt-4o-mini (90% cheaper)
# Complex prompt → stays on gpt-4o (full quality)
```

That's it. View your savings in real time on the dashboard.

How Routing Works

Every request goes through a 3-stage pipeline in under 2ms:

  1. Analyze — Extract features from the request: message length, keywords, tools, system prompt complexity, conversation depth
  2. Classify — Score the request across 5 dimensions (reasoning, output complexity, domain specificity, instruction nuance, error tolerance) and assign a tier: Economy, Standard, or Premium
  3. Route — Select the cheapest available model in the assigned tier from your configured providers

The semantic cache adds a Stage 0: if a similar request was recently answered, the cached response is returned instantly with zero provider cost.
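
The pipeline above can be sketched roughly as follows. This is an illustrative sketch only: the feature names, scoring thresholds, and model table are assumptions for the example, not Kestrel's actual classifier.

```python
# Illustrative sketch of the Stage 0-3 pipeline. All thresholds, feature
# names, and the model table are assumptions, not Kestrel's real internals.

CACHE = {}  # Stage 0: keyed by a normalized form of the prompt

TIER_MODELS = {  # cheapest-first per tier (illustrative)
    "economy": ["gpt-4o-mini"],
    "standard": ["gpt-4o"],
    "premium": ["o1"],
}

def analyze(messages):
    """Stage 1: extract simple features from the request."""
    text = " ".join(m["content"] for m in messages)
    return {
        "length": len(text),
        "depth": len(messages),
        "has_code": "```" in text or "def " in text,
    }

def classify(features):
    """Stage 2: score the request and map the total to a tier."""
    score = 0
    score += features["length"] > 500   # long prompts score higher
    score += features["depth"] > 4      # deep conversations score higher
    score += 2 * features["has_code"]   # code-heavy requests score higher
    if score == 0:
        return "economy"
    return "standard" if score <= 2 else "premium"

def route(messages):
    """Stage 0 + Stage 3: check the cache, then pick the cheapest model."""
    key = " ".join(m["content"] for m in messages).strip().lower()
    if key in CACHE:                    # Stage 0: cache hit, zero cost
        return "cache", CACHE[key]
    tier = classify(analyze(messages))  # Stages 1-2
    return tier, TIER_MODELS[tier][0]   # Stage 3: cheapest in the tier

def record(messages, response):
    """After the provider responds, populate the Stage 0 cache."""
    key = " ".join(m["content"] for m in messages).strip().lower()
    CACHE[key] = response
```

A real classifier would also weigh tools, system-prompt complexity, and the other dimensions listed above; the structure, not the scoring, is the point here.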

Tier examples

SDK Examples

Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.usekestrel.io/v1",
    api_key="ks-your-key",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```

JavaScript/TypeScript (OpenAI SDK)

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.usekestrel.io/v1",
  apiKey: "ks-your-key",
});

const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain quantum computing" }],
});
console.log(response.choices[0].message.content);
```

cURL

```shell
curl https://api.usekestrel.io/v1/chat/completions \
  -H "Authorization: Bearer ks-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

LangChain

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    base_url="https://api.usekestrel.io/v1",
    api_key="ks-your-key",
)

response = llm.invoke("Explain quantum computing")
print(response.content)
```

API Reference

Kestrel is fully OpenAI API-compatible. Any SDK or tool that works with OpenAI works with Kestrel.

Base URL

https://api.usekestrel.io/v1

Authentication

Pass your Kestrel API key in the Authorization header:

Authorization: Bearer ks-your-key

Endpoints

POST /v1/chat/completions
Send a chat completion request. Supports streaming.
GET /health
Health check. Returns {"status": "ok"}.
GET /api/dashboard/savings
Usage and savings summary for your API key.
GET /api/dashboard/routing
Routing tier distribution (economy/standard/premium/cache).
GET /api/dashboard/analytics
Per-model usage analytics.
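
Since /v1/chat/completions speaks the standard OpenAI protocol, streaming works through the usual `stream=True` flag in the OpenAI SDK; nothing Kestrel-specific is required. A sketch, using the same base URL and key as the examples above:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.usekestrel.io/v1",
    api_key="ks-your-key",
)

# stream=True yields chunks as the provider generates tokens;
# each chunk carries an incremental delta of the message content.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```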

Request format

Standard OpenAI chat completions format. All fields are supported:

```json
{
  "model": "gpt-4o",
  "messages": [{"role": "user", "content": "..."}],
  "temperature": 0.7,
  "max_tokens": 1000,
  "stream": false,
  "tools": [...],
  "response_format": {"type": "json_object"}
}
```

Dashboard

The dashboard shows real-time analytics:

Supported Providers

Kestrel routes across all major LLM providers. Add as many as you want when creating your API key:

The more providers you add, the more routing options Kestrel has to find savings.

Security

FAQ

Will my responses be different?

For simple requests routed to cheaper models, responses may differ slightly in style but not in correctness. Complex requests stay on premium models, so their quality is unchanged. You can always set a tier floor to prevent routing below a certain level.

What if a cheaper model gives a bad response?

The routing classifier is conservative — it only downgrades when confident the cheaper model can handle it. Over time, the system learns from outcome signals and improves its routing decisions.

Does Kestrel add latency?

The routing classification takes less than 2ms. Semantic cache hits are near-instant. Total added latency is negligible compared to LLM inference time.

Can I force a specific model?

Yes. If you request a model that Kestrel recognizes as economy-tier (like gpt-4o-mini), it won't route up to a more expensive model. The system only routes cheaper, never more expensive.

How does billing work?

You pay 15% of the savings Kestrel generates. If your baseline cost would have been $1,000 and Kestrel reduces it to $400, you pay 15% of the $600 saved, or $90. Your total cost is $490 instead of $1,000. If savings are $0, you pay $0.
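
The arithmetic is easy to check. A minimal sketch of the fee model as described above (the function name is ours, for illustration):

```python
def kestrel_bill(baseline_cost, actual_cost, fee_rate=0.15):
    """Total paid = actual provider cost + 15% of the savings.

    No savings means no fee.
    """
    savings = max(baseline_cost - actual_cost, 0)
    fee = fee_rate * savings
    return actual_cost + fee

# FAQ example: $1,000 baseline reduced to $400
# → savings $600, fee $90, total $490
```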

Is Kestrel open source?

The core routing engine is open source at github.com/andber6/kestrel. The managed service (billing, caching, dashboard) is the commercial product.