Flock is a self-hosted LLM gateway and control plane. One Go binary turns your Macs and Linux boxes into a private inference cluster behind one endpoint that speaks both the OpenAI and Anthropic APIs. You can run 40+ open-weight models (Qwen, Llama, DeepSeek, GLM, gpt-oss and more) on your own hardware, shard a model too large for any single box across several machines, route across 20+ hosted providers with automatic key rotation and failover, and fall back to paid Claude / GPT only when you choose — with the team controls vendor APIs don't give you: per-user API keys, quotas, dollar budgets, audit log, and a built-in admin dashboard. Engine-agnostic: pick Ollama, vLLM, MLX, or llama.cpp-RPC as the backend.

What can Flock do besides governance (keys, quotas, audit)?

Flock is much more than a policy-and-logging layer. It runs open-weight LLMs on your own hardware through one API, load-balances the same model across many machines, and shards a single giant model across several boxes via llama.cpp-RPC. Its router does intelligent cross-provider routing with model=auto, automatic fallback chains, request hedging, sticky sessions for KV-cache reuse, placement cooldown circuit-breaking, and latency-aware routing. It speaks to 20+ hosted providers (Anthropic, OpenAI, Bedrock, OpenRouter, Groq, Together, Fireworks, Cohere, Mistral, Perplexity, DeepSeek, Cerebras, NVIDIA, Gemini, and more) with multi-key rotation and 429 failover. It also exposes embeddings, rerank, vision, and audio endpoints, an embeddings response cache, a guardrails framework, and full observability (Prometheus, OTLP tracing, Grafana dashboards, webhook/Langfuse/S3 callback sinks).

How many model providers does Flock support?

Flock ships native adapters for Anthropic, OpenAI, AWS Bedrock and GCP Vertex, plus 20+ OpenAI-compatible hosted gateways including OpenRouter, Groq, Together, Fireworks, Cohere, Mistral, Perplexity, DeepSeek, Cerebras, NVIDIA, Gemini, Hugging Face, Z.AI, Ollama Cloud, GitHub Models, Cloudflare and more — all behind one endpoint and one API key. You can stack multiple API keys per provider for automatic round-robin rotation and 429 failover, and define a cross-provider routing chain that walks free → cheap → paid and pins a local model as the always-available fallback.

Can Flock save my team money on AI API bills?

Yes. Flock moves the bulk of routine LLM calls onto open-weight models running on hardware you already own (free apart from electricity), and keeps an optional escape hatch to paid Claude / GPT for the calls that genuinely need a frontier model. A team of 10 developers using AI tools heavily can spend $200–500 per developer per month on vendor tokens; routing the easy 80% of calls to local models typically pays back the hardware in a few months.
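
As a rough worked example of that payback claim, the sketch below computes the months needed to break even. The $8,000 hardware price is a hypothetical figure, not a number published by Flock, and the calculation assumes vendor spend falls in proportion to the share of calls moved to local models and ignores electricity.

```python
def payback_months(devs: int, spend_per_dev: float, local_share: float,
                   hardware_cost: float) -> float:
    """Months until avoided vendor spend equals the hardware cost."""
    monthly_savings = devs * spend_per_dev * local_share
    if monthly_savings <= 0:
        raise ValueError("no savings: check devs, spend_per_dev and local_share")
    return hardware_cost / monthly_savings


HARDWARE_COST = 8_000.0  # hypothetical price of one inference box

low_spend = payback_months(10, 200.0, 0.8, HARDWARE_COST)   # $200 per dev per month
high_spend = payback_months(10, 500.0, 0.8, HARDWARE_COST)  # $500 per dev per month
print(f"payback: {high_spend:.1f} to {low_spend:.1f} months")
```

Under these assumptions the range is two to five months; a cheaper box or a larger local share shortens it.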

How is Flock different from running an inference engine directly?

Engines like Ollama, vLLM, and MLX-LM are single-machine, single-user inference servers. Flock is the layer above them: it routes requests across a fleet of machines, exposes one consistent API (OpenAI + Anthropic) so your tools don't care which engine is underneath, adds per-user keys + quotas + audit log, orchestrates llama.cpp-RPC sharding so a single model can run split across multiple machines, and can transparently fall back to paid Claude/GPT when you choose.

Can Claude Code use a local model via Flock?

Yes. Set ANTHROPIC_BASE_URL=http://localhost:8080, ANTHROPIC_AUTH_TOKEN=your-flock-key, and ANTHROPIC_MODEL=llama-3.2-1b (or any local catalog id). Claude Code then sends its requests to your local model instead of the paid Anthropic API.

Does Flock work across multiple machines?

Yes. The leader machine runs flock up. Every other machine runs flock join. The leader's router automatically dispatches requests to whichever worker has the model loaded, and llama.cpp-RPC sharding lets one large model run split across many machines.

v1.32.0 · 100% free open source · Apache-2.0 · no telemetry

Self-hosted AI for your team.
One endpoint. Your hardware.

Flock is a free, open-source self-hosted LLM gateway. One Go binary turns your Macs and Linux boxes into a private inference cluster behind one endpoint that speaks both the OpenAI and Anthropic APIs. Run 40+ open-weight models on your own hardware, shard a model too big for one box across machines, route intelligently across 20+ providers with automatic failover — and fall back to paid Claude / GPT only when you choose. The team controls vendor APIs lack (per-user keys, quotas, budgets, audit) come built in. Engine-agnostic: bring Ollama, vLLM, MLX, or llama.cpp-RPC.

Try it in 60 seconds

curl -fsSL https://raw.githubusercontent.com/llmpy/flock/main/installer/install.sh | sh
flock doctor      # tells you the one command to install an engine, if you don't have one
flock up          # starts your private LLM gateway, prints your admin API key


✓ Mac (Apple Silicon) ✓ Linux (x86_64 / arm64) ✓ Cursor · Claude Code · Aider ✓ No Docker. No Python. No k8s.

The stack

Tools above. Flock in the middle. Engines + models below.

Every client speaks OpenAI or Anthropic. Every engine speaks its native API. Flock is the one URL + one key in between.

Layer 1 Clients · what your team uses

Claude Code

Cursor

Aider

Continue

Zed

Cline

Qwen-Code

Hermes

OpenClaw

OpenCode

Open WebUI

Open Notebook

Goose

Plandex

OpenHands

Codex CLI

OpenAI SDK

Anthropic SDK

curl / HTTP

one URL · one API key

Layer 2 Flock · the gateway you self-host

⬢ Flock ⬢

single Go binary · embedded UI · no telemetry

API surface

chat · messages · embeddings · rerank · audio · vision

Routing

cross-provider auto chain · least-loaded · sharded · hedging · sticky · latency-aware

Providers

20+ vendors · multi-key rotation · 429 failover · vendor fallback

Controls

per-user keys · quotas · $ budgets · audit · Prometheus + OTLP

native APIs · your hardware

Layer 3a Engines · drive the models

Ollama

Mac · Linux · Windows

vLLM

NVIDIA · throughput

MLX-LM

Apple Silicon

llama.cpp

CPU · GGUF · RPC sharding

Layer 3b Models · 41 curated families + any HF or Ollama tag

Llama 3.x / 4

Qwen 2.5 / 3 / 3.6

Gemma 4

GPT-OSS 20B / 120B

DeepSeek R1 / V4

Mistral Nemo

Phi 4

Nemotron 3

Kimi K2.6

GLM 4 → 5.2

Qwen3-VL 8B / 32B

MiMo 7B · VL · Audio

Pixtral 12B

Moondream 3

Step 3.7

LFM 2.5

Mellum 2

Nomic Embed

+ any HF GGUF

+ any Ollama tag

Mix and match across all three layers. Any client above + any engine below + any model on the bottom row — Flock is the only piece that needs to know about the rest.

Where Flock sits

One layer between your tools and the LLMs

Your tools point at Flock with one URL and one API key. Flock decides — per request — whether to serve from your hardware, fan out across machines, or proxy to a paid vendor. Switching the underlying model is a config change, not a re-wire.

           ┌──────────────────────────────────────────────────────────────┐
           │                       YOUR USE CASES                         │
           │             (the tools your team already uses)               │
           └──────────────────────────────────────────────────────────────┘
                  │           │          │             │            │
                  ▼           ▼          ▼             ▼            ▼
            ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
            │  Cursor  │ │  Claude  │ │  Aider   │ │  Custom  │ │   curl   │
            │          │ │   Code   │ │          │ │ Python   │ │  scripts │
            │          │ │          │ │          │ │   SDK    │ │          │
            └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
                 │  OpenAI    │ Anthropic  │  OpenAI    │  Either    │  HTTP
                 └────────────┴────────────┴────────────┴────────────┘
                                          │
                                          │   ONE URL · ONE API KEY
                                          ▼
      ┌──────────────────────────────────────────────────────────────────────┐
      │                  ⬢ ⬢ ⬢   FLOCK   ⬢ ⬢ ⬢                              │
      │                  (this is what we built)                             │
      │  ────────────────────────────────────────────────────────────────    │
      │  Gateway     OpenAI + Anthropic + /v1/rerank + /v1/audio/*           │
      │              keys: allowlist · RPM/TPM · $ budgets · TTL expiry      │
      │              guardrails · response cache · callbacks · admin UI      │
      │                                                                      │
      │  Router      Same model on N nodes  → load-balance + sticky session  │
      │              Flaky worker          → placement cooldown (skip)        │
      │              Different models      → route by placement              │
      │              Model bigger than node→ split via llama.cpp-RPC         │
      │              Latency-sensitive     → hedge to top-N workers          │
      │              model="auto"          → cross-provider routing chain    │
      │              Claude / GPT / 20+    → proxy to vendor (multi-key)     │
      │              Engine error/timeout  → typed fallback chain + retries  │
      └─────────────────────────────┬────────────────────────────────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              ▼                     ▼                     ▼
       ┌─────────────┐       ┌─────────────┐       ┌─────────────┐
       │   Engines   │       │   Engines   │       │   Egress    │
       │  (any mix)  │       │  (any mix)  │       │   proxy     │
       │  • Ollama   │       │  • Ollama   │       │ Anthropic   │
       │  • vLLM     │       │  • vLLM     │       │ OpenAI      │
       │  • MLX-LM   │       │  • MLX-LM   │       │ Bedrock     │
       │  • llama.cpp│       │  • llama.cpp│       │ OpenRouter  │
       │  • whisper  │       │  • piper    │       │ Groq, Mist- │
       │  • piper    │       │  • whisper  │       │ ral, Cohere │
       └──────┬──────┘       └──────┬──────┘       │ Perplexity… │
              │                     │              └──────┬──────┘
              ▼                     ▼                     ▼
      ┌──────────────────────────────────────────────────────────────────────┐
      │                    UNDERLYING LLMs / WEIGHTS                         │
      │                                                                      │
      │   YOUR HARDWARE                              VENDOR APIs             │
      │   • Mac Studio · Mac Mini                    • Claude (Anthropic)    │
      │   • Linux + RTX GPU                          • GPT, o3, o4 (OpenAI)  │
      │                                                                      │
      │   41 curated catalog models (Qwen 3.6, GLM,   Each request routed    │
      │   gpt-oss, Llama 4, Gemma 4, DeepSeek V4,     to EITHER your hard-   │
      │   Kimi K2.6, Nemotron 3 Ultra, vision +       ware OR a vendor —     │
      │   embedding models)                           you pay vendors only   │
      │   + any HuggingFace or Ollama model.          when YOU chose to.     │
      └──────────────────────────────────────────────────────────────────────┘

Without Flock you'd lock into one provider, share one API key, trust the vendor with your prompts, and pay per token. With Flock you change qwen3.6-27b → claude-opus-4-7 in one place — the dev's editor doesn't know or care.

Shipped catalog · smoke-tested today

Qwen 2.5 Coder · Qwen 3 · Llama 3.2 · Llama 3.3 sharded · DeepSeek R1

Plus any model on HuggingFace via flock model add hf:owner/repo or any Ollama tag via flock model add ollama:<tag>. Vision (image_url content blocks on /v1/chat/completions) and embeddings (/v1/embeddings) ship via the Ollama engine path. /v1/rerank and the /v1/audio/* speech endpoint shells ship today; the whisper / piper engine drivers behind them are on the roadmap (see ROADMAP.md).

What Flock does, in plain English

Flock is the layer that lets your tools talk to any LLM through one URL and one API key. Your team's AI tools talk to one endpoint; Flock decides — per request — whether to serve from your own machines (free + private), shard a giant model across several of them, route across 20+ hosted providers with automatic failover, or transparently fall back to real Claude / GPT (paid, logged) — your call.

It's more than a governance layer. Yes, the team controls are there — per-user keys, quotas, dollar budgets, audit — but they're one pillar. The rest is a real inference router: run open-weight models on your hardware, swap models with a single command, balance load across a fleet, and let an intelligent model="auto" chain pick the best available provider for every request.

🧠

Local models, one API

Pick the engine that fits your hardware — Ollama, vLLM, MLX-LM, or llama.cpp-RPC. Flock exposes whichever one you run through /v1/chat/completions (OpenAI) and /v1/messages (Anthropic) — so Claude Code works against your local Llama.

🔑

Team-ready out of the box

Per-user API keys with scopes, TTL expiry, model allowlists, RPM/TPM rate limits, $ budgets, audit log, Prometheus + OTLP, embedded admin UI. No nginx, no LiteLLM-plus-Python — just one binary.

🌐

Scales from 1 to N machines

Start on a laptop. Add more machines with one flock join command. Router load-balances replicas. For models too big for any single box, built-in llama.cpp-RPC sharding splits one model across many.

🔁

Switching models is one action

flock model add <id> for catalog models. flock model add hf:owner/repo for anything else on HuggingFace. No hand-written YAML, no manual GGUF downloads, no per-worker setup. Engine, quant, and shard count are picked for you from your hardware (M4-T16 → M4-T20 — shipped).

⚙️

CLI is the source of truth

Every action in the dashboard is one flock command underneath. Anything you can do with the UI, you can do with curl, cron, or an SSH session — same audit log, same validation, same outcome. No web-only knobs.

The honest one-line pitch

The only OSS tool that ships, in one Go binary, all of: OpenAI + Anthropic APIs (chat / messages / embeddings / rerank / audio / vision) + 20+ provider passthroughs with multi-key rotation & 429 failover + a cross-provider model="auto" routing chain + per-key allowlists / rate limits / $ budgets / TTL + multi-node routing (sticky / cooldown / hedging / latency-aware) + sharding + response cache + guardrails + webhook / Langfuse / S3 callbacks + embedded admin UI — designed for self-hosting on a Mac + Linux team fleet.

Interface

An admin UI that ships with the binary

Embedded via //go:embed. No separate frontend to deploy. Sign in by pasting the admin key Flock prints on first run. Every action also works from the CLI.

http://localhost:8080

Flock

orchestrate open LLMs · your hardware

key: sk-orc-xK9p…

Nodes

3 ready · 1 draining

Models

1 sharded

Recent requests

2,847

last 200

Tokens served

1.2M

saved ~$340 vs API

# Quick start: paste your admin key into your tools or use curl:
$ curl http://localhost:8080/v1/chat/completions \
    -H 'Authorization: Bearer sk-orc-xK9p…' \
    -d '{"model":"auto","messages":[{"role":"user","content":"hi"}]}'

Nodes

ID	Hostname	OS / Arch	RAM	Address	State	Last heartbeat	Actions
local	mbp-hadi	darwin/arm64	24 GB	127.0.0.1:8080	ready	just now	drain · remove
n_abc123	mac-mini-office	darwin/arm64	64 GB	192.168.1.42:8081	ready	3 sec ago	drain · remove
n_def456	gpu-tower	linux/amd64	128 GB	192.168.1.50:8081	ready	2 sec ago	drain · remove
n_ghi789	lab-mac	darwin/arm64	32 GB	192.168.1.60:8081	draining	12 sec ago	drain · remove

Installed models

ID	Status	Source	Size	Installed	Actions
llama-3.2-1b	ready	ollama:llama3.2:1b	1.3 GB	2 days ago	remove
qwen-coder-7b	ready	ollama:qwen2.5-coder:7b	4.7 GB	1 day ago	remove
qwen-coder-14b	ready	ollama:qwen2.5-coder:14b	9.0 GB	3 hours ago	remove
qwen3-30b	ready	vllm:Qwen/Qwen3-30B-A3B	19 GB	just now	remove
llama-3.3-70b-sharded	sharded	llamacpp:/var/lib/flock/…q4_k_m.gguf	42 GB	5 min ago	remove

Add a model from the catalog

Catalog entry

Sharded models auto-delegate to the shard orchestrator.

Sharded models

One model split across multiple nodes via llama.cpp RPC. The coordinator runs on the leader; rpc-server runs on each shard host.

Create new sharded model

Catalog model id

Shards

llama-3.3-70b-sharded

3 shards

Role	Node	Address	Status	Last seen
coordinator	local	127.0.0.1:9001	ready	just now
rpc	n_abc123	192.168.1.42:50052	ready	just now
rpc	n_def456	192.168.1.50:50052	ready	just now

Prereqs: The leader needs llama-server (brew install llama.cpp); each worker needs rpc-server on PATH; the catalog entry needs sharding.required: true and a local GGUF path.

API keys

Heads-up: "node" scope tokens are the shared secret between leader and worker. They are stored plaintext on the leader. Only issue on a trusted network (LAN or Tailscale). See the Settings tab for the full security model.

ID	Name	Scope	User	Daily quota	Status	Created	Actions
k_initial	initial-admin	admin	admin	∞	active	3 days ago	revoke
k_xY3vP	alice	user	alice	100,000	active	1 day ago	revoke
k_n9LqR	bob	user	bob	200,000	active	1 day ago	revoke
k_join01	mac-mini-join	node	—	∞	active	2 hours ago	revoke
k_old	eve-old	user	eve	100,000	revoked	last week

Create a new key

Name

Scope

Daily quota (0=∞)

New keys are shown once in a modal — save them immediately.

Recent requests

Time	User	Model	Protocol	Prompt	Completion	Latency	Outcome
14:32:08	alice	qwen-coder-14b	openai	412	128	1,832 ms	ok
14:31:55	bob	llama-3.2-1b	anthropic	28	45	312 ms	ok
14:31:42	alice	claude-opus-4-7	anthropic	2,840	1,205	8,541 ms	ok
14:31:21	bob	qwen-coder-14b	openai	198	82	1,021 ms	ok
14:30:58	alice	qwen-coder-32b	openai	0	0	0 ms	rate_limited
14:30:31	bob	llama-3.3-70b-sharded	anthropic	512	312	3,712 ms	ok

Routing chain — what `model="auto"` walks, top to bottom

Drag to reorder. Each request tries the first entry, then advances on rate-limit or failure, and commits the first success. Pin a local model last as the always-available $0 floor. Same as flock route ls / mv / set.

#	Model / provider	Kind	Est. cost
⠿ 1	groq/llama-3.3-70b	provider	free tier	drag
⠿ 2	deepseek/deepseek-chat	provider	$ cheap	drag
⠿ 3	claude-opus-4-7	vendor	$$ paid	drag
⠿ 4	qwen3.6-27b	local · pinned	$0 floor	drag

# or manage it from the CLI
$ flock route ls
$ flock route add groq/llama-3.3-70b --first
$ flock route mv claude-opus-4-7 3
$ flock route reset   # recompute free → cheap → paid → local

Audit log

Time	Actor	Action	Target
14:32:18	initial-admin	POST /admin/v1/shards/create	192.168.1.10:54231
14:31:02	initial-admin	POST /admin/v1/tokens	192.168.1.10:54201
14:28:44	initial-admin	POST /admin/v1/nodes/n_ghi789/drain	192.168.1.10:54180
14:15:21	eve	egress.anthropic	claude-opus-4-7
12:02:18	initial-admin	DELETE /admin/v1/tokens/k_old	192.168.1.10:53811
10:48:09	initial-admin	POST /admin/v1/models	192.168.1.10:53120

Mocked preview. The real UI ships embedded in the Flock binary.

Get started

Install & first chat in 3 minutes

Pick your platform — 4 commands each.

macOS (Apple Silicon)

# 1. install Flock (one Go binary, ~23 MB)
curl -fsSL https://raw.githubusercontent.com/llmpy/flock/main/installer/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# 2. install an engine — Ollama is the simplest default on Apple Silicon
#    alternatives: pip install mlx-lm  ·  llama.cpp's llama-server  ·  vLLM in Docker
brew install --cask ollama && open -a Ollama

# 3. start Flock with a small model (~1 GB, fast download)
FLOCK_DEFAULT_MODEL=llama-3.2-1b flock up

Linux

# 1. install Flock (one Go binary, ~23 MB)
curl -fsSL https://raw.githubusercontent.com/llmpy/flock/main/installer/install.sh | sh
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc

# 2. install an engine — Ollama is the simplest default
#    alternatives: vLLM (NVIDIA)  ·  llama.cpp's llama-server  ·  MLX-LM (Apple Silicon only)
curl -fsSL https://ollama.com/install.sh | sh && sudo systemctl enable --now ollama

# 3. start Flock with a small model (~1 GB, fast download)
FLOCK_DEFAULT_MODEL=llama-3.2-1b flock up

💡 Not sure which engine to install? Run flock doctor after step 1 — it inspects your hardware and tells you the single command to run.

After it boots you'll see

✔ default model: llama-3.2-1b
✔ engine: ollama at http://127.0.0.1:11434

  Flock is ready.

  API:    http://localhost:8080/v1
  Health: http://localhost:8080/healthz

  Admin API key (shown once — store it now):
    sk-orc-xK9p…

Copy the admin key. You'll need it next.

Test it (pick one)

curl

curl http://localhost:8080/v1/chat/completions -H "Authorization: Bearer sk-orc-..." -d '{"model":"auto","messages":[…]}'

Web UI

http://localhost:8080 → paste the admin key

Claude Code

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=sk-orc-...
export ANTHROPIC_MODEL=llama-3.2-1b
claude

Team rollout

Wire up Claude Code, Cursor, your whole team

Flock works with every tool that speaks OpenAI or Anthropic. Three ways: flock connect <tool> on the CLI, the Connect tab in the dashboard, or copy a snippet manually. All three use the same code path.

Shipped

CLI

One command per tool

flock connect claude-code
flock connect cursor
flock connect hermes
flock connect open-webui    # ChatGPT-style web UI
flock connect goose         # Block's terminal agent
flock connect plandex       # agentic planner
flock connect openhands     # autonomous coding agent
flock connect codex-cli     # OpenAI's official CLI
flock connect opencode
flock connect --list

Prints the tool's config with your base URL and token already substituted. The token is read from $FLOCK_TOKEN or ~/.flock/admin.key.

Invite a teammate

flock invite hadi \
  --quota 100000

Creates a user-scope token plus a share card with snippets for all 19 clients (markdown you can paste into Slack).

Shipped

Dashboard

Connect tab

Dropdown of 19 clients, pre-filled snippet, one-click Copy, Test-connection button that proves the gateway works end-to-end.

Playground tab

In-browser chat — pick a model, send a message, see streaming output. 10-second sanity check before wiring up Cursor.

Multi-turn Chat tab

In-memory chat tester — pick a model, hold a real conversation through the gateway. Nothing stored; reload or Clear wipes it.

Invite teammate (in Tokens tab)

Modal with name + quota + clients form → returns the share card with one-click Copy-as-markdown.

Always works

Manual

Claude Code

export ANTHROPIC_BASE_URL=...
export ANTHROPIC_AUTH_TOKEN=sk-orc-...
claude

Cursor

Settings → Models → Override OpenAI Base URL: .../v1

Full per-client snippets in README · Connecting clients.

The team rollout flow

Admin sets up once

flock up on the leader. Add a worker or two if needed.

Invite each teammate

flock invite <name> → paste the output card into Slack.

Teammates paste & go

They copy the snippet for their tool of choice. Done — Claude Code / Cursor now run against your hardware.

Architecture

What runs where

One machine is enough for most teams. Add more when you want throughput, redundancy, or a model that doesn't fit on a single box.

Single machine

Solo dev or small team sharing one box.

Your computer (Mac or Linux)
┌─────────────────────────────────────────────────┐
│                                                 │
│   Cursor / Claude Code / curl / SDKs            │
│                    │                            │
│                    ▼                            │
│               FLOCK :8080                       │
│      (gateway · auth · UI · audit)              │
│                    │                            │
│                    ▼                            │
│              Ollama :11434                      │
│            (the actual LLM)                     │
│                                                 │
└─────────────────────────────────────────────────┘

Multiple machines

Leader + workers. Router decides per request.

    LEADER                         WORKER
┌──────────────┐               ┌──────────────┐
│  flock up    │               │  flock join  │
│  ─────────   │               │  ─────────   │
│  Router      │ ───routes──▶  │  agent       │
│  + UI        │     :8081     │  + Ollama    │
│  + auth      │ ◀──heartbeat──│              │
│  + Ollama    │   every 5s    │  loaded      │
│              │               │  models      │
└──────────────┘               └──────────────┘
                LAN / Tailscale

Sharded model (one big model across many machines)

For models too large for any single box — e.g. Llama 70B Q4 across 2× Mac Mini via llama.cpp RPC. flock shard create <model> <N> does all of this automatically:

Client request for "llama-3.3-70b-sharded"
           │
           ▼
┌──────────────────────┐
│ LEADER (coordinator) │         ┌──────────────────────┐
│ llama-server         │   RPC   │ WORKER A             │
│   --rpc A:50052,     │  ────▶  │ rpc-server :50052    │  (layers 1-40)
│         B:50052      │  ◀────  │ (auto-launched       │
│                      │         │  by Flock)           │
│ serves OpenAI API    │         └──────────────────────┘
│ to clients           │         ┌──────────────────────┐
│                      │  ────▶  │ WORKER B             │
│                      │  ◀────  │ rpc-server :50052    │  (layers 41-80)
└──────────────────────┘         └──────────────────────┘
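
The layer ranges in the sharding diagram (1-40 and 41-80) can be reproduced with a simple even split. This is only an illustration of the idea: `split_layers` is a hypothetical helper, and the real assignment is made by llama.cpp and Flock and may weight shards by each machine's memory.

```python
def split_layers(n_layers: int, n_workers: int) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (first, last) layer ranges, one per worker."""
    if n_workers < 1 or n_layers < n_workers:
        raise ValueError("need at least one layer per worker")
    base, extra = divmod(n_layers, n_workers)
    ranges, start = [], 1
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)  # the first `extra` workers take one more layer
        ranges.append((start, start + size - 1))
        start += size
    return ranges


two_way = split_layers(80, 2)    # the case drawn in the diagram
three_way = split_layers(80, 3)  # an uneven split
print(two_way, three_way)
```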

Features

What's in the box

Everything below is in the Go binary you download. No add-ons, no separate services.

🧭 new

Cross-provider `auto` routing chain

Send model="auto" and Flock walks a persisted, user-ordered chain — free → cheap → paid — committing the first success and advancing past any rate-limit or transient failure. A local model pinned last is your always-available $0 floor.

$ flock route                 # show the live chain
⠿ 1 groq/llama-3.3-70b      # free
⠿ 2 deepseek/deepseek-chat  # cheap
⠿ 3 claude-opus-4-7         # paid
$ flock route reset           # free → cheap → paid → local

auto · free → cheap → paid → local
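
The walk described for model="auto" (try the first entry, advance on rate-limit or transient failure, commit the first success) can be sketched as below. This is a minimal illustration, not Flock's router: `ChainEntry`, `walk_chain`, the two exception classes, and `fake_call` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class RateLimited(Exception):
    """The provider answered with a rate-limit response."""


class TransientFailure(Exception):
    """Timeout, 5xx, or another retryable provider error."""


@dataclass(frozen=True)
class ChainEntry:
    model: str  # e.g. "groq/llama-3.3-70b"
    kind: str   # "provider", "vendor", or "local"


def walk_chain(chain: Sequence[ChainEntry],
               call: Callable[[ChainEntry], str]) -> tuple[str, str]:
    """Try entries top to bottom; return (model, reply) for the first success."""
    skipped = []
    for entry in chain:
        try:
            return entry.model, call(entry)
        except (RateLimited, TransientFailure) as exc:
            skipped.append(f"{entry.model}: {type(exc).__name__}")
    raise RuntimeError("every chain entry failed: " + "; ".join(skipped))


chain = [
    ChainEntry("groq/llama-3.3-70b", "provider"),
    ChainEntry("deepseek/deepseek-chat", "provider"),
    ChainEntry("claude-opus-4-7", "vendor"),
    ChainEntry("qwen3.6-27b", "local"),  # pinned last as the $0 floor
]


def fake_call(entry: ChainEntry) -> str:
    # Stand-in for a real request: pretend the free tier is rate limited.
    if entry.model.startswith("groq/"):
        raise RateLimited(entry.model)
    return f"hello from {entry.model}"


served_by, reply = walk_chain(chain, fake_call)
print(served_by, reply)
```

Because the local entry sits last, the chain only raises when the local engine also fails.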

🌍

20+ providers, one endpoint

Native Anthropic, OpenAI, Bedrock & Vertex, plus 20+ OpenAI-compatible gateways — OpenRouter, Groq, Together, Fireworks, Cohere, Mistral, Perplexity, DeepSeek, Cerebras, NVIDIA, Gemini, Hugging Face, Z.AI, GitHub Models and more. Address any with a slash prefix.

$ flock chat -m groq/llama-3.3-70b "ping"
$ flock chat -m deepseek/deepseek-chat "ping"
$ flock chat -m claude-opus-4-7 "ping"

20+ providers · one endpoint

🔄 new

Multi-key rotation + 429 failover

Stack OPENAI_API_KEY, _2, _3… per provider. Egress rotates round-robin, parks any key that returns 429 / 5xx (honoring Retry-After), and retries on the next — so free-tier limits stop being a wall.

export OPENAI_API_KEY=sk-...a1
export OPENAI_API_KEY_2=sk-...b2
export OPENAI_API_KEY_3=sk-...c3   # round-robin

round-robin · 429 failover
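
A minimal sketch of the rotation idea, assuming a key is benched for the number of seconds given by Retry-After. `KeyPool` and its 30-second default are hypothetical and are not taken from Flock's source; the demo injects a fake clock so the behaviour is deterministic.

```python
import time
from typing import Callable, Optional


class KeyPool:
    """Round-robin over API keys, skipping keys parked after a 429 or 5xx."""

    def __init__(self, keys: list[str], clock: Callable[[], float] = time.monotonic):
        if not keys:
            raise ValueError("at least one key is required")
        self._keys = list(keys)
        self._parked_until = {key: 0.0 for key in self._keys}
        self._next = 0
        self._clock = clock

    def acquire(self) -> Optional[str]:
        """Return the next usable key, or None if every key is parked."""
        now = self._clock()
        for _ in range(len(self._keys)):
            key = self._keys[self._next]
            self._next = (self._next + 1) % len(self._keys)
            if self._parked_until[key] <= now:
                return key
        return None

    def park(self, key: str, retry_after: float = 30.0) -> None:
        """Bench a key for `retry_after` seconds (the Retry-After value, if sent)."""
        self._parked_until[key] = self._clock() + retry_after


now = [0.0]  # fake clock, advanced by hand
pool = KeyPool(["key-a", "key-b", "key-c"], clock=lambda: now[0])
first = pool.acquire()
pool.park(first, retry_after=10)            # pretend key-a just returned 429
picks = [pool.acquire() for _ in range(4)]  # key-a is skipped while parked
now[0] = 11.0
after_cooldown = pool.acquire()             # key-a is eligible again
print(first, picks, after_cooldown)
```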

🔌

OpenAI + Anthropic APIs

/v1/chat/completions + /v1/messages. SSE streaming. Tool calls. The whole shape both SDKs expect.

$ curl localhost:8080/v1/chat/completions \
      -d '{"model":"auto","stream":true,...}'
$ curl localhost:8080/v1/messages   # Anthropic shape

OpenAI + Anthropic · SSE
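
For comparison, here is how the two request shapes differ when built from Python's standard library. The sketch only constructs the requests and does not send them. It assumes the gateway listens on http://localhost:8080 and accepts a bearer token on both paths, which matches the Claude Code setup in this document but is not otherwise confirmed; the key is a placeholder.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed gateway address
API_KEY = "sk-orc-..."              # placeholder; use the key printed by `flock up`


def build_request(path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST with bearer auth."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
        method="POST",
    )


messages = [{"role": "user", "content": "hi"}]

# OpenAI shape: /v1/chat/completions
openai_req = build_request("/v1/chat/completions",
                           {"model": "auto", "messages": messages})

# Anthropic shape: /v1/messages (that API requires max_tokens)
anthropic_req = build_request("/v1/messages",
                              {"model": "llama-3.2-1b", "max_tokens": 256,
                               "messages": messages})

# To send one against a running gateway:
#   with urllib.request.urlopen(openai_req) as resp:
#       print(json.load(resp))
```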

🖼️

Vision (image input)

Send image_url content blocks on the same chat endpoint. Works with Gemma 4, Llama 4 Scout, Qwen3-VL, Pixtral, MiMo-VL, Moondream 3, Step-3.7 (Ollama path).

$ flock chat -m qwen3-vl \
      --image ./diagram.png "what's in this?"

image_url · vision models

🧮

Embeddings

/v1/embeddings with nomic-embed-text or any Ollama embedding model. OpenAI-compatible response — drops into any RAG stack.

$ curl localhost:8080/v1/embeddings -d \
      '{"model":"nomic-embed-text","input":"hi"}'
→ [0.021, -0.114, 0.077, …]  (768 dims)

text → vector · RAG-ready
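
To show what "RAG-ready" means in practice, the sketch below ranks documents against a query using an OpenAI-style embeddings response. The response is hand-made with 3-dimensional vectors for readability (real nomic-embed-text vectors have 768 dimensions), and `cosine` and `vectors` are hypothetical helpers.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vectors(response: dict) -> list[list[float]]:
    """Pull embeddings out of an OpenAI-style response, ordered by index."""
    rows = sorted(response["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


# Made-up response: input 0 is the query, inputs 1 and 2 are documents.
response = {
    "object": "list",
    "model": "nomic-embed-text",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0, 0.0]},
        {"object": "embedding", "index": 1, "embedding": [0.9, 0.1, 0.0]},
        {"object": "embedding", "index": 2, "embedding": [0.0, 0.0, 1.0]},
    ],
}

query, *docs = vectors(response)
scores = [cosine(query, doc) for doc in docs]
best = max(range(len(docs)), key=scores.__getitem__)
print(f"best match: document {best}, scores: {scores}")
```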

⚙️

Multi-backend engines

Ollama, vLLM, MLX-LM, llama.cpp (single-node + RPC sharding). Hot-swappable via config. Flock auto-launches llama-server when you pick the llama.cpp engine — no second process to manage.

# catalog.yaml
engine: vllm      # or ollama | mlx | llama.cpp
$ flock up        # engine launched for you

Ollama · vLLM · MLX · llama.cpp

🔁

Hybrid vendor fallback

Set ANTHROPIC_API_KEY / OPENAI_API_KEY. Requests for claude-* / gpt-* transparently proxy upstream, logged the same as local.

export ANTHROPIC_API_KEY=sk-ant-...
$ flock chat -m claude-opus-4-7 "…"  # → upstream
$ flock chat -m qwen3-coder     # → local $0

local-first · cloud escape hatch

☁️

Bedrock + Vertex egress

Set FLOCK_BEDROCK_REGION and anthropic.* model IDs are signed via real SigV4 (aws-sdk-go-v2). FLOCK_VERTEX_PROJECT wires the ADC auth probe for gemini-*.

export FLOCK_BEDROCK_REGION=us-east-1
export FLOCK_VERTEX_PROJECT=my-gcp-proj
# anthropic.* → SigV4 · gemini-* → ADC

SigV4 · ADC

🔍

OTLP traces (end-to-end)

Set FLOCK_OTLP_ENDPOINT and get spans for every request: http.request → router.Chat → per-fallback-attempt → ollama.Chat with prompt + completion token counts. W3C traceparent propagation always on.

export FLOCK_OTLP_ENDPOINT=http://localhost:4317
# http.request → router.Chat → ollama.Chat

end-to-end spans
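
A client can join those traces by sending its own W3C traceparent header. The sketch below builds and parses a version-00 header as defined by the W3C Trace Context format; the helper names are hypothetical, and how Flock consumes the header beyond propagating it is not shown here.

```python
import re
import secrets

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Build a version-00 traceparent header with random IDs."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: str) -> dict:
    """Split a traceparent header into fields, rejecting malformed values."""
    match = _TRACEPARENT.match(value.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {value!r}")
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError("invalid version or all-zero id")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "sampled": bool(int(flags, 16) & 1)}


header = new_traceparent()
fields = parse_traceparent(header)
print(header, fields["sampled"])
```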

♻️

Catalog fallback chains

Declare fallback: [next-id, …] in catalog YAML. Router walks the chain in order on engine error / 5xx / timeout / model-not-loaded. Transparent to clients; visible in audit log.

# catalog.yaml
- id: qwen3-coder-30b
  fallback: [llama-3.3-70b, gemma-3-27b]

per-model fallback chain

🛡️

Hardware-floor refusal

flock model add checks min_ram_gb / min_vram_gb from the catalog and refuses installs that would oversubscribe. --force overrides when you know better.

$ flock model add llama-3.1-405b
✗ needs 256 GB RAM · you have 64 GB  (--force)

refuse oversubscription
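
The refusal logic amounts to comparing the catalog floor with what the machine reports. A minimal sketch follows; `CatalogEntry` and `check_install` are hypothetical names, only the `min_ram_gb` / `min_vram_gb` fields and the `--force` override come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    id: str
    min_ram_gb: float
    min_vram_gb: float = 0.0


def check_install(entry: CatalogEntry, ram_gb: float, vram_gb: float = 0.0,
                  force: bool = False) -> tuple[bool, str]:
    """Return (allowed, reason) for installing `entry` on this machine."""
    shortfalls = []
    if ram_gb < entry.min_ram_gb:
        shortfalls.append(f"needs {entry.min_ram_gb:g} GB RAM, have {ram_gb:g} GB")
    if vram_gb < entry.min_vram_gb:
        shortfalls.append(f"needs {entry.min_vram_gb:g} GB VRAM, have {vram_gb:g} GB")
    if not shortfalls:
        return True, "ok"
    if force:
        return True, "forced despite: " + "; ".join(shortfalls)
    return False, "; ".join(shortfalls)


big = CatalogEntry("llama-3.1-405b", min_ram_gb=256)
refused = check_install(big, ram_gb=64)
forced = check_install(big, ram_gb=64, force=True)
print(refused, forced)
```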

🧠

Memory-aware model switching

flock model load --swap releases the least-recently-used model (draining in-flight requests first), then loads the new one — your machine is never overcommitted. --pin protects a model; loaded models come back after restart.

$ flock model load qwen3-coder-30b --swap
→ evict LRU (gemma-3) · drain 2 reqs · load
$ flock model load gemma-3-27b --pin

LRU swap · never overcommitted
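
The swap behaviour can be pictured as least-recently-used eviction with pins. The sketch below is an assumption-laden model, not Flock's implementation: `ModelMemory` is a hypothetical class, memory use is approximated by the model's listed size, and draining in-flight requests is reduced to a comment.

```python
from collections import OrderedDict


class ModelMemory:
    """Tracks loaded models in least-recently-used order within a RAM budget."""

    def __init__(self, budget_gb: float):
        self.budget_gb = budget_gb
        self.loaded: "OrderedDict[str, float]" = OrderedDict()  # id -> GB, oldest first
        self.pinned: set[str] = set()

    def touch(self, model_id: str) -> None:
        """Mark a loaded model as just used."""
        self.loaded.move_to_end(model_id)

    def load(self, model_id: str, size_gb: float, pin: bool = False) -> list[str]:
        """Load a model, evicting unpinned LRU models as needed; return evictions."""
        if pin:
            self.pinned.add(model_id)
        if model_id in self.loaded:
            self.touch(model_id)
            return []
        free = self.budget_gb - sum(self.loaded.values())
        evicted = []
        for candidate in list(self.loaded):  # oldest first
            if free >= size_gb:
                break
            if candidate in self.pinned:
                continue
            # A real gateway would drain in-flight requests before unloading.
            free += self.loaded[candidate]
            evicted.append(candidate)
        if free < size_gb:
            self.pinned.discard(model_id)
            raise MemoryError(f"{model_id} does not fit even after eviction")
        for candidate in evicted:
            del self.loaded[candidate]
        self.loaded[model_id] = size_gb
        return evicted


mem = ModelMemory(budget_gb=24)
mem.load("gpt-oss-20b", 14, pin=True)
mem.load("qwen-coder-7b", 4.7)
mem.load("llama-3.2-1b", 1.3)
mem.touch("qwen-coder-7b")  # llama-3.2-1b is now the least recently used
evicted = mem.load("qwen-coder-14b", 9.0)
print(evicted, list(mem.loaded))
```

Evictions are computed before anything is unloaded, so a model that cannot fit leaves the loaded set untouched.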

🔑

Multi-tenant auth

Per-user API keys (sha256-hashed). Scopes: admin / user / node. Daily token quotas. Revocation immediate.

$ flock token create --user alice \
      --scope user --quota 1M
$ flock token revoke alice   # immediate

per-user keys · scopes · quotas
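
As a sketch of what hashed keys with daily quotas involve, the store below keeps only the SHA-256 digest of each key and checks the quota on every call. `KeyStore`, `KeyRecord`, and the random key format are hypothetical; only the `sk-orc-` prefix, the three scopes, and the "0 means unlimited" convention come from this document.

```python
import hashlib
import secrets
from dataclasses import dataclass


def hash_key(raw_key: str) -> str:
    """SHA-256 hex digest of a raw API key."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


@dataclass
class KeyRecord:
    user: str
    scope: str        # "admin", "user", or "node"
    daily_quota: int  # tokens per day; 0 means unlimited
    used_today: int = 0
    revoked: bool = False


class KeyStore:
    def __init__(self) -> None:
        self._by_hash: dict[str, KeyRecord] = {}

    def create(self, user: str, scope: str = "user", daily_quota: int = 0) -> str:
        raw = "sk-orc-" + secrets.token_urlsafe(24)
        self._by_hash[hash_key(raw)] = KeyRecord(user, scope, daily_quota)
        return raw  # shown once; only the hash is stored

    def authorize(self, raw_key: str, tokens: int) -> KeyRecord:
        record = self._by_hash.get(hash_key(raw_key))
        if record is None or record.revoked:
            raise PermissionError("unknown or revoked key")
        if record.daily_quota and record.used_today + tokens > record.daily_quota:
            raise PermissionError("daily quota exceeded")
        record.used_today += tokens
        return record

    def revoke_user(self, user: str) -> None:
        """Revocation takes effect on the next authorize() call."""
        for record in self._by_hash.values():
            if record.user == user:
                record.revoked = True


store = KeyStore()
alice_key = store.create("alice", daily_quota=100_000)
store.authorize(alice_key, tokens=60_000)
try:
    store.authorize(alice_key, tokens=50_000)
    outcome = "ok"
except PermissionError as exc:
    outcome = str(exc)
print(outcome)
```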

📊

Usage + audit

Every request recorded (user, model, tokens, latency, outcome). Admin actions audited. flock usage and flock audit read it back.

$ flock usage                # tokens by user/model
$ flock audit                # admin action log

every request recorded

📈

Prometheus + Grafana

/metrics exposes RPS, latency, tokens, model-loaded gauges. Three importable Grafana dashboards ship in dashboards/ — cluster overview, per-model, per-node.

$ curl localhost:8080/metrics
# import dashboards/ into Grafana

Prometheus + Grafana

🌐

Multi-node routing

flock join. Router picks local-first, then least-loaded worker. Heartbeats reconcile placements every 5s.

$ flock join http://leader:8080 --token …
→ local-first, then least-loaded worker

leader + workers · least-loaded
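
The placement rule, local first and then the least-loaded worker, can be sketched as follows. `Node` and `pick_node` are hypothetical, and measuring load as in-flight requests is an assumption; the node ids mirror the mocked dashboard.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    id: str
    state: str = "ready"  # "ready" or "draining"
    models: set = field(default_factory=set)
    in_flight: int = 0    # requests currently being served (assumed load metric)


def pick_node(nodes: list[Node], model: str, local_id: str = "local") -> Optional[Node]:
    """Prefer the leader itself, else the ready worker with the fewest in-flight requests."""
    candidates = [n for n in nodes if n.state == "ready" and model in n.models]
    if not candidates:
        return None
    for node in candidates:
        if node.id == local_id:
            return node
    return min(candidates, key=lambda n: (n.in_flight, n.id))  # id breaks ties


nodes = [
    Node("local", models={"llama-3.2-1b"}),
    Node("n_abc123", models={"qwen-coder-7b"}, in_flight=3),
    Node("n_def456", models={"qwen-coder-7b"}, in_flight=1),
    Node("n_ghi789", state="draining", models={"qwen-coder-7b"}),
]

small = pick_node(nodes, "llama-3.2-1b")
coder = pick_node(nodes, "qwen-coder-7b")
missing = pick_node(nodes, "not-installed")
print(small.id, coder.id, missing)
```

The draining node is never chosen even though it is idle, which is what lets a node finish its work before removal.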

🪓

Auto-sharding

flock shard create launches rpc-server on workers + the coordinator llama-server on the leader. One command, full orchestration — run a model too big for any single box.

$ flock shard create --model llama-3.1-405b \
      --workers mac-studio-1,mac-studio-2

split one model across machines

🖥️

Embedded web UI

Tailwind via CDN, vanilla JS, served from /. Tabs covering every admin action — nodes, models, shards, the drag-reorder Routing chain, tokens, usage, audit — plus Connect, Playground, and a multi-turn Chat tester.

# open in a browser
$ open http://localhost:8080
Nodes · Models · Routing · Tokens · Usage

tabs for every admin action

📦

One-line install

Single Go binary. SHA-256 verified. Detects Ollama. Tries user-dirs before sudo.

$ curl -fsSL …/install.sh | sh
→ sha-256 ✓ · single binary · no deps

one binary · sha-256 verified
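
If you prefer to check a downloaded binary yourself, SHA-256 verification takes a few lines. This sketch is independent of the installer script: the helper names are hypothetical, and the demo hashes a throwaway file against the published test vector for the bytes "abc" rather than a real Flock release checksum.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large binaries stay off the heap."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """True when the file's digest matches the expected checksum."""
    return hmac.compare_digest(sha256_file(path), expected_hex.strip().lower())


ABC_SHA256 = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"  # stand-in for a downloaded binary
    sample.write_bytes(b"abc")
    ok = verify(sample, ABC_SHA256)
    tampered = verify(sample, "0" * 64)

print(ok, tampered)
```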

📖

CLI ↔ UI parity

Every admin action works both ways. Every command has --help with examples.

$ flock model load qwen3
≡ click "Load" in the Models tab

CLI ≡ UI

🆓

Apache-2.0

No open-core gotchas. Commercial use, modification, embedding all OK. Patent grant included.

$ cat LICENSE   # Apache-2.0
commercial ✓  modify ✓  embed ✓  patent grant ✓

Apache-2.0 · no gotchas

Multi-machine

Add a second machine

Same install command on every machine. The first becomes the leader; the rest become workers. That's the whole protocol.

1. On the leader

Issue a one-time worker join token.

$ flock token create --node
✔ sk-orc-NodeJoin-AbCd1234…

2. On the new machine

Install Flock + Ollama the same way as before, then join the cluster. Quote the URL so the shell does not treat the ? as a glob character:

$ flock join 'http://leader.local:8080?token=sk-orc-NodeJoin-AbCd1234…'

3. Install a model on the worker

So it has something to serve.
```
$ flock model add qwen-coder-7b
```
✓ Back on the leader, verify
```
$ flock node ls
ID         HOSTNAME    OS/ARCH       STATE
local      mbp-hadi    darwin/arm64  ready
n_abc123   mac-mini    darwin/arm64  ready
```
From now on, any request for qwen-coder-7b automatically routes to the worker. Install the same model on multiple workers → automatic load balancing.
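
The routing rule described in this section (local-first, then least-loaded worker) can be sketched in a few lines. This is an illustration of the idea only, not Flock's router: the `Node` fields (`is_local`, `in_flight`, `models`), the `pick_node` helper, and the second worker `n_def456` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    # Hypothetical view of a cluster node; field names are illustrative.
    node_id: str
    is_local: bool
    state: str = "ready"
    in_flight: int = 0                      # requests currently being served
    models: Set[str] = field(default_factory=set)


def pick_node(nodes: List[Node], model: str) -> Optional[Node]:
    """Prefer the local node if it serves the model, else the least-loaded worker."""
    candidates = [n for n in nodes if n.state == "ready" and model in n.models]
    if not candidates:
        return None
    for n in candidates:
        if n.is_local:
            return n
    return min(candidates, key=lambda n: n.in_flight)


nodes = [
    Node("local", is_local=True, models={"llama-3.2-1b"}),
    Node("n_abc123", is_local=False, in_flight=2, models={"qwen-coder-7b"}),
    Node("n_def456", is_local=False, in_flight=0, models={"qwen-coder-7b"}),
]
print(pick_node(nodes, "qwen-coder-7b").node_id)   # n_def456 (idle worker wins)
print(pick_node(nodes, "llama-3.2-1b").node_id)    # local
```

With the same model installed on two workers, requests drift toward whichever one has fewer requests in flight, which is the load-balancing effect described above.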

41 curated models

Qwen 3.6, gpt-oss, Llama 4, Gemma 4, GLM 5.2, DeepSeek V4, Kimi K2.6, Nemotron 3 Ultra, Qwen3-VL, Pixtral…

Flock ships a curated catalog of 41 open-weight models (chat, code, reasoning, vision, and embeddings), ranging from ~1 GB edge models to a 550 B hybrid Mamba-MoE and a 1 T-parameter sharded frontier model. Use any of them, install any other Ollama model, or wire up vLLM / MLX-LM for higher throughput.

Catalog id	What it's for	Size	Min RAM
Embedding — for RAG / retrieval
nomic-embed-text	768-dim, 8K ctx — drop-in for OpenAI `text-embedding-*`	0.27 GB	2 GB
Edge — laptop
llama-3.2-1b	smoke test, fastest	1.3 GB	2 GB
llama-3.2-3b	small fast chat	2.0 GB	4 GB
Small — 8–16 GB box
qwen-coder-7b	code completion + chat	4.7 GB	8 GB
deepseek-r1-8b	distilled reasoning ("thinking")	4.9 GB	12 GB
mimo-7b	Xiaomi reasoning-focused dense	4.5 GB	8 GB
lfm2.5-8b-a1b ⭐	best on-device MoE (1 B active)	5.0 GB	8 GB
qwen3-8b	general chat, balanced	5.2 GB	12 GB
glm-4-9b	Z.ai dense chat, 128 K context	5.5 GB	12 GB
mellum2-12b	JetBrains coder MoE (2.5 B active, Apache-2.0)	7.0 GB	12 GB
mistral-nemo-12b	128 K context, multilingual	7.1 GB	12 GB
gemma4-12b	multimodal (text + image; audio declared, route pending)	7.6 GB	12 GB
qwen3-14b	more capable Qwen 3 chat	9.0 GB	16 GB
qwen-coder-14b	code + agent (proven)	9.0 GB	16 GB
phi-4-14b	strong reasoning per byte	9.1 GB	12 GB
Mid — 24–32 GB box
gpt-oss-20b ⭐	OpenAI open-weight; adjustable thinking	14 GB	16 GB
qwen3.6-27b ⭐	77 % SWE-bench; top consumer pick	17 GB	24 GB
gemma4-26b	MoE 4 B active; multimodal vision	18 GB	24 GB
qwen3-30b	MoE 3 B active; very fast	19 GB	24 GB
qwen3-coder-30b	MoE 3.3 B active code agent	19 GB	24 GB
qwen-coder-32b	dense code agent (older, proven)	20 GB	32 GB
Vision & multimodal — `image_url` on /v1/chat/completions
moondream3	tiny VLM — runs on a Raspberry Pi	1.2 GB	4 GB
mimo-vl-7b	charts, UI, screenshots (Xiaomi)	4.8 GB	8 GB
qwen3-vl-8b	strong OCR / charts / UI	5.5 GB	10 GB
gemma4-e4b	edge multimodal (text + image + audio)	9.6 GB	12 GB
pixtral-12b	Mistral Nemo + visual encoder	7.8 GB	16 GB
qwen3-vl-32b	frontier-tier vision-language	20 GB	32 GB
gemma4-31b	larger multimodal Gemma 4	20 GB	32 GB
Power user — single 80 GB GPU / 2-node sharded
llama-3.3-70b-sharded	frontier-ish, ≥ 2 nodes	43 GB	48 GB
gpt-oss-120b	≈ o4-mini reasoning, single H100	65 GB	80 GB
llama-4-scout	10 M context, multimodal (109 B MoE)	67 GB	80 GB
glm-4.5-air-sharded	106 B MoE / 12 B active — agentic	70 GB	80 GB
Frontier — multi-machine sharded
step-3.7-flash-sharded ⭐	198 B MoE / 11 B active VLM — Apache-2.0, ~400 tok/s	100 GB	128 GB
deepseek-v4-flash-sharded ⭐	284 B MoE / 13 B active — cost-efficient frontier	150 GB	160 GB
nemotron-3-ultra-sharded	550 B hybrid Mamba-MoE / 55 B active — 1 M context, MMLU 89.1	280 GB	320 GB
glm-4.6-sharded	357 B MoE / 32 B active — agentic coder	200 GB	224 GB
glm-5.1-sharded	754 B MoE / 40 B active — best agentic coder	400 GB	416 GB
glm-5.2-sharded ⭐	MoE, 1 M context — newest Z.ai frontier	460 GB	480 GB
kimi-k2.6-sharded	1 T MoE / 32 B active — # 1 open coding	500 GB	512 GB

# Install a catalog model
$ flock model add qwen3.6-27b

# Use it via the API
$ curl localhost:8080/v1/chat/completions \
    -H 'Authorization: Bearer sk-orc-...' \
    -d '{"model":"qwen3.6-27b","messages":[…]}'

# Or in Claude Code
$ export ANTHROPIC_MODEL=qwen3.6-27b
$ claude

# Use any Ollama model (no catalog entry needed)
$ ollama pull qwen3:0.6b
$ curl localhost:8080/v1/chat/completions \
    -H 'Authorization: Bearer sk-orc-...' \
    -d '{"model":"qwen3:0.6b","messages":[…]}'

# Or swap engines entirely
$ export FLOCK_ENGINE=vllm
$ export FLOCK_VLLM_ENDPOINT=http://gpu:8000
$ flock up

For the complete per-model walkthrough — picker table with code/chat/reasoning/vision ratings, install + use snippets for every client (curl / Cursor / Claude Code / SDKs) — see MODELS.md.

Start here

qwen3.6-27b ⭐

The single best default if you have ≥ 24 GB RAM. 77 % SWE-bench, Apache-2.0, strong code + agent. Works great with Claude Code and Cursor.

flock model add qwen3.6-27b

Tight on RAM

gpt-oss-20b

OpenAI's open-weight model, Apache-2.0, adjustable reasoning effort, fits a 16 GB box. ≈ o3-mini quality on reasoning benchmarks.

flock model add gpt-oss-20b

Frontier tier

deepseek-v4-flash-sharded

Frontier reasoning quality at consumer cost — 284 B MoE / 13 B active means fast inference. Splits cleanly across 2 nodes via llama.cpp RPC.

flock shard create \
  deepseek-v4-flash-sharded 2

Reversible

Try Flock without commitment

Pointing Claude Code at Flock is just three env vars. Going back to api.anthropic.com is unsetting them.

Switch to Flock

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=sk-orc-...
export ANTHROPIC_MODEL=llama-3.2-1b

claude

Switch back to real Anthropic

flock disconnect claude-code

# prints the exact unset + export
# commands — same for every
# supported client.

Manually: unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_MODEL, then export ANTHROPIC_API_KEY=sk-ant-.... Or just open a fresh terminal — Claude Code defaults to api.anthropic.com when the BASE_URL var isn't set.

Hybrid (recommended)

# Keep Flock vars set,
# add real Anthropic key:
export ANTHROPIC_API_KEY=sk-ant-...

flock up  # restart

Now --model claude-opus-4-7 transparently proxies to real Anthropic. Local models stay free. Same claude, you pick per-prompt.

Compare

vs the alternatives

Flock sits at the intersection of three categories that mostly don't overlap.

Feature	Flock	Ollama	LiteLLM	exo	LocalAI
OpenAI-compatible API	✓	✓	✓	✓	✓
Anthropic-compatible API (Claude Code)	✓	✗	✓	✗	✗
Per-user API keys + quotas	✓	✗	✓	✗	✗
Audit log	✓	✗	✓	✗	✗
Multi-machine routing	✓	✗	✗	✓	✗
Auto-sharding (one model across N machines)	✓	✗	✗	✓	✗
Hybrid local + vendor fallback	✓	✗	✓	✗	✗
20+ providers + cross-provider `auto` chain	✓	✗	✓	✗	✗
Multi-key rotation + 429 failover	✓	✗	partial	✗	✗
Embedded admin UI	✓	✗	✗	✗	partial
Single binary (no Python/Docker/k8s)	✓	✓	✗	✗	partial
Apache-2.0	✓	✓	✓	✓	✓

Honest framing: any single feature above is available in one of the alternatives. The combination — OpenAI + Anthropic + multi-tenant + multi-node + sharding + UI, all in one Go binary — is what Flock uniquely offers.

Docs

CLI reference

Every command supports --help with examples.

Lifecycle

flock up: Start local node (leader on first run)
flock down: Stop the local node
flock status: Cluster status summary
flock join <url>?token=…: Join as a worker
flock doctor: Diagnose common problems
flock update: In-place upgrade to latest release
flock version: Print version

Nodes

flock node ls: List nodes
flock node show <id>: Inspect a node
flock node drain <id>: Stop routing to it
flock node remove <id>: Forget a node

Models

flock model search [q]: Browse catalog
flock model add <id>: Install (auto-delegates if sharded)
flock model ls: List installed
flock model remove <id>: Uninstall

Sharded models

flock shard create <m> [N]: Orchestrate sharded model across N workers
flock shard ls: List shards
flock shard remove <m>: Tear down

Routing chain (`model="auto"`)

flock route ls: Show the cross-provider chain
flock route add <model>: Append (or --first) a provider/model
flock route mv <model> <n>: Reorder a chain entry
flock route reset: Recompute free → cheap → paid → local

Tokens / users

flock token create [name]: Issue API key (--admin, --node)
flock token ls: List API keys
flock token revoke <id>: Revoke a key

Observability + config

flock usage [--limit N]: Recent inference records
flock audit [--limit N]: Recent admin actions
flock config show: Effective config (secrets redacted)

Connect your tools

Cursor / Continue / Aider

Set OpenAI base URL:

http://localhost:8080/v1

API key: sk-orc-…

Claude Code

ANTHROPIC_BASE_URL=http://localhost:8080
ANTHROPIC_AUTH_TOKEN=sk-orc-…
ANTHROPIC_MODEL=llama-3.2-1b

OpenAI / Anthropic SDK

from openai import OpenAI

client = OpenAI(
  base_url="http://localhost:8080/v1",
  api_key="sk-orc-…",
)

QUICKSTART.md →

3-min new-user landing page with diagrams

README.md →

Full reference: API, config, troubleshooting

ARCHITECTURE.md →

Contributor deep-dive into internals

Security

Security model

Flock assumes a trusted network (LAN or Tailscale) for cluster traffic. The lists below spell out what is protected and what is not.

What's strongly protected

User API keys stored as sha256 hashes — plaintext shown only at creation
Worker HTTP servers bind to the mesh address (LAN / tailnet IP), never 0.0.0.0
Web UI auth by pasted admin key (in browser localStorage)
Quotas + audit log limit damage from a leaked key
Vendor fallback uses team-scoped vendor keys, never the user's
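
Storing only a hash means a leaked database does not hand out usable keys. A minimal sketch of that pattern using Python's standard library; the function names, the random-part length, and the use of a constant-time comparison are assumptions made for illustration, not Flock's actual code:

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def new_api_key(prefix: str = "sk-orc-") -> Tuple[str, str]:
    """Return (plaintext, sha256_hex). Only the hash would be persisted."""
    plaintext = prefix + secrets.token_urlsafe(24)
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, digest


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare against the stored hash in constant time."""
    digest = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, stored_hash)


key, stored = new_api_key()                 # plaintext is shown once, then discarded
print(verify_api_key(key, stored))          # True
print(verify_api_key(key + "x", stored))    # False
```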

What requires LAN trust

Worker tokens are stored plaintext in nodes.worker_token on the leader's SQLite
Anyone with read access to the leader's DB can impersonate a worker
HMAC-SHA256 mutual auth between leader and workers is shipped — signatures travel instead of tokens; set FLOCK_REJECT_BEARER=1 on workers to require it
For hostile networks run the cluster behind Tailscale or a zero-trust overlay
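
To illustrate what "signatures travel instead of tokens" means: both sides hold the shared worker token, and each request carries an HMAC-SHA256 over its contents rather than the token itself. The message layout (timestamp, method, path, body), the 30-second skew window, and the `/heartbeat` path below are assumptions made for this sketch; Flock's wire format may differ.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(token: str, method: str, path: str, body: bytes, ts: int) -> str:
    """HMAC-SHA256 over timestamp, method, path, and body, keyed by the shared token."""
    msg = f"{ts}\n{method}\n{path}\n".encode("utf-8") + body
    return hmac.new(token.encode("utf-8"), msg, hashlib.sha256).hexdigest()


def verify(token: str, method: str, path: str, body: bytes, ts: int,
           signature: str, now: Optional[int] = None, max_skew: int = 30) -> bool:
    """Reject stale timestamps, then compare signatures in constant time."""
    now = int(time.time()) if now is None else now
    if abs(now - ts) > max_skew:
        return False
    expected = sign(token, method, path, body, ts)
    return hmac.compare_digest(expected, signature)


ts = 1_700_000_000
sig = sign("worker-secret", "POST", "/heartbeat", b"{}", ts)
print(verify("worker-secret", "POST", "/heartbeat", b"{}", ts, sig, now=ts + 5))        # True
print(verify("worker-secret", "POST", "/heartbeat", b'{"x":1}', ts, sig, now=ts + 5))   # False: body changed
print(verify("worker-secret", "POST", "/heartbeat", b"{}", ts, sig, now=ts + 120))      # False: too old
```

Including a timestamp in the signed message is one common way to limit replay of a captured request; it does not replace an encrypted overlay on a hostile network.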

Free. Open source. No telemetry.

Flock is released under the Apache License 2.0. You can use it commercially, modify it, embed it in your own products, redistribute it. No "open core" gotchas. No "free for personal use only" clauses. No SaaS plan to upgrade to.

No license fee. Ever.

The binary you download from GitHub Releases is the same binary a Fortune-500 would use. There is no Pro / Enterprise / Cloud tier hiding the features you actually want.

⚖️

Apache-2.0 — actually permissive

✓ Commercial use · ✓ Modification · ✓ Distribution · ✓ Patent grant included · ✓ Private use. The only requirements: keep the license + notice, state significant changes you made.

🔒

Your data stays yours

No phone-home. No analytics. No "anonymized" telemetry that's actually fingerprinted. The binary doesn't open outbound connections except to engines you configure (Ollama / vendor APIs you opt in to).

What it could save you

A team of 10 devs running modern AI tools heavily can burn $200–500 per dev per month in API tokens. That's $24–60k/year, scaling linearly with usage. Flock moves the 80% of "easy" calls to your own hardware, at no per-token cost, and keeps the optional escape hatch to real Claude / GPT for the 20% that actually need it.


Rough monthly cost (10 devs · heavy use)

Claude API (Sonnet)	~$3,000
OpenAI API (gpt-4o)	~$2,500
OpenRouter / vendor proxy	$2,500 + markup
Flock + your own hardware	~$50 (electricity)

Hardware (~$16k for the team-of-10 build) pays back in ~5 months. Stack works for years after.
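
The figures in this section can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted here and assumes the Claude API (Sonnet) row is the baseline being replaced in full; a hybrid setup that still sends some traffic to a vendor would pay back more slowly.

```python
devs = 10
per_dev_low, per_dev_high = 200, 500     # USD per dev per month, from the text

annual_low = devs * per_dev_low * 12
annual_high = devs * per_dev_high * 12
print(f"${annual_low:,} to ${annual_high:,} per year")   # $24,000 to $60,000 per year

hardware_cost = 16_000    # USD, one-off team-of-10 build
api_monthly = 3_000       # USD, Claude API (Sonnet) row
flock_monthly = 50        # USD, electricity row

monthly_saving = api_monthly - flock_monthly
payback_months = hardware_cost / monthly_saving
print(f"payback: {payback_months:.1f} months")           # payback: 5.4 months
```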

Roadmap

What's shipped & what's next

✓ Shipped

Core gateway
• OpenAI + Anthropic API surface, streaming, tool use, vision (image_url), embeddings (/v1/embeddings)
• /v1/rerank (Cohere shape, llama-server passthrough) + /v1/audio/transcriptions + /v1/audio/speech shells (whisper / piper endpoints)
• Ollama / vLLM / MLX / llama.cpp drivers (single-node + RPC)
• Multi-node routing with heartbeat + placements (LAN mesh)
• Sharding auto-orchestration + shard crash auto-restart + auto-distribution of GGUF weights (sha256-verified)
• 41-model catalog + license metadata + flock model add hf:/ollama:/file: + --from <my.yaml> for non-catalog installs
Multi-tenancy & auth
• Per-user API keys, scopes (admin / user / node), TTL expiry (--ttl 7d, renew, expire)
• Per-key model allowlist (literal + claude-* glob) — 403 model_not_allowed + audit row
• Per-key RPM + TPM rate limits (leaky-bucket) with reconciliation on actual usage
• Daily token quotas + dollar budgets (day/week/month windows; multiple budgets compose with AND)
• Per-call $ cost tracking (vendor pricing table + catalog override; cost_usd snapshotted on every usage row)
• Standard X-RateLimit-* response headers + X-Flock-Request-Id correlation
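
As an illustration of a per-key TPM limit with reconciliation: a leaky bucket is charged an estimate when the request is admitted, then corrected once the real token count is known. The class and method names are hypothetical, and the clock is passed in explicitly to keep the example deterministic; this is a sketch of the general technique, not Flock's limiter.

```python
class LeakyBucket:
    """Leaky bucket sized in tokens per minute; drains continuously."""

    def __init__(self, capacity: float, now: float = 0.0):
        self.capacity = capacity
        self.leak_per_sec = capacity / 60.0
        self.level = 0.0
        self.updated = now

    def _leak(self, now: float) -> None:
        # Drain the bucket for the time elapsed since the last update.
        elapsed = max(0.0, now - self.updated)
        self.level = max(0.0, self.level - elapsed * self.leak_per_sec)
        self.updated = now

    def try_add(self, amount: float, now: float) -> bool:
        """Admit the request if the estimated cost fits; otherwise refuse."""
        self._leak(now)
        if self.level + amount > self.capacity:
            return False
        self.level += amount
        return True

    def reconcile(self, estimated: float, actual: float, now: float) -> None:
        """Replace the up-front estimate with the actual usage."""
        self._leak(now)
        self.level = max(0.0, self.level + (actual - estimated))


tpm = LeakyBucket(capacity=6000)
print(tpm.try_add(4000, now=0.0))    # True: estimate reserved
print(tpm.try_add(3000, now=0.0))    # False: 7000 would exceed 6000
tpm.reconcile(estimated=4000, actual=1000, now=0.0)
print(tpm.try_add(3000, now=0.0))    # True: the over-estimate was refunded
```

Charging an estimate first keeps a burst of concurrent requests from all slipping under the limit before any of them has reported real usage.
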
Router intelligence
• Failure-based catalog fallback + typed chains (fallback_on_context_length, fallback_on_content_policy) with error classification
• Per-request overrides (flock.fallbacks, num_retries, retry_backoff_ms, hedge) via body or X-Flock-* headers
• Sticky sessions (KV-cache locality on multi-turn chats) + placement cooldown (circuit breaker for flaky workers)
• Request hedging — fire to top-N least-loaded workers, return whichever responds first
• Latency-aware fallback (p95 trigger)
Hybrid local + cloud
• Cross-provider model="auto" routing chain — persisted, drag-reorderable, free → cheap → paid → local floor (flock route)
• Anthropic + OpenAI passthrough (claude-*, gpt-*)
• 20+ vendor passthroughs: openrouter/, groq/, together/, fireworks/, cohere/, mistral/, perplexity/, deepseek/, cerebras/, nvidia/, gemini/, huggingface/, zai/, github/ … — slash-prefix stripped before forwarding
• Multi-key rotation + 429 failover — stack <PROVIDER>_API_KEY_2..N, round-robin, park-on-429, honor Retry-After
• Bedrock SigV4 (anthropic.* family) + Vertex ADC probe
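
The rotation behaviour listed above (round-robin, park-on-429, honor Retry-After) can be sketched as a small key ring. `KeyRing`, `acquire`, and `park` are hypothetical names, the key strings are placeholders, and the clock is passed in explicitly so the example is deterministic; treat it as one possible shape for the technique rather than Flock's implementation.

```python
from typing import Dict, List, Optional


class KeyRing:
    """Round-robin over API keys, skipping keys parked after a 429."""

    def __init__(self, keys: List[str]):
        self.keys = list(keys)
        self.parked_until: Dict[str, float] = {k: 0.0 for k in self.keys}
        self._next = 0

    def acquire(self, now: float) -> Optional[str]:
        """Return the next unparked key in rotation, or None if all are parked."""
        for i in range(len(self.keys)):
            idx = (self._next + i) % len(self.keys)
            key = self.keys[idx]
            if self.parked_until[key] <= now:
                self._next = (idx + 1) % len(self.keys)
                return key
        return None

    def park(self, key: str, now: float, retry_after: float = 1.0) -> None:
        """Call on a 429: keep the key out of rotation for Retry-After seconds."""
        self.parked_until[key] = now + retry_after


ring = KeyRing(["key-1", "key-2", "key-3"])
print(ring.acquire(now=0.0))                     # key-1
ring.park("key-2", now=0.0, retry_after=30.0)    # upstream answered 429 for key-2
print(ring.acquire(now=0.0))                     # key-3 (key-2 is skipped)
print(ring.acquire(now=0.0))                     # key-1
print(ring.acquire(now=31.0))                    # key-2 (parking has expired)
```

When `acquire` returns None every key is parked, which is the point where a caller would fall through to the next provider in the chain.
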
Observability & policy
• Prometheus metrics, OTLP traces end-to-end across all four engine drivers, reference Grafana dashboards
• Webhook + Langfuse + S3 callbacks — usage / audit fan-out with HMAC-signed payloads, bounded queues
• Guardrails framework — pre-call webhook hook that can block / rewrite / flag (PII redaction, prompt-injection checks)
• Response cache for embeddings (memory or SQLite-backed; canonical key; Cache-Control opt-out)
• Time-bucketed usage breakdown (/admin/v1/usage/breakdown?group_by=user,model&bucket=day)
• HMAC mutual auth for worker token (no plaintext on wire)
• Typed engine_unreachable + guardrail_blocked + budget_exceeded errors with actionable hints
DX
• Embedded web UI: live SSE event stream, modal-based confirm/prompt, audit filter, per-row "models / rates / budgets / expiry" editors, $ today KPI, breakdown panel
• 19-client flock connect roster (golden-tested), interactive picker, shell completion, --json / --summary / --dry-run everywhere
• Install one-liner + signed binaries + .deb / .rpm packages + 2-node smoke + nightly single-node e2e

→ Next

• Semantic cache — chromem-go embedded vector store, per-namespace threshold
• OIDC login for the web UI (Google / GitHub / Okta)
• Chat completion caching with streaming replay (today: embeddings only)
• Post-call guardrails on streamed responses (today: pre-call only)
• Vertex body translation (OpenAI / Anthropic → generateContent) — ADC probe wired, translation queued
• Bedrock streaming + non-Anthropic body shapes (amazon, meta, mistral)
• Whisper / Piper engine drivers (today: endpoint proxies; auto-launch + catalog entries queued)
• Router fallback callback events (today: usage + audit sinks ship; fallback is in the audit log only)
• Tailscale (tsnet) mesh for NAT traversal + mTLS
• LoRA adapter loading + live model migration
• Postgres backend for HA control plane
• AMD ROCm path · NAS .spk packages (Synology DSM)

Get started in 3 minutes

Free & open source (Apache-2.0). No signup. No SaaS. Just a binary.

$ curl -fsSL https://raw.githubusercontent.com/llmpy/flock/main/installer/install.sh | sh

