Flock
v1.20.1 · 100% free open source · Apache-2.0 · no telemetry

Self-hosted AI
for your team.
One endpoint. Your hardware.

Flock is the self-hosted control plane for LLMs. One Go binary turns your Macs and Linux boxes into a private inference cluster — multi-machine routing, per-user keys, daily quotas, full audit log, and a built-in admin dashboard, behind one endpoint that speaks both the OpenAI and Anthropic APIs. Engine-agnostic: bring Ollama, vLLM, MLX, or llama.cpp-RPC. Fall back to paid Claude/GPT only when you choose.

Try it in 60 seconds
curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh
flock doctor      # tells you the one command to install an engine, if you don't have one
flock up          # starts your private LLM gateway, prints your admin API key
✓ Mac (Apple Silicon) ✓ Linux (x86_64 / arm64) ✓ Cursor · Claude Code · Aider ✓ No Docker. No Python. No k8s.
The stack

Tools above. Flock in the middle. Engines + models below.

Every client speaks OpenAI or Anthropic. Every engine speaks its native API. Flock is the one URL + one key in between.

Layer 1 Clients · what your team uses
Claude Code
Cursor
Aider
Continue
Zed
Cline
Qwen-Code
Hermes
OpenClaw
OpenCode
Open WebUI
Open Notebook
Goose
Plandex
OpenHands
Codex CLI
OpenAI SDK
Anthropic SDK
curl / HTTP
one URL · one API key
Layer 2 Flock · the gateway you self-host
Flock
single Go binary · embedded UI · no telemetry
API surface
/v1/chat/completions
/v1/messages
/v1/embeddings
Routing
local-first → least-loaded worker · sharded · vendor fallback · latency-aware
Controls
per-user keys · daily quotas · audit · usage · Prometheus + OTLP
native APIs · your hardware
Layer 3a Engines · drive the models
Ollama
Mac · Linux · Windows
vLLM
NVIDIA · throughput
MLX-LM
Apple Silicon
llama.cpp
CPU · GGUF · RPC sharding
Layer 3b Models · 37 curated families + any HF or Ollama tag
Llama 3.x / 4
Qwen 2.5 / 3 / 3.6
Gemma 4
GPT-OSS 20B / 120B
DeepSeek R1 / V4
Mistral Nemo
Phi 4
Nemotron 3
Kimi K2.6
GLM 5.1
Step 3.7
LFM 2.5
Mellum 2
Nomic Embed
+ any HF GGUF
+ any Ollama tag

Mix and match across all three layers. Any client above + any engine below + any model on the bottom row — Flock is the only piece that needs to know about the rest.

Where Flock sits

One layer between your tools and the LLMs

Your tools point at Flock with one URL and one API key. Flock decides — per request — whether to serve from your hardware, fan out across machines, or proxy to a paid vendor. Switching the underlying model is a config change, not a re-wire.

           ┌──────────────────────────────────────────────────────────────┐
           │                       YOUR USE CASES                         │
           │             (the tools your team already uses)               │
           └──────────────────────────────────────────────────────────────┘
                  │           │          │             │            │
                  ▼           ▼          ▼             ▼            ▼
            ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
            │  Cursor  │ │  Claude  │ │  Aider   │ │  Custom  │ │   curl   │
            │          │ │   Code   │ │          │ │ Python   │ │  scripts │
            │          │ │          │ │          │ │   SDK    │ │          │
            └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
                 │  OpenAI    │ Anthropic  │  OpenAI    │  Either    │  HTTP
                 └────────────┴────────────┴────────────┴────────────┘
                                          │
                                          │   ONE URL · ONE API KEY
                                          ▼
      ┌──────────────────────────────────────────────────────────────────────┐
      │                  ⬢ ⬢ ⬢   FLOCK   ⬢ ⬢ ⬢                              │
      │                  (this is what we built)                             │
      │  ────────────────────────────────────────────────────────────────    │
      │  Gateway     OpenAI + Anthropic + /v1/rerank + /v1/audio/*           │
      │              keys: allowlist · RPM/TPM · $ budgets · TTL expiry      │
      │              guardrails · response cache · callbacks · admin UI      │
      │                                                                      │
      │  Router      Same model on N nodes  → load-balance + sticky session  │
      │              Flaky worker          → placement cooldown (skip)        │
      │              Different models      → route by placement              │
      │              Model bigger than node→ split via llama.cpp-RPC         │
      │              Latency-sensitive     → hedge to top-N workers          │
      │              Claude / GPT / 7 more → proxy to vendor                 │
      │              Engine error/timeout  → typed fallback chain + retries  │
      └─────────────────────────────┬────────────────────────────────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              ▼                     ▼                     ▼
       ┌─────────────┐       ┌─────────────┐       ┌─────────────┐
       │   Engines   │       │   Engines   │       │   Egress    │
       │  (any mix)  │       │  (any mix)  │       │   proxy     │
       │  • Ollama   │       │  • Ollama   │       │ Anthropic   │
       │  • vLLM     │       │  • vLLM     │       │ OpenAI      │
       │  • MLX-LM   │       │  • MLX-LM   │       │ Bedrock     │
       │  • llama.cpp│       │  • llama.cpp│       │ OpenRouter  │
       │  • whisper  │       │  • piper    │       │ Groq, Mist- │
       │  • piper    │       │  • whisper  │       │ ral, Cohere │
       └──────┬──────┘       └──────┬──────┘       │ Perplexity… │
              │                     │              └──────┬──────┘
              ▼                     ▼                     ▼
      ┌──────────────────────────────────────────────────────────────────────┐
      │                    UNDERLYING LLMs / WEIGHTS                         │
      │                                                                      │
      │   YOUR HARDWARE                              VENDOR APIs             │
      │   • Mac Studio · Mac Mini                    • Claude (Anthropic)    │
      │   • Linux + RTX GPU                          • GPT, o3, o4 (OpenAI)  │
      │                                                                      │
      │   37 curated catalog models (Qwen 3.6,        Each request routed    │
      │   gpt-oss, Llama 4, Gemma 4, DeepSeek V4,     to EITHER your hard-   │
      │   Kimi K2.6, Nemotron 3 Ultra, vision +       ware OR a vendor —     │
      │   embedding models)                           you pay vendors only   │
      │   + any HuggingFace or Ollama model.          when YOU chose to.     │
      └──────────────────────────────────────────────────────────────────────┘

Without Flock you'd lock into one provider, share one API key, trust the vendor with your prompts, and pay per token. With Flock you change qwen3.6-27b → claude-opus-4-7 in one place, and the dev's editor doesn't know or care.

Shipped catalog · smoke-tested today
Qwen 2.5 Coder · Qwen 3 · Llama 3.2 · Llama 3.3 sharded · DeepSeek R1

Plus any model on HuggingFace via flock model add hf:owner/repo or any Ollama tag via flock model add ollama:<tag>. Vision (image_url content blocks on /v1/chat/completions) and embeddings (/v1/embeddings) ship via the Ollama engine path. Speech endpoints are on the roadmap (see ROADMAP.md).

What Flock does, in plain English

Flock is the missing control plane for self-hosted LLMs — multi-machine routing, per-user keys, quotas, audit, and a built-in dashboard, all behind one API your existing tools already speak. Your team's AI tools talk to one endpoint; Flock decides whether to serve from your own machines (free + private), shard a giant model across several of them, or transparently fall back to real Claude / GPT (paid, logged) — your call.

🧠

Local models, one API

Pick the engine that fits your hardware — Ollama, vLLM, MLX-LM, or llama.cpp-RPC. Flock exposes whichever one you run through /v1/chat/completions (OpenAI) and /v1/messages (Anthropic) — so Claude Code works against your local Llama.
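As a rough illustration of the two request shapes, the sketch below builds the JSON body for each endpoint. The helper names are hypothetical, and treating max_tokens as required on /v1/messages is an assumption carried over from Anthropic's public Messages API, not something this page states about Flock.

```python
import json


def openai_chat_body(model: str, prompt: str) -> dict:
    """Body for POST /v1/chat/completions (OpenAI shape)."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def anthropic_messages_body(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Body for POST /v1/messages (Anthropic shape).

    Assumption: max_tokens is mandatory, as in Anthropic's own API.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


def auth_headers(token: str) -> dict:
    """Bearer-token headers, matching the curl examples on this page."""
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}


if __name__ == "__main__":
    # Same local model, two wire formats.
    print(json.dumps(openai_chat_body("llama-3.2-1b", "hi")))
    print(json.dumps(anthropic_messages_body("llama-3.2-1b", "hi")))
```

Either body can be sent with any HTTP client to the matching path under http://localhost:8080, using the admin or user key as the bearer token.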

🔑

Team-ready out of the box

Per-user API keys with scopes, TTL expiry, model allowlists, RPM/TPM rate limits, $ budgets, audit log, Prometheus + OTLP, embedded admin UI. No nginx, no LiteLLM-plus-Python — just one binary.

🌐

Scales from 1 to N machines

Start on a laptop. Add more machines with one flock join command. Router load-balances replicas. For models too big for any single box, built-in llama.cpp-RPC sharding splits one model across many.

🔁

Switching models is one action

flock model add <id> for catalog models. flock model add hf:owner/repo for anything else on HuggingFace. No hand-written YAML, no manual GGUF downloads, no per-worker setup. Engine, quant, and shard count are picked for you from your hardware (M4-T16 → M4-T20 — shipped).

⚙️

CLI is the source of truth

Every action in the dashboard is one flock command underneath. Anything you can do with the UI, you can do with curl, cron, or an SSH session — same audit log, same validation, same outcome. No web-only knobs.

The honest one-line pitch

The only OSS tool that ships, in one Go binary, all of: OpenAI + Anthropic APIs + 9 vendor passthroughs + per-key allowlists / rate limits / $ budgets / TTL + multi-node routing (sticky / cooldown / hedging) + sharding + response cache + guardrails + webhook / Langfuse callbacks + embedded admin UI — designed for self-hosting on a Mac + Linux team fleet.

Interface

An admin UI that ships with the binary

Embedded via //go:embed. No separate frontend to deploy. Sign in by pasting the admin key Flock prints on first run. Every action also works from the CLI.

http://localhost:8080
Flock
orchestrate open LLMs · your hardware
key: sk-orc-xK9p…
Nodes
4
3 ready · 1 draining
Models
5
1 sharded
Recent requests
2,847
last 200
Tokens served
1.2M
saved ~$340 vs API
# Quick start: paste your admin key into your tools, or use curl:
$ curl http://localhost:8080/v1/chat/completions \
    -H 'Authorization: Bearer sk-orc-xK9p…' \
    -d '{"model":"auto","messages":[{"role":"user","content":"hi"}]}'

Mocked preview. The real UI ships embedded in the Flock binary.

Get started

Install & first chat in 3 minutes

Pick your platform — 4 commands each.

# 1. install Flock (one Go binary, ~23 MB)
curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# 2. install an engine — Ollama is the simplest default on Apple Silicon
#    alternatives: pip install mlx-lm  ·  llama.cpp's llama-server  ·  vLLM in Docker
brew install --cask ollama && open -a Ollama

# 3. start Flock with a small model (~1 GB, fast download)
FLOCK_DEFAULT_MODEL=llama-3.2-1b flock up

💡 Not sure which engine to install? Run flock doctor after step 1 — it inspects your hardware and tells you the single command to run.

After it boots you'll see

✔ default model: llama-3.2-1b
✔ engine: ollama at http://127.0.0.1:11434

  Flock is ready.

  API:    http://localhost:8080/v1
  Health: http://localhost:8080/healthz

  Admin API key (shown once — store it now):
    sk-orc-xK9p…

Copy the admin key. You'll need it next.

Test it (pick one)

curl
curl http://localhost:8080/v1/chat/completions -H "Authorization: Bearer sk-orc-..." -d '{"model":"auto","messages":[…]}'
Web UI
http://localhost:8080 → paste the admin key
Claude Code
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=sk-orc-...
export ANTHROPIC_MODEL=llama-3.2-1b
claude
Team rollout

Wire up Claude Code, Cursor, your whole team

Flock works with every tool that speaks OpenAI or Anthropic. Three ways: flock connect <tool> on the CLI, the Connect tab in the dashboard, or copy a snippet manually. All three use the same code path.

Shipped

CLI

One command per tool
flock connect claude-code
flock connect cursor
flock connect hermes
flock connect open-webui    # ChatGPT-style web UI
flock connect goose         # Block's terminal agent
flock connect plandex       # agentic planner
flock connect openhands     # autonomous coding agent
flock connect codex-cli     # OpenAI's official CLI
flock connect opencode
flock connect --list

Prints config with your base URL + token already substituted. The token is read from $FLOCK_TOKEN or ~/.flock/admin.key.

Invite a teammate
flock invite hadi \
  --quota 100000

Creates user-scope token + share card with snippets for all 10 clients (paste-into-Slack markdown).

Shipped

Dashboard

Connect tab

Dropdown of 10 tools, pre-filled snippet, one-click Copy, Test-connection button that proves the gateway works end-to-end.

Playground tab

In-browser chat — pick a model, send a message, see streaming output. 10-second sanity check before wiring up Cursor.

Invite teammate (in Tokens tab)

Modal with name + quota + clients form → returns the share card with one-click Copy-as-markdown.

Always works

Manual

Claude Code
export ANTHROPIC_BASE_URL=...
export ANTHROPIC_AUTH_TOKEN=sk-orc-...
claude
Cursor

Settings → Models → Override OpenAI Base URL: .../v1

Full per-client snippets in README · Connecting clients.

The team rollout flow

1.
Admin sets up once
flock up on the leader. Add a worker or two if needed.
2.
Invite each teammate
flock invite <name> → paste the output card into Slack.
3.
Teammates paste & go
They copy the snippet for their tool of choice. Done — Claude Code / Cursor now run against your hardware.
Architecture

What runs where

One machine is enough for most teams. Add more when you want throughput, redundancy, or a model that doesn't fit on a single box.

Single machine

Solo dev or small team sharing one box.

Your computer (Mac or Linux)
┌─────────────────────────────────────────────────┐
│                                                 │
│   Cursor / Claude Code / curl / SDKs            │
│                    │                            │
│                    ▼                            │
│               FLOCK :8080                       │
│      (gateway · auth · UI · audit)              │
│                    │                            │
│                    ▼                            │
│              Ollama :11434                      │
│            (the actual LLM)                     │
│                                                 │
└─────────────────────────────────────────────────┘

Multiple machines

Leader + workers. Router decides per request.

     LEADER                          WORKER
┌──────────────┐               ┌──────────────┐
│  flock up    │               │  flock join  │
│  ─────────   │               │  ─────────   │
│  Router      │ ───routes──▶  │  agent       │
│  + UI        │     :8081     │  + Ollama    │
│  + auth      │ ◀──heartbeat──│              │
│  + Ollama    │    every 5s   │  loaded      │
│              │               │  models      │
└──────────────┘               └──────────────┘
                LAN / Tailscale

Sharded model (one big model across many machines)

For models too large for any single box — e.g. Llama 70B Q4 across 2× Mac Mini via llama.cpp RPC. flock shard create <model> <N> does all of this automatically:

Client request for "llama-3.3-70b-sharded"
            │
            ▼
┌──────────────────────┐
│  LEADER (coordinator)│       ┌──────────────────────┐
│  llama-server        │  RPC  │  WORKER A            │
│    --rpc A:50052,    │ ────▶ │  rpc-server :50052   │  (layers 1-40)
│          B:50052     │ ◀──── │  (auto-launched      │
│                      │       │   by Flock)          │
│  serves OpenAI API   │       └──────────────────────┘
│  to clients          │       ┌──────────────────────┐
│                      │ ────▶ │  WORKER B            │
│                      │ ◀──── │  rpc-server :50052   │  (layers 41-80)
└──────────────────────┘       └──────────────────────┘
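The layer ranges in the diagram (1-40 and 41-80 for two workers) correspond to an even, contiguous split. The sketch below shows that arithmetic only. The function name is hypothetical, and an even split is an assumption: llama.cpp can also weight the split by each device's memory, and this page does not say which policy Flock applies.

```python
def split_layers(n_layers: int, n_workers: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first, last) layer ranges, one per worker.

    Layers are divided as evenly as possible; any remainder goes to the
    earliest workers, one extra layer each.
    """
    if n_workers < 1 or n_layers < n_workers:
        raise ValueError("need at least one layer per worker")
    base, extra = divmod(n_layers, n_workers)
    ranges = []
    start = 1
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    # An 80-layer model over two workers, as in the diagram.
    print(split_layers(80, 2))
```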
Features

What's in the box

Everything below is in the Go binary you download. No add-ons, no separate services.

🔌 OpenAI + Anthropic APIs

/v1/chat/completions + /v1/messages. SSE streaming. Tool calls. The whole shape both SDKs expect.

🖼️ Vision (image input)

Send image_url content blocks on the same chat endpoint. Works with Gemma 4, Llama 4 Scout, Qwen3-VL, Step-3.7 (Ollama path).

🧮 Embeddings

/v1/embeddings with nomic-embed-text or any Ollama embedding model. OpenAI-compatible response — drops into any RAG stack.

⚙️ Multi-backend engines

Ollama, vLLM, MLX-LM, llama.cpp (single-node + RPC sharding). Hot-swappable via config. Flock auto-launches llama-server when you pick the llama.cpp engine — no second process to manage.

🔁 Hybrid vendor fallback

Set ANTHROPIC_API_KEY / OPENAI_API_KEY. Requests for claude-* / gpt-* transparently proxy upstream, logged the same as local.

☁️ Bedrock + Vertex egress

Set FLOCK_BEDROCK_REGION and anthropic.* model IDs are signed via real SigV4 (aws-sdk-go-v2). FLOCK_VERTEX_PROJECT wires the ADC auth probe for gemini-*. Body translation for the remaining model families is planned.

🔍 OTLP traces (end-to-end)

Set FLOCK_OTLP_ENDPOINT and get spans for every request: http.request → router.Chat → per-fallback-attempt → ollama.Chat with prompt + completion token counts. W3C traceparent propagation always on, even when export is off.
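For readers unfamiliar with the header being propagated, here is a minimal sketch of the W3C traceparent format (version, 16-byte trace id, 8-byte parent span id, flags, all lowercase hex). The function names are illustrative and this is not Flock's own code. It only shows what a client could send and how a hop derives a child header.

```python
import re
import secrets

# version "00" - 32 hex trace id - 16 hex span id - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span ids."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Keep the trace id and flags, mint a fresh span id for the next hop."""
    if not TRACEPARENT_RE.match(parent):
        raise ValueError("malformed traceparent")
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    root = new_traceparent()
    print(root)
    print(child_traceparent(root))
```

Sending the header as `traceparent: <value>` on a request lets spans from the client and from the gateway share one trace id.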

♻️ Catalog fallback chains

Declare fallback: [next-id, …] in catalog YAML. Router walks the chain in order on engine error / 5xx / timeout / model-not-loaded. Transparent to clients; visible in audit log.
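The chain walk described above can be sketched in a few lines. Everything here is illustrative: EngineError stands in for the engine error / 5xx / timeout / model-not-loaded cases, the fallbacks dict stands in for the catalog's fallback lists, and only the requested model's own chain is walked. The real router's data structures are not shown on this page.

```python
from typing import Callable


class EngineError(Exception):
    """Stand-in for any failure that should trigger the next fallback."""


def chat_with_fallback(
    model: str,
    fallbacks: dict[str, list[str]],
    call: Callable[[str], str],
) -> tuple[str, str]:
    """Try the requested model, then each declared fallback in order.

    Returns (model_that_answered, reply). Raises EngineError if every
    candidate in the chain fails.
    """
    chain = [model] + fallbacks.get(model, [])
    last_error = None
    for candidate in chain:
        try:
            return candidate, call(candidate)
        except EngineError as err:
            last_error = err  # remember why, then move down the chain
    raise EngineError(f"all candidates failed: {chain}") from last_error
```

Returning the model that actually answered is what would let an audit log record the substitution while the client still sees an ordinary response.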

🛡️ Hardware-floor refusal

flock model add checks min_ram_gb / min_vram_gb from the catalog and refuses installs that would oversubscribe. --force overrides when you know better.

🧠 Memory-aware model switching

flock model load --swap releases the least-recently-used model (draining in-flight requests first), then loads the new one — your machine is never overcommitted. --pin protects a model; loaded models come back after restart; flock down frees engine RAM by default.
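A minimal sketch of the swap policy described above, assuming a fixed memory budget and least-recently-used eviction that skips pinned models. The ModelPool class and its method names are hypothetical, and draining in-flight requests is left out.

```python
from collections import OrderedDict


class ModelPool:
    """Tracks loaded models in least-recently-used order."""

    def __init__(self, budget_gb: float) -> None:
        self.budget_gb = budget_gb
        self.loaded: OrderedDict[str, float] = OrderedDict()  # oldest first
        self.pinned: set[str] = set()

    def touch(self, model: str) -> None:
        """Mark a model as just used."""
        self.loaded.move_to_end(model)

    def load_with_swap(self, model: str, size_gb: float) -> list[str]:
        """Load a model, evicting unpinned LRU models until it fits.

        Returns the evicted model ids. Nothing is evicted if the new
        model cannot fit even after releasing every unpinned model.
        """
        if model in self.loaded:
            self.touch(model)
            return []
        free = self.budget_gb - sum(self.loaded.values())
        victims = []
        for name, size in self.loaded.items():  # oldest first
            if free >= size_gb:
                break
            if name in self.pinned:
                continue
            victims.append(name)
            free += size
        if free < size_gb:
            raise MemoryError(f"{model} does not fit in {self.budget_gb} GB")
        for name in victims:
            del self.loaded[name]
        self.loaded[model] = size_gb
        return victims
```

Planning the evictions before applying them keeps the pool unchanged when the load has to be refused.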

🔑 Multi-tenant auth

Per-user API keys (sha256-hashed). Scopes: admin / user / node. Daily token quotas. Revocation immediate.
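To make "sha256-hashed" concrete: the server keeps only a digest and compares digests on each request. This is a sketch of the general technique, not Flock's code. The function names are hypothetical and the key body after the sk-orc- prefix is an arbitrary random string.

```python
import hashlib
import hmac
import secrets


def new_api_key(prefix: str = "sk-orc-") -> tuple[str, str]:
    """Return (plaintext, sha256_hex). Only the hash should be stored."""
    plaintext = prefix + secrets.token_urlsafe(24)
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, digest


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_hash)
```

Because only the digest is stored, the plaintext can be shown once at creation and never recovered later, which matches the "shown once" behaviour described on this page.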

📊 Usage + audit

Every request recorded (user, model, tokens, latency, outcome). Admin actions audited. flock usage and flock audit read it back.

📈 Prometheus + Grafana

/metrics exposes RPS, latency, tokens, model-loaded gauges. Three importable Grafana dashboards ship in dashboards/ — cluster overview, per-model, per-node.

🌐 Multi-node routing

flock join. Router picks local-first, then least-loaded worker. Heartbeats reconcile placements every 5s.
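The selection rule (local first, then the least-loaded worker) might look like the sketch below. The Node fields and the use of in-flight request count as the load measure are assumptions for illustration. The page does not specify how load is measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    id: str
    models: set[str]       # models currently placed on this node
    in_flight: int         # assumed load metric: requests in progress
    is_local: bool = False
    ready: bool = True


def pick_node(model: str, nodes: list[Node]) -> Optional[Node]:
    """Prefer the local node if it has the model, else the least-loaded worker."""
    candidates = [n for n in nodes if n.ready and model in n.models]
    if not candidates:
        return None
    for node in candidates:
        if node.is_local:
            return node
    return min(candidates, key=lambda n: n.in_flight)
```

Returning None is the point where a gateway of this kind would move on to a fallback chain or a vendor passthrough.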

🪓 Auto-sharding

flock shard create launches rpc-server on workers + the coordinator llama-server on the leader. One command, full orchestration.

🖥️ Embedded web UI

Tailwind via CDN, vanilla JS, served from /. 7 tabs covering every admin action.

📦 One-line install

Single Go binary. SHA-256 verified. Detects Ollama. Tries user-dirs before sudo.

📖 CLI ↔ UI parity

Every admin action works both ways. Every command has --help with examples.

🆓 Apache-2.0

No open-core gotchas. Commercial use, modification, embedding all OK. Patent grant included.

Multi-machine

Add a second machine

Same install command on every machine. The first becomes the leader; the rest become workers. That's the whole protocol.

  1. On the leader

    Issue a one-time worker join token.

    $ flock token create --node
    ✔ sk-orc-NodeJoin-AbCd1234…
  2. On the new machine

    Install Flock + Ollama the same way as before, then (the URL is quoted so the shell does not treat ? as a glob):

    $ flock join 'http://leader.local:8080?token=sk-orc-NodeJoin-AbCd1234…'
  3. Install a model on the worker

    So it has something to serve.

    $ flock model add qwen-coder-7b
  4. Back on the leader, verify
    $ flock node ls
    ID         HOSTNAME    OS/ARCH       STATE
    local      mbp-hadi    darwin/arm64  ready
    n_abc123   mac-mini    darwin/arm64  ready

    From now on, any request for qwen-coder-7b automatically routes to the worker. Install the same model on multiple workers → automatic load balancing.

37 curated models

Qwen 3.6, gpt-oss, Llama 4, Gemma 4, DeepSeek V4, Kimi K2.6, Nemotron 3 Ultra…

Flock ships a curated catalog of 37 open-weight models — chat, code, reasoning, vision, and embeddings — spanning 1 GB edge MoEs to 550 B hybrid Mamba-Transformers and 1 T-parameter sharded frontier. Use any of them, install any other Ollama model, or wire up vLLM / MLX-LM for higher throughput.

| Catalog id | What it's for | Size | Min RAM |
| --- | --- | --- | --- |
| Embedding (for RAG / retrieval) | | | |
| nomic-embed-text | 768-dim, 8K ctx; drop-in for OpenAI text-embedding-* | 0.27 GB | 2 GB |
| Edge (laptop) | | | |
| llama-3.2-1b | smoke test, fastest | 1.3 GB | 2 GB |
| llama-3.2-3b | small fast chat | 2.0 GB | 4 GB |
| Small (8–16 GB box) | | | |
| qwen-coder-7b | code completion + chat | 4.7 GB | 8 GB |
| deepseek-r1-8b | distilled reasoning ("thinking") | 4.9 GB | 12 GB |
| lfm2.5-8b-a1b ⭐ | best on-device MoE (1 B active) | 5.0 GB | 8 GB |
| qwen3-8b | general chat, balanced | 5.2 GB | 12 GB |
| mellum2-12b | JetBrains coder MoE (2.5 B active, Apache-2.0) | 7.0 GB | 12 GB |
| mistral-nemo-12b | 128 K context, multilingual | 7.1 GB | 12 GB |
| gemma4-12b | multimodal (text + image; audio declared, route pending) | 7.6 GB | 12 GB |
| qwen3-14b | more capable Qwen 3 chat | 9.0 GB | 16 GB |
| qwen-coder-14b | code + agent (proven) | 9.0 GB | 16 GB |
| phi-4-14b | strong reasoning per byte | 9.1 GB | 12 GB |
| Mid (24–32 GB box) | | | |
| gpt-oss-20b ⭐ | OpenAI open-weight; adjustable thinking | 14 GB | 16 GB |
| qwen3.6-27b ⭐ | 77 % SWE-bench; top consumer pick | 17 GB | 24 GB |
| gemma4-26b | MoE 4 B active; multimodal vision | 18 GB | 24 GB |
| qwen3-30b | MoE 3 B active; very fast | 19 GB | 24 GB |
| qwen3-coder-30b | MoE 3.3 B active code agent | 19 GB | 24 GB |
| qwen-coder-32b | dense code agent (older, proven) | 20 GB | 32 GB |
| Power user (single 80 GB GPU / 2-node sharded) | | | |
| llama-3.3-70b-sharded | frontier-ish, ≥ 2 nodes | 43 GB | 48 GB |
| gpt-oss-120b | ≈ o4-mini reasoning, single H100 | 65 GB | 80 GB |
| llama-4-scout | 10 M context, multimodal (109 B MoE) | 67 GB | 80 GB |
| Frontier (multi-machine sharded) | | | |
| step-3.7-flash-sharded ⭐ | 198 B MoE / 11 B active VLM; Apache-2.0, ~400 tok/s | 100 GB | 128 GB |
| deepseek-v4-flash-sharded ⭐ | 284 B MoE / 13 B active; cost-efficient frontier | 150 GB | 160 GB |
| nemotron-3-ultra-sharded | 550 B hybrid Mamba-MoE / 55 B active; 1 M context, MMLU 89.1 | 280 GB | 320 GB |
| glm-5.1-sharded | 754 B MoE / 40 B active; best agentic coder | 400 GB | 416 GB |
| kimi-k2.6-sharded | 1 T MoE / 32 B active; #1 open coding | 500 GB | 512 GB |
# Install a catalog model
$ flock model add qwen3.6-27b

# Use it via the API
$ curl http://localhost:8080/v1/chat/completions \
    -H 'Authorization: Bearer sk-orc-...' \
    -d '{"model":"qwen3.6-27b","messages":[…]}'

# Or in Claude Code
$ export ANTHROPIC_MODEL=qwen3.6-27b
$ claude

# Use any Ollama model (no catalog entry needed)
$ ollama pull qwen3:0.6b
$ curl http://localhost:8080/v1/chat/completions \
    -H 'Authorization: Bearer sk-orc-...' \
    -d '{"model":"qwen3:0.6b","messages":[…]}'

# Or swap engines entirely
$ export FLOCK_ENGINE=vllm
$ export FLOCK_VLLM_ENDPOINT=http://gpu:8000
$ flock up

For the complete per-model walkthrough — picker table with code/chat/reasoning/vision ratings, install + use snippets for every client (curl / Cursor / Claude Code / SDKs) — see MODELS.md.

Start here
qwen3.6-27b ⭐

The single best default if you have ≥ 24 GB RAM. 77 % SWE-bench, Apache-2.0, strong code + agent. Works great with Claude Code and Cursor.

flock model add qwen3.6-27b
Tight on RAM
gpt-oss-20b

OpenAI's open-weight model, Apache-2.0, adjustable reasoning effort, fits a 16 GB box. ≈ o3-mini quality on reasoning benchmarks.

flock model add gpt-oss-20b
Frontier tier
deepseek-v4-flash-sharded

Frontier reasoning quality at consumer cost — 284 B MoE / 13 B active means fast inference. Splits cleanly across 2 nodes via llama.cpp RPC.

flock shard create \
  deepseek-v4-flash-sharded 2
Reversible

Try Flock without commitment

Pointing Claude Code at Flock is just three env vars. Going back to api.anthropic.com is unsetting them.

Switch to Flock
export ANTHROPIC_BASE_URL=\
  http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=\
  sk-orc-...
export ANTHROPIC_MODEL=\
  llama-3.2-1b

claude
Switch back to real Anthropic
flock disconnect claude-code

# prints the exact unset + export
# commands — same for every
# supported client.

Manually: unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_MODEL, then export ANTHROPIC_API_KEY=sk-ant-.... Or just open a fresh terminal — Claude Code defaults to api.anthropic.com when the BASE_URL var isn't set.

Hybrid (recommended)
# Keep Flock vars set,
# add real Anthropic key:
export ANTHROPIC_API_KEY=\
  sk-ant-...

flock up  # restart

Now --model claude-opus-4-7 transparently proxies to real Anthropic. Local models stay free. Same claude, you pick per-prompt.

Compare

vs the alternatives

Flock sits at the intersection of three categories that mostly don't overlap.

Feature Flock Ollama LiteLLM exo LocalAI
OpenAI-compatible API
Anthropic-compatible API (Claude Code)
Per-user API keys + quotas
Audit log
Multi-machine routing
Auto-sharding (one model across N machines)
Hybrid local + vendor fallback
Embedded admin UIpartial
Single binary (no Python/Docker/k8s)partial
Apache-2.0
Honest framing: any single feature above is available in one of the alternatives. The combination — OpenAI + Anthropic + multi-tenant + multi-node + sharding + UI, all in one Go binary — is what Flock uniquely offers.
Docs

CLI reference

Every command supports --help with examples.

Lifecycle

flock up
Start local node (leader on first run)
flock down
Stop the local node
flock status
Cluster status summary
flock join <url>?token=…
Join as a worker
flock doctor
Diagnose common problems
flock update
In-place upgrade to latest release
flock version
Print version

Nodes

flock node ls
List nodes
flock node show <id>
Inspect a node
flock node drain <id>
Stop routing to it
flock node remove <id>
Forget a node

Models

flock model search [q]
Browse catalog
flock model add <id>
Install (auto-delegates if sharded)
flock model ls
List installed
flock model remove <id>
Uninstall

Sharded models

flock shard create <m> [N]
Orchestrate sharded model across N workers
flock shard ls
List shards
flock shard remove <m>
Tear down

Tokens / users

flock token create [name]
Issue API key (--admin, --node)
flock token ls
List API keys
flock token revoke <id>
Revoke a key

Observability + config

flock usage [--limit N]
Recent inference records
flock audit [--limit N]
Recent admin actions
flock config show
Effective config (secrets redacted)

Connect your tools

Cursor / Continue / Aider

Set OpenAI base URL:

http://localhost:8080/v1

API key: sk-orc-…

Claude Code
ANTHROPIC_BASE_URL=http://localhost:8080
ANTHROPIC_AUTH_TOKEN=sk-orc-…
ANTHROPIC_MODEL=llama-3.2-1b
OpenAI / Anthropic SDK
OpenAI(
  base_url="http://localhost:8080/v1",
  api_key="sk-orc-…"
)
Security

Security model

Flock assumes a trusted network (LAN or Tailscale) for cluster traffic. Honest about what's protected and what isn't.

What's strongly protected

  • User API keys stored as sha256 hashes — plaintext shown only at creation
  • Worker HTTP servers bind to the mesh address (LAN / tailnet IP), never 0.0.0.0
  • Web UI auth by pasted admin key (in browser localStorage)
  • Quotas + audit log limit damage from a leaked key
  • Vendor fallback uses team-scoped vendor keys, never the user's

What requires LAN trust

  • Worker tokens are stored plaintext in nodes.worker_token on the leader's SQLite
  • Anyone with read access to the leader's DB can impersonate a worker
  • HMAC-SHA256 mutual auth between leader and workers is shipped — signatures travel instead of tokens; set FLOCK_REJECT_BEARER=1 on workers to require it
  • For hostile networks run the cluster behind Tailscale or a zero-trust overlay
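To illustrate what "signatures travel instead of tokens" means, here is a generic HMAC-SHA256 request-signing sketch. The message layout (method, path, timestamp, body hash), the 30-second skew window, and the function names are all assumptions made for this example. Flock's actual wire format is not documented on this page.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_request(token: str, method: str, path: str, body: bytes, ts: int) -> str:
    """Sign request metadata with the shared worker token (hypothetical layout)."""
    body_hash = hashlib.sha256(body).hexdigest()
    message = f"{method}\n{path}\n{ts}\n{body_hash}".encode()
    return hmac.new(token.encode(), message, hashlib.sha256).hexdigest()


def verify_request(
    token: str,
    method: str,
    path: str,
    body: bytes,
    ts: int,
    signature: str,
    now: Optional[int] = None,
    max_skew: int = 30,
) -> bool:
    """Recompute the signature and reject stale timestamps (replay guard)."""
    current = int(time.time()) if now is None else now
    if abs(current - ts) > max_skew:
        return False
    expected = sign_request(token, method, path, body, ts)
    return hmac.compare_digest(expected, signature)
```

Only the timestamp and signature cross the network, so a passive observer on the LAN never sees the token itself. The plaintext-at-rest caveat in the list above still applies.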
$0

Free. Open source. No telemetry.

Flock is released under the Apache License 2.0. You can use it commercially, modify it, embed it in your own products, redistribute it. No "open core" gotchas. No "free for personal use only" clauses. No SaaS plan to upgrade to.

$0

No license fee. Ever.

The binary you download from GitHub Releases is the same binary a Fortune-500 would use. There is no Pro / Enterprise / Cloud tier hiding the features you actually want.

⚖️

Apache-2.0 — actually permissive

✓ Commercial use · ✓ Modification · ✓ Distribution · ✓ Patent grant included · ✓ Private use. The only requirements: keep the license + notice, state significant changes you made.

🔒

Your data stays yours

No phone-home. No analytics. No "anonymized" telemetry that's actually fingerprinted. The binary doesn't open outbound connections except to engines you configure (Ollama / vendor APIs you opt in to).

What it could save you

A team of 10 devs running modern AI tools heavily can burn $200–500 per dev per month in API tokens. That's $24–60k/year, scaling linearly with usage. Flock moves the 80% of "easy" calls to your own hardware, for free, and keeps the optional escape hatch to real Claude / GPT for the 20% that actually need it.

Rough monthly cost (10 devs · heavy use)
Claude API (Sonnet)~$3,000
OpenAI API (gpt-4o)~$2,500
OpenRouter / vendor proxy$2,500 + markup
Flock + your own hardware~$50 (electricity)

Hardware (~$16k for the team-of-10 build) pays back in ~5 months. Stack works for years after.
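The payback estimate can be checked with a few lines of arithmetic. This sketch uses only the rough figures quoted above. The local_share parameter is an assumption added here to show how the result moves when only part of the API spend is replaced.

```python
def payback_months(
    hardware_usd: float,
    api_usd_per_month: float,
    running_usd_per_month: float,
    local_share: float = 1.0,
) -> float:
    """Months until the hardware cost is covered by avoided API spend.

    local_share is the fraction of API spend moved to local hardware.
    """
    monthly_saving = api_usd_per_month * local_share - running_usd_per_month
    if monthly_saving <= 0:
        raise ValueError("no monthly saving, hardware never pays back")
    return hardware_usd / monthly_saving


if __name__ == "__main__":
    # $16k hardware, ~$3,000/month Claude API, ~$50/month electricity.
    print(round(payback_months(16_000, 3_000, 50), 1))
    # Same build if only 80% of calls move to local hardware.
    print(round(payback_months(16_000, 3_000, 50, local_share=0.8), 1))
```

With the table's numbers the first case comes out a little over five months, and moving only 80% of calls stretches it to just under seven.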

Roadmap

What's shipped & what's next

✓ Shipped

  • Core gateway
      • OpenAI + Anthropic API surface, streaming, tool use, vision (image_url), embeddings (/v1/embeddings)
      • /v1/rerank (Cohere shape, llama-server passthrough) + /v1/audio/transcriptions + /v1/audio/speech shells (whisper / piper endpoints)
      • Ollama / vLLM / MLX / llama.cpp drivers (single-node + RPC)
      • Multi-node routing with heartbeat + placements (LAN mesh)
      • Sharding auto-orchestration + shard crash auto-restart + auto-distribution of GGUF weights (sha256-verified)
      • 37-model catalog + license metadata + flock model add hf:/ollama:/file: + --from <my.yaml> for non-catalog installs
  • Multi-tenancy & auth
      • Per-user API keys, scopes (admin / user / node), TTL expiry (--ttl 7d, renew, expire)
      • Per-key model allowlist (literal + claude-* glob): 403 model_not_allowed + audit row
      • Per-key RPM + TPM rate limits (leaky-bucket) with reconciliation on actual usage
      • Daily token quotas + dollar budgets (day/week/month windows; multiple budgets compose with AND)
      • Per-call $ cost tracking (vendor pricing table + catalog override; cost_usd snapshotted on every usage row)
      • Standard X-RateLimit-* response headers + X-Flock-Request-Id correlation
  • Router intelligence
      • Failure-based catalog fallback + typed chains (fallback_on_context_length, fallback_on_content_policy) with error classification
      • Per-request overrides (flock.fallbacks, num_retries, retry_backoff_ms, hedge) via body or X-Flock-* headers
      • Sticky sessions (KV-cache locality on multi-turn chats) + placement cooldown (circuit breaker for flaky workers)
      • Request hedging: fire to top-N least-loaded workers, return whichever responds first
      • Latency-aware fallback (p95 trigger)
  • Hybrid local + cloud
      • Anthropic + OpenAI passthrough (claude-*, gpt-*)
      • 7 new vendor passthroughs: openrouter/, groq/, together/, fireworks/, cohere/, mistral/, perplexity/ (slash-prefix stripped before forwarding)
      • Bedrock SigV4 (anthropic.* family) + Vertex ADC probe
  • Observability & policy
      • Prometheus metrics, OTLP traces end-to-end across all four engine drivers, reference Grafana dashboards
      • Webhook + Langfuse callbacks: usage / audit fan-out with HMAC-signed payloads, bounded queues
      • Guardrails framework: pre-call webhook hook that can block / rewrite / flag (PII redaction, prompt-injection checks)
      • Response cache for embeddings (memory or SQLite-backed; canonical key; Cache-Control opt-out)
      • Time-bucketed usage breakdown (/admin/v1/usage/breakdown?group_by=user,model&bucket=day)
      • HMAC mutual auth for worker token (no plaintext on wire)
      • Typed engine_unreachable + guardrail_blocked + budget_exceeded errors with actionable hints
  • DX
      • Embedded web UI: live SSE event stream, modal-based confirm/prompt, audit filter, per-row "models / rates / budgets / expiry" editors, $ today KPI, breakdown panel
      • 19-client flock connect roster (golden-tested), interactive picker, shell completion, --json / --summary / --dry-run everywhere
      • Install one-liner + signed binaries + .deb / .rpm packages + 2-node smoke + nightly single-node e2e
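The shipped list above mentions leaky-bucket RPM/TPM limits "with reconciliation on actual usage". One way that idea can work is sketched below: a bucket that drains at a fixed rate, charges an estimated cost up front, and is corrected once the real token count is known. The class, its method names, and the capacity parameter are hypothetical. Flock's internal limiter is not shown on this page.

```python
from typing import Optional


class LeakyBucket:
    """Leaky bucket used as a meter: level drains at rate_per_min."""

    def __init__(self, rate_per_min: float, capacity: float) -> None:
        self.leak_per_sec = rate_per_min / 60.0
        self.capacity = capacity
        self.level = 0.0
        self.last: Optional[float] = None

    def _leak(self, now: float) -> None:
        # Drain whatever leaked out since the previous call.
        if self.last is not None:
            elapsed = max(0.0, now - self.last)
            self.level = max(0.0, self.level - elapsed * self.leak_per_sec)
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Admit the request if adding its cost would not overflow."""
        self._leak(now)
        if self.level + cost > self.capacity:
            return False
        self.level += cost
        return True

    def reconcile(self, estimated: float, actual: float) -> None:
        """Replace an up-front estimate with the real usage."""
        self.level = max(0.0, self.level + (actual - estimated))
```

For an RPM limit the cost is 1 per request. For a TPM limit the cost is a token estimate that reconcile later corrects, so an over-estimate does not keep blocking a key after the response turns out to be short.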

→ Next

  • Semantic cache: chromem-go embedded vector store, per-namespace threshold
  • OIDC login for the web UI (Google / GitHub / Okta)
  • Chat completion caching with streaming replay (today: embeddings only)
  • Post-call guardrails on streamed responses (today: pre-call only)
  • Vertex body translation (OpenAI / Anthropic → generateContent); ADC probe wired, translation queued
  • Bedrock streaming + non-Anthropic body shapes (amazon, meta, mistral)
  • Whisper / Piper engine drivers (today: endpoint proxies; auto-launch + catalog entries queued)
  • Router fallback callback events (today: usage + audit sinks ship; fallback is in the audit log only)
  • Tailscale (tsnet) mesh for NAT traversal + mTLS
  • LoRA adapter loading + live model migration
  • Postgres backend for HA control plane
  • AMD ROCm path · NAS .spk packages (Synology DSM)

Get started in 3 minutes

No signup. No SaaS. Just a binary.

$ curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh