ggml-org/llama.cpp
llama-server
Active contributors: Georgi Gerganov, Xuan-Son Nguyen, allozaur, angt, ServeurpersoCom
llama-server is the OpenAI-compatible HTTP server and the heaviest single tool in the tree. It serves multiple concurrent requests against one in-process model using a slot-based scheduler, and ships an in-browser WebUI under tools/server/webui/.
User-facing docs are exhaustively documented in tools/server/README.md and tools/server/README-dev.md; this page describes the implementation.
Directory layout
tools/server/
├── server.cpp # main(), wires HTTP layer to scheduler
├── server-http.cpp / .h # HTTP layer (vendored cpp-httplib)
├── server-context.cpp / .h # slot-based scheduler, the real work
├── server-task.cpp / .h # per-request task representation
├── server-queue.cpp / .h # producer-consumer queue feeding the scheduler
├── server-models.cpp / .h # multi-model loading + routing
├── server-tools.cpp / .h # OpenAI tool-calling glue
├── server-chat.cpp / .h # chat-template orchestration
├── server-common.cpp / .h # logging, params, response helpers
├── server-cors-proxy.h # tiny CORS handler
├── public/ # static assets served at /
├── webui/ # SPA source (npm project, bundled into the binary)
├── bench/ # k6-based load tests
└── tests/ # pytest end-to-end testsThe server-context.cpp file alone is ~178 KB — it holds the slot loop that pulls tasks from the queue, decodes them in batched form, applies sampling per slot, and streams responses back.
Endpoints
The full surface is in tools/server/README.md. Key categories:
| Category | Endpoints |
|---|---|
| OpenAI-compatible | POST /v1/chat/completions, POST /v1/completions, POST /v1/embeddings, POST /v1/rerank |
| Native | POST /completion, POST /tokenize, POST /detokenize, POST /infill, GET /health, GET /metrics, GET /props, POST /props, GET /slots, POST /slots/{id} (save/load), POST /apply-template, POST /chat/format |
| Static | GET / → WebUI, GET /<asset> → bundled JS/CSS |
The HTTP layer is a thin wrapper over cpp-httplib (vendor/) plus a small CORS proxy in server-cors-proxy.h.
Slot scheduler
The server's defining concept is the slot. A slot is a per-request execution unit that owns:
- A
seq_idin the underlying KV cache. - The user's prompt + ongoing generation tokens.
- A per-slot sampler chain.
- Optional per-slot LoRA scaling.
- A streaming response channel.
The scheduler in server-context.cpp repeatedly:
- Pulls new tasks from
server-queue. - Assigns each task to a free slot (or queues it if all slots are busy).
- Builds a unified
llama_batchcontaining tokens from every active slot. - Calls
llama_decodeon the unified batch. - Per slot: samples a token, streams it back to the client, checks stop conditions.
- Releases finished slots.
graph TD
HTTP[HTTP request] --> Queue[server-queue]
Queue --> Sched[server-context scheduler]
Sched --> Slot1[slot 0]
Sched --> Slot2[slot 1]
Sched --> SlotN[slot N-1]
Slot1 --> Batch[unified llama_batch]
Slot2 --> Batch
SlotN --> Batch
Batch --> Llama[llama_decode]
Llama --> Sample[per-slot sampler]
Sample --> Stream[SSE / JSON to client]Important runtime knobs:
--parallel N(or-np) — number of slots, i.e. concurrent requests.--cont-batching— interleave decoding of slots that started at different times.--cache-reuse— try to reuse a slot's cached prefix when a new request arrives with overlapping prompt.--slot-prompt-similarity— threshold for the prefix matcher.--slot-save-path— directory wherePOST /slots/{id}?action=savewrites per-slot KV state.
Tool / function calling
server-tools.cpp orchestrates the tool-calling flow. The high-level shape:
- Render the prompt with the user's
toolsarray using the model's chat template (server-chat.cpp→common/chat.cpp). - Optionally compile a grammar that constrains output to the tool-call format (
common/json-schema-to-grammar.cppor llguidance). - Stream tokens back. The autoparser (
common/chat-auto-parser*.cpp) extractstool_callsdeltas as they appear and surfaces them in the OpenAI-compatible response shape.
See Chat templates and docs/function-calling.md.
Multi-model loading
server-models.cpp allows a single server to serve multiple GGUFs simultaneously, routing requests to the right model based on the model field in OpenAI-style requests. Each loaded model gets its own slot pool.
Embeddings & reranking
When a model has an embedding pooling type, --embeddings exposes /v1/embeddings. Rerankers (cross-encoders) are exposed via /v1/rerank.
WebUI
tools/server/webui/ is a separate JavaScript project (Vite + Svelte/React, depending on the era — see its package.json). It is built ahead of time and bundled into the C++ binary. The maintainers responsible are @ggml-org/llama-webui. Source includes:
- A chat UI with conversations, branching, message editing.
- Settings, presets, and chat-template inspector.
- Integration with the
/slotsand/propsendpoints.
Observability
GET /metricsreturns Prometheus-style counters.GET /slotsis a live snapshot of slot state.--log-file pathand--log-disablecontrol logs.GET /healthreturns 200 once the model is fully loaded.
Integration points
libllama— single in-process model + context per loaded model.common/— argument parsing, sampling, chat templating, downloads.vendor/cpp-httplib— single-header HTTP server.tools/server/webui/— bundled SPA; HTTP layer serves it at/.tools/server/tests/— pytest integration tests;tools/server/bench/has k6 load tests.
Entry points for modification
- New endpoint. Add a route in
server-http.cpp, a handler inserver-context.cpp, and a test undertools/server/tests/. - New per-request param. Add it to the request struct in
server-task.cpp, parse inserver-context.cpp, plumb to the sampler. - WebUI changes. Edit
tools/server/webui/, rebuild, commit the bundled artifact. - Slot scheduling tweak.
server-context.cppis the single source of truth.
For end-to-end testing instructions, see Testing.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
Previous
llama-cli
Next
llama-quantize