ggml-org/llama.cpp

llama-server

Active contributors: Georgi Gerganov, Xuan-Son Nguyen, allozaur, angt, ServeurpersoCom

llama-server is the OpenAI-compatible HTTP server and the heaviest single tool in the tree. It serves multiple concurrent requests against one in-process model using a slot-based scheduler, and ships an in-browser WebUI under tools/server/webui/.

User-facing docs are exhaustively documented in tools/server/README.md and tools/server/README-dev.md; this page describes the implementation.

Directory layout

tools/server/
├── server.cpp              # main(), wires HTTP layer to scheduler
├── server-http.cpp / .h    # HTTP layer (vendored cpp-httplib)
├── server-context.cpp / .h # slot-based scheduler, the real work
├── server-task.cpp / .h    # per-request task representation
├── server-queue.cpp / .h   # producer-consumer queue feeding the scheduler
├── server-models.cpp / .h  # multi-model loading + routing
├── server-tools.cpp / .h   # OpenAI tool-calling glue
├── server-chat.cpp / .h    # chat-template orchestration
├── server-common.cpp / .h  # logging, params, response helpers
├── server-cors-proxy.h     # tiny CORS handler
├── public/                 # static assets served at /
├── webui/                  # SPA source (npm project, bundled into the binary)
├── bench/                  # k6-based load tests
└── tests/                  # pytest end-to-end tests

The server-context.cpp file alone is ~178 KB — it holds the slot loop that pulls tasks from the queue, decodes them in batched form, applies sampling per slot, and streams responses back.

Endpoints

The full surface is in tools/server/README.md. Key categories:

Category	Endpoints
OpenAI-compatible	`POST /v1/chat/completions`, `POST /v1/completions`, `POST /v1/embeddings`, `POST /v1/rerank`
Native	`POST /completion`, `POST /tokenize`, `POST /detokenize`, `POST /infill`, `GET /health`, `GET /metrics`, `GET /props`, `POST /props`, `GET /slots`, `POST /slots/{id}` (save/load), `POST /apply-template`, `POST /chat/format`
Static	`GET /` → WebUI, `GET /<asset>` → bundled JS/CSS

The HTTP layer is a thin wrapper over cpp-httplib (vendor/) plus a small CORS proxy in server-cors-proxy.h.

Slot scheduler

The server's defining concept is the slot. A slot is a per-request execution unit that owns:

A seq_id in the underlying KV cache.
The user's prompt + ongoing generation tokens.
A per-slot sampler chain.
Optional per-slot LoRA scaling.
A streaming response channel.

The scheduler in server-context.cpp repeatedly:

Pulls new tasks from server-queue.
Assigns each task to a free slot (or queues it if all slots are busy).
Builds a unified llama_batch containing tokens from every active slot.
Calls llama_decode on the unified batch.
Per slot: samples a token, streams it back to the client, checks stop conditions.
Releases finished slots.

graph TD
    HTTP[HTTP request] --> Queue[server-queue]
    Queue --> Sched[server-context scheduler]
    Sched --> Slot1[slot 0]
    Sched --> Slot2[slot 1]
    Sched --> SlotN[slot N-1]
    Slot1 --> Batch[unified llama_batch]
    Slot2 --> Batch
    SlotN --> Batch
    Batch --> Llama[llama_decode]
    Llama --> Sample[per-slot sampler]
    Sample --> Stream[SSE / JSON to client]

Important runtime knobs:

--parallel N (or -np) — number of slots, i.e. concurrent requests.
--cont-batching — interleave decoding of slots that started at different times.
--cache-reuse — try to reuse a slot's cached prefix when a new request arrives with overlapping prompt.
--slot-prompt-similarity — threshold for the prefix matcher.
--slot-save-path — directory where POST /slots/{id}?action=save writes per-slot KV state.

Tool / function calling

server-tools.cpp orchestrates the tool-calling flow. The high-level shape:

Render the prompt with the user's tools array using the model's chat template (server-chat.cpp → common/chat.cpp).
Optionally compile a grammar that constrains output to the tool-call format (common/json-schema-to-grammar.cpp or llguidance).
Stream tokens back. The autoparser (common/chat-auto-parser*.cpp) extracts tool_calls deltas as they appear and surfaces them in the OpenAI-compatible response shape.

See Chat templates and docs/function-calling.md.

Multi-model loading

server-models.cpp allows a single server to serve multiple GGUFs simultaneously, routing requests to the right model based on the model field in OpenAI-style requests. Each loaded model gets its own slot pool.

Embeddings & reranking

When a model has an embedding pooling type, --embeddings exposes /v1/embeddings. Rerankers (cross-encoders) are exposed via /v1/rerank.

WebUI

tools/server/webui/ is a separate JavaScript project (Vite + Svelte/React, depending on the era — see its package.json). It is built ahead of time and bundled into the C++ binary. The maintainers responsible are @ggml-org/llama-webui. Source includes:

A chat UI with conversations, branching, message editing.
Settings, presets, and chat-template inspector.
Integration with the /slots and /props endpoints.

Observability

GET /metrics returns Prometheus-style counters.
GET /slots is a live snapshot of slot state.
--log-file path and --log-disable control logs.
GET /health returns 200 once the model is fully loaded.

Integration points

libllama — single in-process model + context per loaded model.
common/ — argument parsing, sampling, chat templating, downloads.
vendor/cpp-httplib — single-header HTTP server.
tools/server/webui/ — bundled SPA; HTTP layer serves it at /.
tools/server/tests/ — pytest integration tests; tools/server/bench/ has k6 load tests.

Entry points for modification

New endpoint. Add a route in server-http.cpp, a handler in server-context.cpp, and a test under tools/server/tests/.
New per-request param. Add it to the request struct in server-task.cpp, parse in server-context.cpp, plumb to the sampler.
WebUI changes. Edit tools/server/webui/, rebuild, commit the bundled artifact.
Slot scheduling tweak. server-context.cpp is the single source of truth.

For end-to-end testing instructions, see Testing.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.