Factory.ai

Open-Source Wikis

/

llama.cpp

/

Tools

/

llama-server

ggml-org/llama.cpp

llama-server

Active contributors: Georgi Gerganov, Xuan-Son Nguyen, allozaur, angt, ServeurpersoCom

llama-server is the OpenAI-compatible HTTP server and the heaviest single tool in the tree. It serves multiple concurrent requests against one in-process model using a slot-based scheduler, and ships an in-browser WebUI under tools/server/webui/.

User-facing docs are exhaustively documented in tools/server/README.md and tools/server/README-dev.md; this page describes the implementation.

Directory layout

tools/server/
├── server.cpp              # main(), wires HTTP layer to scheduler
├── server-http.cpp / .h    # HTTP layer (vendored cpp-httplib)
├── server-context.cpp / .h # slot-based scheduler, the real work
├── server-task.cpp / .h    # per-request task representation
├── server-queue.cpp / .h   # producer-consumer queue feeding the scheduler
├── server-models.cpp / .h  # multi-model loading + routing
├── server-tools.cpp / .h   # OpenAI tool-calling glue
├── server-chat.cpp / .h    # chat-template orchestration
├── server-common.cpp / .h  # logging, params, response helpers
├── server-cors-proxy.h     # tiny CORS handler
├── public/                 # static assets served at /
├── webui/                  # SPA source (npm project, bundled into the binary)
├── bench/                  # k6-based load tests
└── tests/                  # pytest end-to-end tests

The server-context.cpp file alone is ~178 KB — it holds the slot loop that pulls tasks from the queue, decodes them in batched form, applies sampling per slot, and streams responses back.

Endpoints

The full surface is in tools/server/README.md. Key categories:

Category Endpoints
OpenAI-compatible POST /v1/chat/completions, POST /v1/completions, POST /v1/embeddings, POST /v1/rerank
Native POST /completion, POST /tokenize, POST /detokenize, POST /infill, GET /health, GET /metrics, GET /props, POST /props, GET /slots, POST /slots/{id} (save/load), POST /apply-template, POST /chat/format
Static GET / → WebUI, GET /<asset> → bundled JS/CSS

The HTTP layer is a thin wrapper over cpp-httplib (vendor/) plus a small CORS proxy in server-cors-proxy.h.

Slot scheduler

The server's defining concept is the slot. A slot is a per-request execution unit that owns:

  • A seq_id in the underlying KV cache.
  • The user's prompt + ongoing generation tokens.
  • A per-slot sampler chain.
  • Optional per-slot LoRA scaling.
  • A streaming response channel.

The scheduler in server-context.cpp repeatedly:

  1. Pulls new tasks from server-queue.
  2. Assigns each task to a free slot (or queues it if all slots are busy).
  3. Builds a unified llama_batch containing tokens from every active slot.
  4. Calls llama_decode on the unified batch.
  5. Per slot: samples a token, streams it back to the client, checks stop conditions.
  6. Releases finished slots.
graph TD
    HTTP[HTTP request] --> Queue[server-queue]
    Queue --> Sched[server-context scheduler]
    Sched --> Slot1[slot 0]
    Sched --> Slot2[slot 1]
    Sched --> SlotN[slot N-1]
    Slot1 --> Batch[unified llama_batch]
    Slot2 --> Batch
    SlotN --> Batch
    Batch --> Llama[llama_decode]
    Llama --> Sample[per-slot sampler]
    Sample --> Stream[SSE / JSON to client]

Important runtime knobs:

  • --parallel N (or -np) — number of slots, i.e. concurrent requests.
  • --cont-batching — interleave decoding of slots that started at different times.
  • --cache-reuse — try to reuse a slot's cached prefix when a new request arrives with overlapping prompt.
  • --slot-prompt-similarity — threshold for the prefix matcher.
  • --slot-save-path — directory where POST /slots/{id}?action=save writes per-slot KV state.

Tool / function calling

server-tools.cpp orchestrates the tool-calling flow. The high-level shape:

  1. Render the prompt with the user's tools array using the model's chat template (server-chat.cppcommon/chat.cpp).
  2. Optionally compile a grammar that constrains output to the tool-call format (common/json-schema-to-grammar.cpp or llguidance).
  3. Stream tokens back. The autoparser (common/chat-auto-parser*.cpp) extracts tool_calls deltas as they appear and surfaces them in the OpenAI-compatible response shape.

See Chat templates and docs/function-calling.md.

Multi-model loading

server-models.cpp allows a single server to serve multiple GGUFs simultaneously, routing requests to the right model based on the model field in OpenAI-style requests. Each loaded model gets its own slot pool.

Embeddings & reranking

When a model has an embedding pooling type, --embeddings exposes /v1/embeddings. Rerankers (cross-encoders) are exposed via /v1/rerank.

WebUI

tools/server/webui/ is a separate JavaScript project (Vite + Svelte/React, depending on the era — see its package.json). It is built ahead of time and bundled into the C++ binary. The maintainers responsible are @ggml-org/llama-webui. Source includes:

  • A chat UI with conversations, branching, message editing.
  • Settings, presets, and chat-template inspector.
  • Integration with the /slots and /props endpoints.

Observability

  • GET /metrics returns Prometheus-style counters.
  • GET /slots is a live snapshot of slot state.
  • --log-file path and --log-disable control logs.
  • GET /health returns 200 once the model is fully loaded.

Integration points

  • libllama — single in-process model + context per loaded model.
  • common/ — argument parsing, sampling, chat templating, downloads.
  • vendor/cpp-httplib — single-header HTTP server.
  • tools/server/webui/ — bundled SPA; HTTP layer serves it at /.
  • tools/server/tests/ — pytest integration tests; tools/server/bench/ has k6 load tests.

Entry points for modification

  • New endpoint. Add a route in server-http.cpp, a handler in server-context.cpp, and a test under tools/server/tests/.
  • New per-request param. Add it to the request struct in server-task.cpp, parse in server-context.cpp, plumb to the sampler.
  • WebUI changes. Edit tools/server/webui/, rebuild, commit the bundled artifact.
  • Slot scheduling tweak. server-context.cpp is the single source of truth.

For end-to-end testing instructions, see Testing.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.

llama-server – llama.cpp wiki | Factory