ggml-org/llama.cpp
llama-cli
Active contributors: Xuan-Son Nguyen, Georgi Gerganov
llama-cli is the project's primary command-line interface for chatting with or completing text from a model. The source is one file: tools/cli/cli.cpp (~24 KB). Long-form usage docs live in tools/cli/README.md.
Purpose
- Load a model (local or
-hf <repo>). - Tokenize a prompt and run a generation loop.
- Support interactive ("conversation") mode with a chat template, or one-shot completion.
Modes
| Flag | Mode |
|---|---|
| (default) | One-shot prompt → completion |
-cnv / --conversation |
Interactive chat using the model's chat template |
-i / --interactive |
Free-form interactive completion (no chat template) |
-if |
Interactive first: enter interactive mode after the initial prompt |
--in-prefix / --in-suffix |
Wrap user inputs in interactive mode |
--logit-bias, --grammar, --grammar-file |
Constrain output |
--lora, --lora-scaled |
Apply LoRA adapters |
--mlock, --no-mmap, --n-gpu-layers, --tensor-split |
Memory/placement controls |
The full flag set is parsed by common/arg.cpp (which is shared across every tool, ensuring consistency).
How it works
sequenceDiagram
participant User
participant CLI as llama-cli
participant Common as common/
participant Llama as libllama
participant Sampler as sampler chain
User->>CLI: llama-cli -m model.gguf -p "..."
CLI->>Common: parse argv (common/arg.cpp)
CLI->>Common: download if -hf (download.cpp + hf-cache.cpp)
CLI->>Llama: llama_model_load_from_file
CLI->>Llama: llama_init_from_model
CLI->>Common: build sampler chain (common/sampling.cpp)
CLI->>Llama: tokenize prompt
loop generation
CLI->>Llama: llama_decode(batch)
Llama-->>CLI: logits
CLI->>Sampler: llama_sampler_sample
Sampler-->>CLI: token
CLI->>Llama: llama_sampler_accept(token)
CLI->>Common: console.cpp print piece
alt EOG token
CLI-->>User: stop
end
endKey abstractions used
| Symbol | From | Purpose |
|---|---|---|
common_params |
common/common.h |
Big struct holding every CLI option; populated by common_params_parse |
common_init_from_params |
common/common.cpp |
Convenience wrapper that constructs both llama_model and llama_context from a common_params |
common_sampler_init |
common/sampling.cpp |
Builds the canonical sampler chain |
common_chat_msg + common_chat_apply_template |
common/chat.cpp |
Render messages in conversation mode |
common/console.cpp |
common/ |
Cross-platform interactive console handling (raw mode, color, prompt drawing) |
Conversation mode internals
In -cnv mode, cli.cpp keeps a common_chat_msg history and re-renders the prompt on each turn via common/chat.cpp. The model output is streamed back to the screen via common/console.cpp while the autoparser (when enabled) extracts tool calls. See Chat templates.
Integration points
- Common helpers. Almost every flag and shared behavior comes from
common/. - Sampler. Built via
common/sampling.cppfrom CLI flags. - Chat templating. Built-in plus Jinja paths via
src/llama-chat.cppandcommon/chat.cpp. - Speculative decoding. Disabled in
llama-cli;examples/speculative*demonstrates it.
Entry points for modification
- New CLI flag. Add it to
common_paramsand the parser incommon/arg.cppso every tool gains it; or add it locally incli.cppif it only makes sense for the CLI. - New interactive command (e.g. a
/<verb>slash command in conversation mode). Edit the input loop incli.cpp. - New default sampler order. Edit
common/sampling.cpp;cli.cppdoesn't hard-code the chain.
For exhaustive flag reference see tools/cli/README.md.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
Previous
Tools
Next
llama-server