ggml-org/llama.cpp

llama-cli

Active contributors: Xuan-Son Nguyen, Georgi Gerganov

llama-cli is the project's primary command-line interface for chatting with or completing text from a model. The source is one file: tools/cli/cli.cpp (~24 KB). Long-form usage docs live in tools/cli/README.md.

Purpose

Load a model (local or -hf <repo>).
Tokenize a prompt and run a generation loop.
Support interactive ("conversation") mode with a chat template, or one-shot completion.

Modes

Flag	Mode
(default)	One-shot prompt → completion
`-cnv` / `--conversation`	Interactive chat using the model's chat template
`-i` / `--interactive`	Free-form interactive completion (no chat template)
`-if`	Interactive first: enter interactive mode after the initial prompt
`--in-prefix` / `--in-suffix`	Wrap user inputs in interactive mode
`--logit-bias`, `--grammar`, `--grammar-file`	Constrain output
`--lora`, `--lora-scaled`	Apply LoRA adapters
`--mlock`, `--no-mmap`, `--n-gpu-layers`, `--tensor-split`	Memory/placement controls

The full flag set is parsed by common/arg.cpp (which is shared across every tool, ensuring consistency).

How it works

sequenceDiagram
    participant User
    participant CLI as llama-cli
    participant Common as common/
    participant Llama as libllama
    participant Sampler as sampler chain

    User->>CLI: llama-cli -m model.gguf -p "..."
    CLI->>Common: parse argv (common/arg.cpp)
    CLI->>Common: download if -hf (download.cpp + hf-cache.cpp)
    CLI->>Llama: llama_model_load_from_file
    CLI->>Llama: llama_init_from_model
    CLI->>Common: build sampler chain (common/sampling.cpp)
    CLI->>Llama: tokenize prompt
    loop generation
        CLI->>Llama: llama_decode(batch)
        Llama-->>CLI: logits
        CLI->>Sampler: llama_sampler_sample
        Sampler-->>CLI: token
        CLI->>Llama: llama_sampler_accept(token)
        CLI->>Common: console.cpp print piece
        alt EOG token
            CLI-->>User: stop
        end
    end

Key abstractions used

Symbol	From	Purpose
`common_params`	`common/common.h`	Big struct holding every CLI option; populated by `common_params_parse`
`common_init_from_params`	`common/common.cpp`	Convenience wrapper that constructs both `llama_model` and `llama_context` from a `common_params`
`common_sampler_init`	`common/sampling.cpp`	Builds the canonical sampler chain
`common_chat_msg` + `common_chat_apply_template`	`common/chat.cpp`	Render messages in conversation mode
`common/console.cpp`	`common/`	Cross-platform interactive console handling (raw mode, color, prompt drawing)

Conversation mode internals

In -cnv mode, cli.cpp keeps a common_chat_msg history and re-renders the prompt on each turn via common/chat.cpp. The model output is streamed back to the screen via common/console.cpp while the autoparser (when enabled) extracts tool calls. See Chat templates.

Integration points

Common helpers. Almost every flag and shared behavior comes from common/.
Sampler. Built via common/sampling.cpp from CLI flags.
Chat templating. Built-in plus Jinja paths via src/llama-chat.cpp and common/chat.cpp.
Speculative decoding. Disabled in llama-cli; examples/speculative* demonstrates it.

Entry points for modification

New CLI flag. Add it to common_params and the parser in common/arg.cpp so every tool gains it; or add it locally in cli.cpp if it only makes sense for the CLI.
New interactive command (e.g. a /<verb> slash command in conversation mode). Edit the input loop in cli.cpp.
New default sampler order. Edit common/sampling.cpp; cli.cpp doesn't hard-code the chain.

For exhaustive flag reference see tools/cli/README.md.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.