Factory.ai

Open-Source Wikis

/

llama.cpp

/

Tools

/

llama-cli

ggml-org/llama.cpp

llama-cli

Active contributors: Xuan-Son Nguyen, Georgi Gerganov

llama-cli is the project's primary command-line interface for chatting with or completing text from a model. The source is one file: tools/cli/cli.cpp (~24 KB). Long-form usage docs live in tools/cli/README.md.

Purpose

  • Load a model (local or -hf <repo>).
  • Tokenize a prompt and run a generation loop.
  • Support interactive ("conversation") mode with a chat template, or one-shot completion.

Modes

Flag Mode
(default) One-shot prompt → completion
-cnv / --conversation Interactive chat using the model's chat template
-i / --interactive Free-form interactive completion (no chat template)
-if Interactive first: enter interactive mode after the initial prompt
--in-prefix / --in-suffix Wrap user inputs in interactive mode
--logit-bias, --grammar, --grammar-file Constrain output
--lora, --lora-scaled Apply LoRA adapters
--mlock, --no-mmap, --n-gpu-layers, --tensor-split Memory/placement controls

The full flag set is parsed by common/arg.cpp (which is shared across every tool, ensuring consistency).

How it works

sequenceDiagram
    participant User
    participant CLI as llama-cli
    participant Common as common/
    participant Llama as libllama
    participant Sampler as sampler chain

    User->>CLI: llama-cli -m model.gguf -p "..."
    CLI->>Common: parse argv (common/arg.cpp)
    CLI->>Common: download if -hf (download.cpp + hf-cache.cpp)
    CLI->>Llama: llama_model_load_from_file
    CLI->>Llama: llama_init_from_model
    CLI->>Common: build sampler chain (common/sampling.cpp)
    CLI->>Llama: tokenize prompt
    loop generation
        CLI->>Llama: llama_decode(batch)
        Llama-->>CLI: logits
        CLI->>Sampler: llama_sampler_sample
        Sampler-->>CLI: token
        CLI->>Llama: llama_sampler_accept(token)
        CLI->>Common: console.cpp print piece
        alt EOG token
            CLI-->>User: stop
        end
    end

Key abstractions used

Symbol From Purpose
common_params common/common.h Big struct holding every CLI option; populated by common_params_parse
common_init_from_params common/common.cpp Convenience wrapper that constructs both llama_model and llama_context from a common_params
common_sampler_init common/sampling.cpp Builds the canonical sampler chain
common_chat_msg + common_chat_apply_template common/chat.cpp Render messages in conversation mode
common/console.cpp common/ Cross-platform interactive console handling (raw mode, color, prompt drawing)

Conversation mode internals

In -cnv mode, cli.cpp keeps a common_chat_msg history and re-renders the prompt on each turn via common/chat.cpp. The model output is streamed back to the screen via common/console.cpp while the autoparser (when enabled) extracts tool calls. See Chat templates.

Integration points

  • Common helpers. Almost every flag and shared behavior comes from common/.
  • Sampler. Built via common/sampling.cpp from CLI flags.
  • Chat templating. Built-in plus Jinja paths via src/llama-chat.cpp and common/chat.cpp.
  • Speculative decoding. Disabled in llama-cli; examples/speculative* demonstrates it.

Entry points for modification

  • New CLI flag. Add it to common_params and the parser in common/arg.cpp so every tool gains it; or add it locally in cli.cpp if it only makes sense for the CLI.
  • New interactive command (e.g. a /<verb> slash command in conversation mode). Edit the input loop in cli.cpp.
  • New default sampler order. Edit common/sampling.cpp; cli.cpp doesn't hard-code the chain.

For exhaustive flag reference see tools/cli/README.md.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.

llama-cli – llama.cpp wiki | Factory