ggml-org/llama.cpp

Getting started

This page walks through building llama.cpp from source, downloading a model, and running inference. The canonical, exhaustive build instructions live in docs/build.md; this page is a working subset.

Prerequisites

Tool	Minimum	Notes
C++ compiler	C++17	gcc, clang, MSVC, or Apple clang
CMake	3.14	`CMakePresets.json` requires a recent CMake
Git	any	Submodules are vendored, but git is needed for `build-info`
Python	3.9+	Only required for `convert_hf_to_gguf.py` and the `gguf-py/` package; managed by `requirements/`

Optional accelerator toolchains are documented in docs/build.md and docs/backend/:

CUDA Toolkit (NVIDIA) — see docs/build.md#cuda
ROCm / HIP (AMD) — see docs/build.md#hip
Metal (macOS) — auto-detected on Apple Silicon
Vulkan SDK — see docs/build.md#vulkan
oneAPI / SYCL (Intel) — see docs/backend/SYCL.md

Build

The simplest CPU build:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

Binaries land in build/bin/. Common ones include llama-cli, llama-server, llama-quantize, llama-bench, llama-imatrix, llama-perplexity, llama-mtmd-cli, and llama-gguf-split.

To enable a GPU backend, pass the corresponding CMake variable. Example for CUDA:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

Other common flags (full list in ggml/CMakeLists.txt):

Flag	Effect
`-DGGML_METAL=ON`	Apple Metal backend (default on Apple platforms)
`-DGGML_VULKAN=ON`	Vulkan backend
`-DGGML_SYCL=ON`	SYCL / Intel oneAPI backend
`-DGGML_HIP=ON`	AMD HIP/ROCm backend
`-DGGML_OPENCL=ON`	OpenCL backend
`-DGGML_RPC=ON`	RPC client/server backend
`-DGGML_BLAS=ON`	Use a system BLAS for prompt processing
`-DLLAMA_CURL=ON`	Enable the `-hf` HuggingFace download path (requires libcurl)
`-DLLAMA_BUILD_SERVER=OFF`	Skip building `llama-server`
`-DBUILD_SHARED_LIBS=ON`	Build shared `libllama.so` / `libggml.so`

CMakePresets.json defines a few canned presets (make list-presets). The Makefile in the repo root forwards to CMake for convenience.

Get a model

llama.cpp consumes GGUF files. There are three common ways to obtain one:

Download a pre-converted GGUF directly from HuggingFace, e.g. ggml-org/gemma-3-1b-it-GGUF.
Pass -hf <repo> on any CLI, which uses common/download.cpp + common/hf-cache.cpp to fetch and cache the file in the standard HuggingFace cache.
Convert your own checkpoint with convert_hf_to_gguf.py. See Conversion and docs/development/HOWTO-add-model.md.

Run inference

Chat with a model on the command line:

# from the repo root, after building into ./build
./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

Run an OpenAI-compatible HTTP server on :8080:

./build/bin/llama-server -hf ggml-org/gemma-3-1b-it-GGUF

The server exposes /v1/chat/completions, /v1/completions, /v1/embeddings, plus a built-in WebUI at http://localhost:8080. See Server tool.

Quantize a model to a smaller format:

./build/bin/llama-quantize my_model_f16.gguf my_model_q4_k_m.gguf Q4_K_M

See docs/build.md, docs/install.md, and docs/docker.md for installer-based and container-based alternatives.

Run the tests

cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release -j
ctest --test-dir build --output-on-failure

The ctest suite covers tokenizer round-trips, GGUF I/O, sampling, the chat parser, the PEG parser, grammar handling, and a backend-ops conformance test (tests/test-backend-ops.cpp) that compares each backend's kernel results against the CPU baseline. See Testing and ci/README.md for the longer self-hosted CI flow.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.