ggml-org/llama.cpp
Getting started
This page walks through building llama.cpp from source, downloading a model, and running inference. The canonical, exhaustive build instructions live in docs/build.md; this page is a working subset.
Prerequisites
| Tool | Minimum | Notes |
|---|---|---|
| C++ compiler | C++17 | gcc, clang, MSVC, or Apple clang |
| CMake | 3.14 | CMakePresets.json requires a recent CMake |
| Git | any | Submodules are vendored, but git is needed for build-info |
| Python | 3.9+ | Only required for convert_hf_to_gguf.py and the gguf-py/ package; managed by requirements/ |
Optional accelerator toolchains are documented in docs/build.md and docs/backend/:
- CUDA Toolkit (NVIDIA) — see
docs/build.md#cuda - ROCm / HIP (AMD) — see
docs/build.md#hip - Metal (macOS) — auto-detected on Apple Silicon
- Vulkan SDK — see
docs/build.md#vulkan - oneAPI / SYCL (Intel) — see
docs/backend/SYCL.md
Build
The simplest CPU build:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -jBinaries land in build/bin/. Common ones include llama-cli, llama-server, llama-quantize, llama-bench, llama-imatrix, llama-perplexity, llama-mtmd-cli, and llama-gguf-split.
To enable a GPU backend, pass the corresponding CMake variable. Example for CUDA:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -jOther common flags (full list in ggml/CMakeLists.txt):
| Flag | Effect |
|---|---|
-DGGML_METAL=ON |
Apple Metal backend (default on Apple platforms) |
-DGGML_VULKAN=ON |
Vulkan backend |
-DGGML_SYCL=ON |
SYCL / Intel oneAPI backend |
-DGGML_HIP=ON |
AMD HIP/ROCm backend |
-DGGML_OPENCL=ON |
OpenCL backend |
-DGGML_RPC=ON |
RPC client/server backend |
-DGGML_BLAS=ON |
Use a system BLAS for prompt processing |
-DLLAMA_CURL=ON |
Enable the -hf HuggingFace download path (requires libcurl) |
-DLLAMA_BUILD_SERVER=OFF |
Skip building llama-server |
-DBUILD_SHARED_LIBS=ON |
Build shared libllama.so / libggml.so |
CMakePresets.json defines a few canned presets (make list-presets). The Makefile in the repo root forwards to CMake for convenience.
Get a model
llama.cpp consumes GGUF files. There are three common ways to obtain one:
- Download a pre-converted GGUF directly from HuggingFace, e.g.
ggml-org/gemma-3-1b-it-GGUF. - Pass
-hf <repo>on any CLI, which usescommon/download.cpp+common/hf-cache.cppto fetch and cache the file in the standard HuggingFace cache. - Convert your own checkpoint with
convert_hf_to_gguf.py. See Conversion anddocs/development/HOWTO-add-model.md.
Run inference
Chat with a model on the command line:
# from the repo root, after building into ./build
./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUFRun an OpenAI-compatible HTTP server on :8080:
./build/bin/llama-server -hf ggml-org/gemma-3-1b-it-GGUFThe server exposes /v1/chat/completions, /v1/completions, /v1/embeddings, plus a built-in WebUI at http://localhost:8080. See Server tool.
Quantize a model to a smaller format:
./build/bin/llama-quantize my_model_f16.gguf my_model_q4_k_m.gguf Q4_K_MSee docs/build.md, docs/install.md, and docs/docker.md for installer-based and container-based alternatives.
Run the tests
cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release -j
ctest --test-dir build --output-on-failureThe ctest suite covers tokenizer round-trips, GGUF I/O, sampling, the chat parser, the PEG parser, grammar handling, and a backend-ops conformance test (tests/test-backend-ops.cpp) that compares each backend's kernel results against the CPU baseline. See Testing and ci/README.md for the longer self-hosted CI flow.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
Previous
Architecture
Next
Glossary