Factory.ai

Open-Source Wikis

/

llama.cpp

/

llama.cpp

/

Getting started

ggml-org/llama.cpp

Getting started

This page walks through building llama.cpp from source, downloading a model, and running inference. The canonical, exhaustive build instructions live in docs/build.md; this page is a working subset.

Prerequisites

Tool Minimum Notes
C++ compiler C++17 gcc, clang, MSVC, or Apple clang
CMake 3.14 CMakePresets.json requires a recent CMake
Git any Submodules are vendored, but git is needed for build-info
Python 3.9+ Only required for convert_hf_to_gguf.py and the gguf-py/ package; managed by requirements/

Optional accelerator toolchains are documented in docs/build.md and docs/backend/:

  • CUDA Toolkit (NVIDIA) — see docs/build.md#cuda
  • ROCm / HIP (AMD) — see docs/build.md#hip
  • Metal (macOS) — auto-detected on Apple Silicon
  • Vulkan SDK — see docs/build.md#vulkan
  • oneAPI / SYCL (Intel) — see docs/backend/SYCL.md

Build

The simplest CPU build:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

Binaries land in build/bin/. Common ones include llama-cli, llama-server, llama-quantize, llama-bench, llama-imatrix, llama-perplexity, llama-mtmd-cli, and llama-gguf-split.

To enable a GPU backend, pass the corresponding CMake variable. Example for CUDA:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

Other common flags (full list in ggml/CMakeLists.txt):

Flag Effect
-DGGML_METAL=ON Apple Metal backend (default on Apple platforms)
-DGGML_VULKAN=ON Vulkan backend
-DGGML_SYCL=ON SYCL / Intel oneAPI backend
-DGGML_HIP=ON AMD HIP/ROCm backend
-DGGML_OPENCL=ON OpenCL backend
-DGGML_RPC=ON RPC client/server backend
-DGGML_BLAS=ON Use a system BLAS for prompt processing
-DLLAMA_CURL=ON Enable the -hf HuggingFace download path (requires libcurl)
-DLLAMA_BUILD_SERVER=OFF Skip building llama-server
-DBUILD_SHARED_LIBS=ON Build shared libllama.so / libggml.so

CMakePresets.json defines a few canned presets (make list-presets). The Makefile in the repo root forwards to CMake for convenience.

Get a model

llama.cpp consumes GGUF files. There are three common ways to obtain one:

  1. Download a pre-converted GGUF directly from HuggingFace, e.g. ggml-org/gemma-3-1b-it-GGUF.
  2. Pass -hf <repo> on any CLI, which uses common/download.cpp + common/hf-cache.cpp to fetch and cache the file in the standard HuggingFace cache.
  3. Convert your own checkpoint with convert_hf_to_gguf.py. See Conversion and docs/development/HOWTO-add-model.md.

Run inference

Chat with a model on the command line:

# from the repo root, after building into ./build
./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

Run an OpenAI-compatible HTTP server on :8080:

./build/bin/llama-server -hf ggml-org/gemma-3-1b-it-GGUF

The server exposes /v1/chat/completions, /v1/completions, /v1/embeddings, plus a built-in WebUI at http://localhost:8080. See Server tool.

Quantize a model to a smaller format:

./build/bin/llama-quantize my_model_f16.gguf my_model_q4_k_m.gguf Q4_K_M

See docs/build.md, docs/install.md, and docs/docker.md for installer-based and container-based alternatives.

Run the tests

cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release -j
ctest --test-dir build --output-on-failure

The ctest suite covers tokenizer round-trips, GGUF I/O, sampling, the chat parser, the PEG parser, grammar handling, and a backend-ops conformance test (tests/test-backend-ops.cpp) that compares each backend's kernel results against the CPU baseline. See Testing and ci/README.md for the longer self-hosted CI flow.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.

Getting started – llama.cpp wiki | Factory