grafana/grafana

Design decisions

These are a handful of architectural choices that have shaped Grafana's codebase. Each one is load-bearing: knowing the decision and the reasoning behind it makes large parts of the code more legible.

A monolithic Go binary

Grafana ships as a single Go binary that contains the HTTP server, the plugin host, and the (compiled) frontend bundle. There is no microservice split — even the alerting scheduler, image renderer (when in-process), and live coordinator run inside the same process.

Why: The original deployment story prioritized "drop a binary on a machine, run it, you're done." That target is still met today: SQLite + the binary + a config file is all you need to demo Grafana. Microservice boundaries do exist in spirit (services have clean interfaces, plugins run as subprocesses) but the deployment artifact is still one binary.

Cost: A monolithic process forces all subsystems into the same memory and lifecycle. The pkg/modules/ package was added to allow splitting some workloads (e.g. an "alerting-only" mode), but the dominant deployment is still single-process.
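The idea of running only part of the monolith can be sketched as a lookup from a target name to the services that should start. This is a minimal illustration, not the pkg/modules/ API: the target names, the service type, and the registry contents are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical target names; the real pkg/modules/ package defines its own.
const (
	targetAll      = "all"
	targetAlerting = "alerting-only"
)

// service is a stand-in for a long-running subsystem inside the binary.
type service struct {
	name    string
	targets []string // targets under which this service should start
}

// registry lists illustrative services and the targets that enable them.
func registry() []service {
	return []service{
		{name: "http-server", targets: []string{targetAll}},
		{name: "alerting-scheduler", targets: []string{targetAll, targetAlerting}},
		{name: "live-coordinator", targets: []string{targetAll}},
	}
}

// selectServices returns the sorted names of services enabled for a target.
func selectServices(all []service, target string) []string {
	var out []string
	for _, s := range all {
		for _, t := range s.targets {
			if t == target {
				out = append(out, s.name)
				break
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(selectServices(registry(), targetAll))
	fmt.Println(selectServices(registry(), targetAlerting))
}
```

The same binary serves both cases; only the set of started services changes.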

In-process Wire DI

Service dependencies are wired via Google Wire at compile time. The graph is declared by hand in pkg/server/wire.go and generated to wire_gen.go.

Why: Compile-time DI catches missing dependencies and cycles before tests run, with no runtime reflection cost. Adding a service is a localized change — declare the constructor, add it to the right wire.NewSet, regenerate.

Cost: Onboarding requires understanding Wire's syntax and conventions. This is a one-time cost for each contributor who adds a service.
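As a rough sketch of what compile-time DI amounts to, the program below shows three constructors and the plain Go function an injector boils down to. Config, Store, AlertService, and Initialize are hypothetical names for illustration, not Grafana's services. The Wire declaration appears only in a comment so the example runs without the Wire dependency.

```go
package main

import "fmt"

// Config, Store, and AlertService are hypothetical services.
type Config struct{ DSN string }

type Store struct{ cfg *Config }

type AlertService struct{ store *Store }

// Constructors ("providers" in Wire terms). Their parameter types are the
// edges of the dependency graph.
func NewConfig() *Config { return &Config{DSN: "sqlite3://grafana.db"} }

func NewStore(cfg *Config) *Store { return &Store{cfg: cfg} }

func NewAlertService(s *Store) *AlertService { return &AlertService{store: s} }

// With Wire, the hand-written declaration would look roughly like:
//
//	var serviceSet = wire.NewSet(NewConfig, NewStore, NewAlertService)
//
//	func Initialize() (*AlertService, error) {
//		wire.Build(serviceSet)
//		return nil, nil
//	}
//
// The generated counterpart is ordinary Go similar to the function below:
// constructor calls in dependency order, with no reflection at runtime.
func Initialize() (*AlertService, error) {
	cfg := NewConfig()
	store := NewStore(cfg)
	svc := NewAlertService(store)
	return svc, nil
}

func main() {
	svc, err := Initialize()
	if err != nil {
		panic(err)
	}
	fmt.Println(svc.store.cfg.DSN)
}
```

If a constructor were missing from the set, generation would fail instead of the server failing at startup, which is the property the Why paragraph describes.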

CUE as the schema source of truth

Dashboards, panels, and app-platform resources are defined in CUE under kinds/ and apps/<name>/kinds/. Code generation produces Go structs, TypeScript types, and OpenAPI fragments.

Why: Dashboards cross the language boundary at runtime: users edit them in the SPA and the server reads them. Without a single source of truth, the Go and TypeScript types drifted apart frequently.

Cost: CUE has a learning curve. Codegen is noisy — generated files are huge and account for most of the largest non-test files in the repo. Build cycles include running make gen-cue / make gen-apps.

Two dashboard runtimes during the migration

The legacy dashboard/ runtime and the new dashboard-scene/ runtime coexist. Schema versions v1 and v2 coexist alongside them, one stable and one still evolving.

Why: Dashboards are the most-loaded surface in the product; rewrites must not break anyone's dashboards. Coexistence buys the time to migrate features one at a time and to preserve backwards-compatibility for client tools.

Cost: Code lives in both places. Bug fixes sometimes need to land twice. Conversion code (apps/dashboard/pkg/migration/) is large and fragile.

Unified Alerting completely replaced legacy alerting

Alerting was rewritten in Grafana 8. After a deprecation cycle, the legacy engine was removed entirely. New code only targets pkg/services/ngalert/.

Why: The new model (Prometheus-style, multi-tenant, label-based) is fundamentally incompatible with the legacy panel-attached alert model. Maintaining two engines indefinitely was untenable.

Cost: A migration tool was required to convert legacy alerts to the new format. Some niche workflows had no exact replacement and had to be re-modelled.

Plugin processes over gRPC

Backend datasource plugins run as separate processes that the host launches and talks to over gRPC.

Why: Process isolation prevents a misbehaving plugin from crashing the server. Different plugins can use incompatible Go modules without breaking the host's build.

Cost: gRPC adds latency. Bundled plugins (pkg/tsdb/<name>/) are compiled into the host binary as in-process plugins to avoid this for first-party datasources.

Centrifuge for Live channels

The Live subsystem (pkg/services/live/) embeds Centrifuge directly. Centrifuge handles WebSocket framing, channel namespacing, presence, and HA coordination via a Redis broker.

Why: Building real-time messaging from scratch is enough work to justify a dependency.

Cost: Centrifuge has its own API surface, and documenting and debugging it adds to the learning curve. HA requires Redis.

RBAC actions as strings

Access control actions are typed as Go string constants (dashboards:read, users:write, …). They are referenced both at registration time (defining a built-in role) and at use time (annotating a route).

Why: Keeping actions as strings allows custom roles to compose them without compile-time coupling. New actions can be introduced by services without core changes.

Cost: Typos in action names won't be caught at compile time. Convention plus tests are the safety net; reviewers should look closely at any new action names.
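The safety net described above can be sketched as a validation step that a test runs over every role definition. The constant names, the Role type, the naming pattern, and validateRole are hypothetical and do not mirror Grafana's access control package. Only the action strings dashboards:read and users:write come from the text.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical constant names for the actions mentioned in the text.
const (
	ActionDashboardsRead = "dashboards:read"
	ActionUsersWrite     = "users:write"
)

// actionPattern encodes an assumed convention: "<resource>:<verb>", lowercase.
var actionPattern = regexp.MustCompile(`^[a-z][a-z.]*:[a-z]+$`)

// Role composes actions by string, so custom roles need no compile-time link
// to the service that defined an action.
type Role struct {
	Name    string
	Actions []string
}

// Allows reports whether the role grants the given action.
func (r Role) Allows(action string) bool {
	for _, a := range r.Actions {
		if a == action {
			return true
		}
	}
	return false
}

// validateRole rejects malformed or unregistered actions. The compiler cannot
// catch a typo in a string, so a check like this has to run in tests.
func validateRole(r Role, registered map[string]bool) error {
	for _, a := range r.Actions {
		if !actionPattern.MatchString(a) {
			return fmt.Errorf("role %q: malformed action %q", r.Name, a)
		}
		if !registered[a] {
			return fmt.Errorf("role %q: unknown action %q", r.Name, a)
		}
	}
	return nil
}

func main() {
	registered := map[string]bool{ActionDashboardsRead: true, ActionUsersWrite: true}
	viewer := Role{Name: "custom:viewer", Actions: []string{ActionDashboardsRead}}
	broken := Role{Name: "custom:broken", Actions: []string{"dashboard:read"}} // typo

	fmt.Println(validateRole(viewer, registered))
	fmt.Println(validateRole(broken, registered))
}
```

The misspelled action is well-formed, so only the registry lookup catches it. That is why a pattern check alone would not be enough.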

Server-side expressions as a "datasource"

Expression queries (the __expr__ ref) are dispatched through the same query orchestration pipeline as real datasource queries. The expression engine registers itself as a datasource backend.

Why: Letting expressions ride the existing pipeline avoided a parallel orchestrator. They get authorization, batching, and observability for free.

Cost: A few orchestrator code paths special-case __expr__ — those are documented and tested.
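A minimal sketch of the pattern: one orchestrator dispatches every query by datasource UID, and the expression engine is just another registered handler. Everything except the __expr__ identifier is hypothetical, including the Query fields, the Handler signature, and the two-pass ordering. The ordering is one illustrative form a special case could take, not a description of Grafana's orchestrator.

```go
package main

import "fmt"

// exprUID is the reserved identifier mentioned in the text.
const exprUID = "__expr__"

// Query is a hypothetical, simplified query model.
type Query struct {
	RefID         string
	DatasourceUID string
	Expr          string // for expressions: the RefID of the input query
}

// Handler evaluates a query; prior holds results computed so far.
type Handler func(q Query, prior map[string]float64) (float64, error)

// Orchestrator dispatches queries to handlers registered by datasource UID.
type Orchestrator struct{ handlers map[string]Handler }

func NewOrchestrator() *Orchestrator {
	return &Orchestrator{handlers: make(map[string]Handler)}
}

func (o *Orchestrator) Register(uid string, h Handler) { o.handlers[uid] = h }

// Run shares one dispatch path for all queries. The only special case is
// ordering: expressions run in a second pass because they read other results.
func (o *Orchestrator) Run(queries []Query) (map[string]float64, error) {
	results := make(map[string]float64)
	for _, exprPass := range []bool{false, true} {
		for _, q := range queries {
			if (q.DatasourceUID == exprUID) != exprPass {
				continue
			}
			h, ok := o.handlers[q.DatasourceUID]
			if !ok {
				return nil, fmt.Errorf("no backend registered for %q", q.DatasourceUID)
			}
			v, err := h(q, results)
			if err != nil {
				return nil, err
			}
			results[q.RefID] = v
		}
	}
	return results, nil
}

// newDemoOrchestrator registers a fake datasource and a toy expression engine.
func newDemoOrchestrator() *Orchestrator {
	o := NewOrchestrator()
	o.Register("fake-ds", func(Query, map[string]float64) (float64, error) {
		return 42, nil
	})
	o.Register(exprUID, func(q Query, prior map[string]float64) (float64, error) {
		in, ok := prior[q.Expr]
		if !ok {
			return 0, fmt.Errorf("expression %s: missing input %q", q.RefID, q.Expr)
		}
		return in * 2, nil // toy expression: double the input
	})
	return o
}

func main() {
	res, err := newDemoOrchestrator().Run([]Query{
		{RefID: "B", DatasourceUID: exprUID, Expr: "A"},
		{RefID: "A", DatasourceUID: "fake-ds"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(res["A"], res["B"])
}
```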

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.