Design reference

Target architecture and rationale for the Lightmetrics system.

Publication Boundary

Design notes describe direction unless they explicitly say a behavior is implemented; limitations remain the implementation boundary.

Reference

Cross-check this page against the current limitations page before relying on a behavior operationally.

Goals

Build a small, owned telemetry path for hosts where a full collector stack is too heavy or too generic.

  • minimal agent footprint
  • reasonable server footprint
  • minimal maintenance
  • one server process for ingest, live reads, caching, object storage writes, and rollups
  • object storage as the durable data store
  • disk landing buffer and cache for recent queryable data before object-store landing
  • disk buffer/cache for restart recovery and object-store outages
  • built-in rollup support
  • post-MVP metrics/logs ingest and query performance should be benchmarked against a ClickHouse SharedMergeTree-style design over object storage
  • be honest when ClickHouse is the better engine: a ClickHouse server is doing broadly the same storage, merge, cache, and query work with a much more mature implementation
  • keep a ClickHouse ingest compatibility path for the agent’s Cap’n Proto metrics and logs, using a ClickHouse-specific unframed HTTP INSERT ... FORMAT CapnProto stream/schema rather than the lightmetrics framed wire body

The agent must send:

  • host metrics
  • logs
  • alerts

The server must accept public HTTPS ingest from known hosts, validate identity, deduplicate at-least-once batches, make incoming data queryable immediately, and persist immutable objects to object storage.

Non-Goals

  • OpenTelemetry wire compatibility
  • Cap’n Proto RPC
  • Arrow or Parquet in the agent
  • exactly-once delivery
  • public dashboard access
  • a multi-service backend for the first version

Architecture

flowchart LR
  agent[agent] --> http[HTTPS ingest]
  http --> spool[disk ingest spool]
  http --> landing[disk landing buffer]
  spool --> object[object storage writer]
  object --> raw[raw objects]
  raw --> rollup[rollup worker]
  rollup --> rolled[rollup objects]
  landing --> query[query/UI API]
  rolled --> query

Agent:

  • collect host metrics with direct /proc and filesystem reads
  • tail selected log files or journald later
  • batch records into Cap’n Proto messages
  • write each batch to a local disk queue before upload (sketched after this list)
  • upload via HTTPS POST
  • delete local queue entries only after server acknowledgement
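
A minimal sketch of that queue discipline in Rust; the file layout, names, and fsync placement are illustrative, not the shipped agent behavior:

use std::fs;
use std::path::{Path, PathBuf};

// Write the encoded batch under a temporary name, fsync, then rename it
// into the queue so a crash never leaves a half-written entry visible.
fn enqueue(dir: &Path, seq_start: u64, seq_end: u64, frame: &[u8]) -> std::io::Result<PathBuf> {
    let tmp = dir.join(format!(".tmp-{seq_start}-{seq_end}"));
    fs::write(&tmp, frame)?;
    fs::File::open(&tmp)?.sync_all()?;
    let entry = dir.join(format!("{seq_start}-{seq_end}.lmbatch"));
    fs::rename(&tmp, &entry)?;
    Ok(entry)
}

// Remove a queue entry only after the server acknowledged the batch,
// including duplicate=true responses to retried uploads.
fn ack(entry: &Path) -> std::io::Result<()> {
    fs::remove_file(entry)
}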

Server:

  • terminate HTTPS and authenticate per-host tokens
  • decode Cap’n Proto batches
  • append accepted batches to a local disk spool before acknowledging
  • publish local landing-buffer updates for realtime UI subscriptions after spool append
  • flush immutable chunks to object storage asynchronously
  • maintain a bounded disk cache for recent raw chunks and rollups
  • compute rollups in-process
  • serve a small private query/UI API from local landing buffers, cached data, and object storage
  • keep public ingest and private query listeners separate inside the same process

Wire Format

Use Cap’n Proto messages as the framed payload contents. The schema mirrors OTLP concepts without copying OTLP’s nested structure.

Important design choices (a typed sketch follows the list):

  • stable host labels live on the batch or series, not every sample
  • metrics use series definitions plus sample lists
  • histograms use Prometheus-style cumulative le buckets inside each sample plus explicit temporality for the series
  • alerts are state transitions, not free-form log lines
  • logs and alerts share the same attribute model
  • sequence numbers are mandatory for dedupe and replay
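
As a typed sketch, the batch shape these choices imply looks roughly like the Rust below; the real wire schema is Cap’n Proto, and all names here are illustrative rather than the shipped schema:

struct Batch {
    agent_id: String,
    boot_id: String,
    seq_start: u64,
    seq_end: u64,
    host_labels: Vec<(String, String)>, // stable labels live here, not on samples
    series: Vec<Series>,
    logs: Vec<LogRecord>,
    alerts: Vec<AlertTransition>,
}

struct Series {
    name: String,
    labels: Vec<(String, String)>,
    temporality: Temporality, // explicit per series, not per sample
    samples: Vec<Sample>,
}

enum Temporality { Instant, Cumulative, Delta }

struct Sample {
    unix_ns: u64,
    start_unix_ns: u64, // cumulative reset detection
    value: SampleValue,
}

enum SampleValue {
    Gauge(f64),
    Counter(f64),
    // Prometheus-style cumulative `le` buckets inside the sample.
    Histogram { le_bounds: Vec<f64>, cumulative_counts: Vec<u64>, sum: f64, count: u64 },
}

struct LogRecord {
    unix_ns: u64,
    severity: u8,
    attrs: Vec<(String, String)>, // shared attribute model with alerts
    message: String,
}

struct AlertTransition {
    unix_ns: u64,
    name: String,
    state: String, // a state transition, not a free-form line
    attrs: Vec<(String, String)>,
}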

Transport

Agent to collector:

POST /ingest/v1/batch
Content-Type: application/x-capnp
Authorization: Bearer <per-host-token>
X-Lightmetrics-Agent: <agent-id>
X-Lightmetrics-Seq-Start: <seq>
X-Lightmetrics-Seq-End: <seq>

HTTP Content-Encoding is not used in v1. Compression is represented by frame flags so the same bytes can be reused in the agent queue, HTTP body, server spool, and raw object storage.

The HTTP request body is a framed payload, not a bare Cap’n Proto message:

magic: 8 bytes
frame_version: u16 little-endian
flags: u16 little-endian
payload_len: u64 little-endian
payload_crc32: u32 little-endian
payload: packed Cap’n Proto batch, optionally compressed

Initial flags:

  • 0x0001: payload is packed Cap’n Proto
  • 0x0002: payload is zstd-compressed

Unknown frame flags are rejected until the protocol defines an explicit optional vs. required flag split.
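
A minimal sketch of the frame encode/decode path under these rules, assuming the crc32fast crate; the magic constant is a placeholder, not the final on-wire value:

const MAGIC: &[u8; 8] = b"LMBATCH1"; // placeholder magic, 8 bytes
const FLAG_PACKED: u16 = 0x0001;
const FLAG_ZSTD: u16 = 0x0002;
const KNOWN_FLAGS: u16 = FLAG_PACKED | FLAG_ZSTD;
const HEADER_LEN: usize = 8 + 2 + 2 + 8 + 4;

fn encode_frame(payload: &[u8], flags: u16) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len());
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&1u16.to_le_bytes()); // frame_version
    out.extend_from_slice(&flags.to_le_bytes());
    out.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    out.extend_from_slice(&crc32fast::hash(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

fn decode_frame(body: &[u8]) -> Result<(u16, &[u8]), &'static str> {
    if body.len() < HEADER_LEN || &body[..8] != MAGIC {
        return Err("bad magic or truncated header");
    }
    if u16::from_le_bytes([body[8], body[9]]) != 1 {
        return Err("unsupported frame_version");
    }
    let flags = u16::from_le_bytes([body[10], body[11]]);
    if flags & !KNOWN_FLAGS != 0 {
        // Rejected until the optional vs. required flag split exists.
        return Err("unknown frame flags");
    }
    let len = u64::from_le_bytes(body[12..20].try_into().unwrap()) as usize;
    let crc = u32::from_le_bytes(body[20..24].try_into().unwrap());
    let payload = &body[HEADER_LEN..];
    if payload.len() != len {
        return Err("payload length mismatch");
    }
    if crc32fast::hash(payload) != crc {
        return Err("payload crc mismatch");
    }
    Ok((flags, payload))
}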

The framed body is the lightmetrics wire format, not ClickHouse input. The schema should still allow a ClickHouse-specific sender/exporter to write the same logical metrics/logs as an unframed Cap’n Proto stream through ClickHouse HTTP INSERT ... FORMAT CapnProto. That requires a ClickHouse-facing schema/table layout and may require a self-managed ClickHouse target if a given ClickHouse Cloud service does not support Cap’n Proto input. A ClickHouse-backed server mode can be added after MVP if the custom object-store path is not competitive enough for real ingest/query workloads.

Successful ingest response:

{"status":"ok","agent_id":"vpn","boot_id":"...","seq_start":10,"seq_end":20,"duplicate":false}

Duplicate batches return the same successful shape with duplicate=true, so agents can safely delete already-accepted queue entries. Invalid frames return 400. Authentication failures return 401 or 403. Full spool or intentional backpressure returns 503 and must not acknowledge the batch.

Transport must stay HTTP/3-ready even if the MVP listener starts with HTTP/1.1 or HTTP/2. Keep ingest semantics independent of connection ordering, TCP behavior, and client source ports so the same /ingest/v1/batch contract can run over QUIC. The first runtime listener is TLS 1.3 over HTTP/1.1 with rustls; HTTP/3/QUIC should be added as a second listener adapter over the same ingest service.

TLS 1.3 0-RTT may be enabled only for the public ingest endpoint, and only because batch upload is replay-safe by design:

  • the server authenticates every early-data request before accepting it
  • accepted batches are deduped by (agent_id, boot_id, seq_start, seq_end)
  • the dedupe check is atomic with every ingest side effect; replayed batches must not append to spool, update live state, write raw objects, or enqueue rollup work
  • duplicate batches return duplicate=true without creating another ingest side effect
  • dedupe state must be persisted for at least the deployment’s effective early-data replay horizon, including TLS session-ticket lifetime and any TLS anti-replay policy window
  • query, UI, admin, token management, and any non-idempotent endpoint must reject early data

If a deployment cannot maintain that dedupe horizon across restarts, disable 0-RTT for that deployment while still allowing HTTP/3 without early data.

The server must enforce public ingest limits before and during decode:

  • maximum frame/body bytes
  • maximum series per batch
  • maximum samples per series
  • maximum histogram buckets per sample
  • maximum labels per series
  • maximum label key/value bytes
  • maximum log message bytes
  • maximum logs per batch
  • maximum alerts per batch

Requests beyond these limits return 413 or 400 before allocation-heavy decoding or live-state updates.
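
Sketched as a typed struct, the limits might look like the following; field names, and where each check maps to 413 versus 400, are illustrative:

struct IngestLimits {
    max_body_bytes: usize,
    max_series_per_batch: usize,
    max_samples_per_series: usize,
    max_histogram_buckets_per_sample: usize,
    max_labels_per_series: usize,
    max_label_key_value_bytes: usize,
    max_log_message_bytes: usize,
    max_logs_per_batch: usize,
    max_alerts_per_batch: usize,
}

enum IngestReject {
    PayloadTooLarge, // 413, checked before the body is buffered
    BadRequest,      // 400, structural limits hit during decode
}

// Cheap pre-decode check; streaming reads must also stop at max_body_bytes
// in case Content-Length is missing or wrong.
fn precheck(limits: &IngestLimits, content_length: Option<u64>) -> Result<(), IngestReject> {
    match content_length {
        Some(n) if n as usize > limits.max_body_bytes => Err(IngestReject::PayloadTooLarge),
        _ => Ok(()),
    }
}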

Configuration

Use TOML as the runtime configuration format.

Rationale:

  • comments are supported
  • typed enough for simple host/server config
  • less ambiguous than YAML
  • easier to parse from Rust than CUE
  • no generated config artifact is required for the first version

Runtime files:

/etc/lightmetrics/agent.toml
/etc/lightmetrics/server.toml

Secrets stay outside config. The config points at token/certificate files by path.

Parser choice:

  • Start with TOML parsing at process startup.
  • Measure the agent binary before committing to a parser crate.
  • If a general TOML parser is too expensive, keep the file format TOML but support a constrained subset in the agent.
  • Do not use YAML for runtime host config.

CUE can still be used later for fleet-level generation or validation, but it should not be required on hosts and should not be a runtime dependency.

Server config must define separate listeners:

  • public ingest listener for /ingest/v1/batch
  • private query listener for /api/v1/*

The ingest listener must not route query, UI, or admin paths. The query listener binds to localhost or a private/Tailscale address.
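
A sketch of that config shape as typed Rust, assuming serde and the toml crate; key names are illustrative, not the final server.toml schema:

use serde::Deserialize;

#[derive(Deserialize)]
struct ServerConfig {
    ingest: Listener, // public, routes only /ingest/v1/batch
    query: Listener,  // localhost or a private/Tailscale address, /api/v1/*
    token_file: std::path::PathBuf, // secrets stay outside the config file
}

#[derive(Deserialize)]
struct Listener {
    bind: std::net::SocketAddr,
    tls_cert_file: Option<std::path::PathBuf>,
    tls_key_file: Option<std::path::PathBuf>,
}

fn load(path: &std::path::Path) -> Result<ServerConfig, Box<dyn std::error::Error>> {
    Ok(toml::from_str(&std::fs::read_to_string(path)?)?)
}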

Storage

Object storage is the system of record. The server should write immutable compressed objects and small manifests instead of mutating large files.

Initial layout:

raw/v1/date=YYYY-MM-DD/hour=HH/agent=<agent-id>/<boot-id>-<seq-start>-<seq-end>.lmbatch.zst
rollup/v1/window=1m/date=YYYY-MM-DD/hour=HH/metric=<metric-name>/<chunk-id>.lmrollup.zst
rollup/v1/window=5m/date=YYYY-MM-DD/metric=<metric-name>/<chunk-id>.lmrollup.zst
index/v1/date=YYYY-MM-DD/hour=HH/<chunk-id>.lmindex.zst
manifests/v1/date=YYYY-MM-DD/hour=HH/<chunk-id>.json

The raw object body can remain the same Cap’n Proto batch frames used on the wire. Rollup and index objects can use dedicated Cap’n Proto schemas once query patterns are clearer.
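
Object keys can then be plain string builders over this layout, as in the illustrative sketch below; date formatting and callers are elided:

fn raw_object_key(date: &str, hour: u8, agent_id: &str, boot_id: &str, seq_start: u64, seq_end: u64) -> String {
    format!("raw/v1/date={date}/hour={hour:02}/agent={agent_id}/{boot_id}-{seq_start}-{seq_end}.lmbatch.zst")
}

fn rollup_object_key(window: &str, date: &str, hour: Option<u8>, metric: &str, chunk_id: &str) -> String {
    match hour {
        // 1m rollups keep the hour partition; coarser windows drop it.
        Some(h) => format!("rollup/v1/window={window}/date={date}/hour={h:02}/metric={metric}/{chunk_id}.lmrollup.zst"),
        None => format!("rollup/v1/window={window}/date={date}/metric={metric}/{chunk_id}.lmrollup.zst"),
    }
}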

Landing Buffer and Realtime Views

Incoming data is available to private realtime views after durable local acceptance:

  1. append batch to the local disk spool or configured landing buffer
  2. publish a bounded local update notice for connected private UI subscribers

The landing buffer is the pre-object-store durability boundary. It may be the ingest spool itself or a separate disk buffer if that better separates raw acceptance from query/read optimization. Memory is a cache and subscription fanout aid, not the authoritative live query source.

Realtime metric views should update charts when new accepted metric batches arrive, without relying only on periodic dashboard refresh. The server should expose this as bounded private UI event streams over Server-Sent Events. Event payloads identify changed hosts, metric series, rollup windows, and query ranges; clients then fetch bounded deltas from the same query APIs used for normal views. Do not make an unbounded metrics streaming protocol part of MVP query correctness.

Realtime log tail uses the same model: the SSE stream announces newly accepted log records or tail cursors after durable local acceptance, while bounded log fetches read from the landing buffer and object storage.
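
A sketch of the bounded notice payloads this implies, as Rust types; variant and field names are assumptions, and clients always follow a notice with a bounded fetch against the normal query APIs rather than reading data from the stream:

enum UiUpdateNotice {
    // Accepted metric batches: identifies what changed so clients can
    // re-query the affected range.
    MetricsChanged {
        agent_id: String,
        series: Vec<String>,
        rollup_window: Option<String>,
        range_unix_ns: (u64, u64),
    },
    // Newly accepted log records: the cursor scopes the follow-up tail fetch.
    LogTailAdvanced {
        agent_id: String,
        boot_id: String,
        after_cursor: String,
    },
}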

Query correctness comes from merging:

  • local landing-buffer data for the not-yet-landed tail
  • object storage for landed raw chunks and rollups
  • optional disk/memory caches that can be rebuilt or discarded

If object storage is unavailable, queries may succeed only when the requested range is wholly covered by the local landing-buffer horizon. Otherwise return a bounded partial/error response with explicit partial=true or Prometheus-style error metadata, depending on the API.

Bounded behavior:

  • If the agent queue is full, the agent drops the newest non-critical logs first, keeps metrics and alerts as long as possible, and emits a local agent.queue_dropped alert record when it can.
  • If the server spool is full, ingest returns 503 without acknowledgement.
  • If object storage is down, the server keeps accepting until the spool reaches its cap, then applies the same 503 backpressure.
  • Disk cache eviction must not remove spool files that have not been persisted to object storage.
  • Queries must not depend on memory-only data. They must either load requested data from the landing buffer/object storage or return an explicit partial/error response.

Rollups

Rollups are a server responsibility, not an agent responsibility.

First metric rollups:

  • gauges: count, min, max, sum, last
  • counters: first, last, delta, rate
  • histograms: merge compatible bucket boundaries by summing bucket counts, count, and sum

Metric temporality:

  • gauges use instant
  • counters should use cumulative with startUnixNs for reset detection
  • histograms should use delta when possible
  • cumulative histograms must be converted to deltas before rollup aggregation

Histogram aggregation requires identical bucket boundaries for a given series after temporality conversion. The server should reject or quarantine histogram samples that change bucket layout within the same series identity. For cumulative histograms, the server detects resets when startUnixNs changes or bucket/count values decrease, and does not merge across reset boundaries.
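
Counters apply the same reset rule in scalar form. A minimal sketch of cumulative-counter first/last/delta/rate aggregation, assuming samples are already ordered by timestamp within one series identity:

struct CounterSample {
    unix_ns: u64,
    start_unix_ns: u64,
    value: f64,
}

struct CounterRollup {
    first: f64,
    last: f64,
    delta: f64,
    rate_per_sec: f64,
}

fn counter_rollup(samples: &[CounterSample]) -> Option<CounterRollup> {
    let (first, last) = (samples.first()?, samples.last()?);
    let mut delta = 0.0;
    for pair in samples.windows(2) {
        let (a, b) = (&pair[0], &pair[1]);
        // A startUnixNs change or a value decrease marks a reset; after a
        // reset the counter restarted from zero, so the whole new value
        // counts toward the delta instead of b.value - a.value.
        if b.start_unix_ns != a.start_unix_ns || b.value < a.value {
            delta += b.value;
        } else {
            delta += b.value - a.value;
        }
    }
    let secs = last.unix_ns.saturating_sub(first.unix_ns) as f64 / 1e9;
    Some(CounterRollup {
        first: first.value,
        last: last.value,
        delta,
        rate_per_sec: if secs > 0.0 { delta / secs } else { 0.0 },
    })
}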

Approximate percentiles:

  • derive p50/p90/p95/p99 at query time from merged cumulative buckets (see the sketch after this list)
  • expose them in API/UI responses as derived values
  • do not store percentiles as the only rollup representation
  • keep bucket counts, total count, and sum so percentiles can be recomputed for different windows
  • document that percentile accuracy is bounded by bucket width
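
A sketch of that derivation over merged cumulative le buckets, interpolating linearly inside the winning bucket the way Prometheus histogram_quantile does; the +Inf bucket is represented as f64::INFINITY, and accuracy is bounded by bucket width:

// le_bounds and cumulative_counts are parallel, ascending, and end with the
// +Inf bucket, whose count equals the total sample count.
fn bucket_quantile(q: f64, le_bounds: &[f64], cumulative_counts: &[u64]) -> Option<f64> {
    let total = *cumulative_counts.last()? as f64;
    if total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total;
    let mut lower_bound = 0.0; // assumes non-negative observations in the first bucket
    let mut lower_count = 0.0;
    for (&le, &count) in le_bounds.iter().zip(cumulative_counts) {
        let count = count as f64;
        if count >= rank {
            if le.is_infinite() {
                // Quantile falls in the +Inf bucket: clamp to the last finite bound.
                return Some(lower_bound);
            }
            let span = count - lower_count;
            let frac = if span > 0.0 { (rank - lower_count) / span } else { 1.0 };
            return Some(lower_bound + (le - lower_bound) * frac);
        }
        lower_bound = le;
        lower_count = count;
    }
    None
}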

Rollup windows:

  • 1 minute
  • 5 minutes
  • 1 hour

Logs and alerts should get compact time indexes first. Do not build full-text search until the access pattern proves it is needed.

Dashboard Definitions

Dashboard tabs and dashboard panels should be defined by configuration, not hard-coded as separate dashboard-specific UI surfaces. The private UI’s top-level navigation is the dashboard surface: built-in dashboard definitions provide default tabs, and site-specific dashboard definitions can add, hide, reorder, or override those tabs. Built-in operations tabs should also be represented as dashboard definitions that compose fixed private-UI block types. The implementation of those block types, backend APIs, query execution, validation, and route handlers remains ordinary code.

Built-in dashboards and site-specific dashboards use the same definition schema.

Definition loading:

  • ship built-in dashboard definitions with stable IDs
  • load custom dashboard definitions from configured files such as /etc/lightmetrics/dashboards.d/*.toml
  • treat built-ins as fallback/read-only definitions
  • let custom definitions override, hide, or extend built-ins only through explicit matching IDs and override modes
  • allow explicit tab ordering and grouping so custom dashboards can appear as normal navigation entries instead of a separate custom-dashboard area
  • reject ambiguous duplicate custom IDs at startup or config reload

Definition language:

  • author dashboard definitions as TOML files, matching the rest of runtime configuration
  • parse authoring TOML into a typed internal schema; expose that canonical schema to tests, validation, and the private UI as structured data
  • make the language declarative: definitions choose datasets, controls, transforms, component kinds, fields, thresholds, states, and links
  • do not allow arbitrary JavaScript, HTML, CSS, SQL, shell commands, or plugin code in dashboard definitions
  • keep frontend rendering behind a fixed component registry, such as collector_strip, segmented_mode, kpi_strip, table, timeseries, log_table, alert_table, host_identity, recent_uploads, and query_debug_shell

Definition fields should cover:

  • dashboard ID, title, version, tags, source, and default refresh
  • route, tab order, grouping, visibility, disabled state, and override mode
  • navigation overflow behavior so narrow screens or many custom definitions keep the active tab visible and move lower-priority tabs into an overflow menu
  • variables and controls for time range, refresh interval, region, host/agent, boot ID, metric name, labels, rollup window, severity, and alert state
  • time range presets, custom absolute/rolling ranges, and a display-only metric resolution hint derived from the selected range
  • datasets such as hosts, collectors, agent ingest state, live metrics, logs, alerts, rollups, object-store health, and Prometheus-compatible metric queries
  • bounded transforms such as filter, sort, limit, group, count, sum, mean, median, max, p80, p95, p99, current value selection, freshness classification, and derived state
  • blocks with type, title, query target, units, thresholds, links, and empty, loading, stale, partial, and error states
  • mode definitions that swap related controls, KPI strips, state filters, and table columns together, as in the Fleet overview Host health vs Agent ingest status switch
  • table-column display controls that choose two statistics from a bounded set, as in the Fleet overview CPU/memory/disk/net current-vs-rollup selector
  • links into host detail, logs, alerts, ingest/storage health, live metrics, and metrics query/debug views

Expressions inside dashboard definitions should be a small typed subset for field references, arithmetic, comparisons, boolean operations, duration/unit literals, and named aggregations such as count, count_where, sum, mean, median, max, p80, p95, and p99. Anything more complex belongs in server-side code or a new registered transform.
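
One way to keep that subset closed is a fixed AST with one variant per allowed operation; the sketch below is illustrative, not the shipped grammar:

enum Expr {
    Field(String),                 // field references
    Number(f64),
    Bool(bool),
    Duration(std::time::Duration), // duration/unit literals
    Arith(Box<Expr>, ArithOp, Box<Expr>),
    Cmp(Box<Expr>, CmpOp, Box<Expr>),
    Logic(Box<Expr>, LogicOp, Box<Expr>),
    Not(Box<Expr>),
    Agg(AggFn, Box<Expr>),         // named aggregations only, nothing user-defined
}

enum ArithOp { Add, Sub, Mul, Div }
enum CmpOp { Eq, Ne, Lt, Le, Gt, Ge }
enum LogicOp { And, Or }
enum AggFn { Count, CountWhere, Sum, Mean, Median, Max, P80, P95, P99 }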

Top-level built-in tabs should map to the definition model:

  • Fleet overview: header controls, collapsible active-active collector strip, segmented modes, KPI strips, state chips, selected-stat controls, sortable mode-specific tables, and row click-through to Host detail
  • Host detail: parameterized host route with identity, metrics, logs, alerts, and recent-upload blocks
  • Live metrics: landing-buffer-backed controls, realtime chart update notices, freshness states, and latest-series table
  • Logs and alerts: filter controls plus typed log and alert tables
  • Ingest and storage health: collector, object-store, spool, dedupe, conflict, rollup, and partial-query panels
  • Metrics query/debug: a configured shell around the hard-coded query executor and result renderers

Dashboard definitions are the source of truth for the private UI. Grafana dashboard export/provisioning is not a correctness dependency for the MVP; the required Grafana integration is PromQL-compatible query serving from the Lightmetrics server. A low-priority read-only export surface may generate Grafana JSON for metric panels that map to stock Grafana Prometheus panels, but logs, alerts, and custom console-only blocks must be omitted or marked unsupported. Dashboards edited directly in Grafana are outside Lightmetrics configuration in MVP; there is no Grafana-to-Lightmetrics import path in the first version.

MVP dashboard management is file-based. A UI editor for dashboards is post-MVP. Do not accept arbitrary Grafana JSON as a first-class input format. Any Grafana-to-Lightmetrics converter or reuse of Grafana dashboard/frontend components is post-MVP and lower priority than PromQL compatibility.

Grafana Compatibility

Expose a Prometheus HTTP API compatibility target for metrics. Grafana’s built-in Prometheus data source works with systems that implement the Prometheus query API, so this is the required Grafana integration boundary for MVP.

Initial endpoints:

GET /api/v1/query
GET /api/v1/query_range
GET /api/v1/labels
GET /api/v1/label/<name>/values
GET /api/v1/series
GET /api/v1/metadata

Query language should start as a constrained PromQL/MetricsQL subset (a typed sketch follows the list):

  • instant vector selectors: metric_name and metric_name{label="value",label!="value"}
  • range selectors where required by functions: metric_name[5m]
  • exact label matchers first; regex matchers can be added only if Grafana variable workflows require them
  • rate() for counters over range selectors
  • sum, avg, min, max, and count over one expression
  • by (...) grouping for aggregations
  • histogram bucket queries using Prometheus-style _bucket, _sum, and _count virtual series
  • histogram_quantile() over bucket rollups
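
Sketched as closed Rust types, the supported query surface stays enumerable; names are illustrative, and regex matchers are deliberately absent until Grafana variable workflows require them:

enum LabelMatch {
    Eq(String, String), // label="value"
    Ne(String, String), // label!="value"
}

struct VectorSelector {
    metric_name: String,
    matchers: Vec<LabelMatch>,
    range_secs: Option<u64>, // Some(300) for metric_name[5m]
}

enum Query {
    Instant(VectorSelector),
    Rate(VectorSelector), // requires range_secs to be set
    Agg {
        op: AggOp,
        by: Vec<String>, // by (...) grouping
        expr: Box<Query>,
    },
    HistogramQuantile {
        q: f64,
        buckets: VectorSelector, // selects the _bucket virtual series
    },
}

enum AggOp { Sum, Avg, Min, Max, Count }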

Reject unsupported PromQL explicitly with Prometheus-style bad_data responses. Do not silently reinterpret:

  • binary operators and joins
  • subqueries
  • offset
  • logical/set operators: and, or, unless
  • recording-rule semantics
  • arbitrary nested functions outside the supported forms
  • Prometheus staleness semantics beyond “series exists in requested time window”

For query_range, object storage is the canonical source and the local spool/landing buffer overlays the not-yet-landed tail. The query engine must merge by batch identity and series timestamp and dedupe repeated at-least-once uploads. If object storage is unavailable, return a Prometheus-style error unless the requested range is wholly inside the local landing-buffer horizon.

The P0 direct-selector implementation evaluates query_range at start + n * step timestamps with a fixed 5-minute lookback, matching the Prometheus graph-panel shape before broader PromQL execution is available. When only the local accepted spool is configured, successful range responses include a Prometheus warnings entry that marks the result as local-spool-only partial coverage. Full object-store horizon checks remain part of the later object-store query merge work.
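
A sketch of that evaluation shape, with storage access elided behind a lookup closure; the 300-second constant is the fixed lookback described above:

const LOOKBACK_SECS: f64 = 300.0;

// Evaluate one series at start + n * step, taking the newest sample within
// the lookback window at each evaluation timestamp.
fn eval_range(
    start: f64,
    end: f64,
    step: f64,
    mut newest_within: impl FnMut(f64, f64) -> Option<f64>,
) -> Vec<(f64, f64)> {
    let mut out = Vec::new();
    if step <= 0.0 {
        return out;
    }
    let mut ts = start;
    while ts <= end {
        if let Some(value) = newest_within(ts, LOOKBACK_SECS) {
            out.push((ts, value));
        }
        ts += step;
    }
    out
}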

Compatibility target (a typed sketch follows the list):

  • Prometheus JSON response envelope: status, data, errorType, error, warnings
  • Prometheus timestamp/value pair format: [unix_seconds_float, "value"]
  • matrix responses for query_range
  • vector responses for instant query
  • standard label and series response shapes for Grafana variables
  • return Prometheus-style errors for unsupported syntax instead of silently changing semantics
  • smoke-test against a pinned stock Grafana container with a provisioned Prometheus data source and a dashboard fixture that uses variables, query, query_range, and histogram_quantile()
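
Sketched as serde types, the envelope and matrix shapes look like the following; this mirrors the documented Prometheus response format, trimmed to what the subset needs:

use serde::Serialize;
use std::collections::BTreeMap;

#[derive(Serialize)]
struct PromResponse<T> {
    status: &'static str, // "success" or "error"
    #[serde(skip_serializing_if = "Option::is_none")]
    data: Option<T>,
    #[serde(rename = "errorType", skip_serializing_if = "Option::is_none")]
    error_type: Option<&'static str>, // "bad_data" for rejected PromQL
    #[serde(skip_serializing_if = "Option::is_none")]
    error: Option<String>,
    #[serde(skip_serializing_if = "Vec::is_empty")]
    warnings: Vec<String>,
}

#[derive(Serialize)]
struct RangeData {
    #[serde(rename = "resultType")]
    result_type: &'static str, // "matrix" for query_range, "vector" for query
    result: Vec<MatrixSeries>,
}

#[derive(Serialize)]
struct MatrixSeries {
    metric: BTreeMap<String, String>,
    values: Vec<(f64, String)>, // [unix_seconds_float, "value"] pairs
}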

Do not build a custom Grafana data source plugin for v1. A plugin creates a new maintenance surface, while Prometheus-compatible endpoints can be used by stock Grafana. Grafana dashboard export/provisioning is post-MVP and must not distract from PromQL compatibility.

Logs and alerts do not need to be Grafana-native in v1. Keep the server’s own JSON/Arrow API for those. If Grafana log browsing becomes important, add a Loki-compatible query surface later rather than inventing a plugin first.

Arrow/Perspective UI is post-MVP. Keep it behind a build feature such as ui-perspective so the default server binary does not include Arrow IPC or frontend assets. The MVP private UI path is the built-in console plus config-defined dashboard definitions, backed by the Prometheus-compatible metrics API and bounded JSON endpoints. Grafana support in MVP means stock Grafana can query Lightmetrics through the Prometheus-compatible API.

Logs and Alerts API

MVP logs and alerts use bounded JSON APIs.

GET /api/v1/logs
GET /api/v1/logs/tail
GET /api/v1/alerts

Log query filters:

  • from
  • to
  • host
  • severity
  • target
  • contains
  • limit
  • cursor
  • order

Alert query filters:

  • from
  • to
  • host
  • name
  • state
  • limit
  • cursor

Alerts are ingested alert state records in MVP. The server does not evaluate alert rules, send notifications, silence alerts, or expose an Alertmanager-compatible API in the first version.

Use Server-Sent Events for live tailing in MVP. WebSocket and Arrow IPC are optional post-MVP paths.

The current log-tail SSE endpoint is GET /api/v1/logs/tail on the private query listener. It requires the query bearer token and emits bounded JSON log_tail_snapshot, log_tail_update, and log_tail_gap events. Event records are loaded from the same accepted-log storage visibility source as GET /api/v1/logs; clients use the returned opaque after cursor to repair a reconnect or lag gap with another bounded tail request scoped to the matching agent_id and boot_id. The endpoint is not a durable event journal and does not provide text search.
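
A sketch of the client-side contract those events imply; LogRecord, the event shapes, and fetch_logs are illustrative stand-ins, not the real API types:

struct LogRecord {
    unix_ns: u64,
    message: String,
}

enum TailEvent {
    Snapshot { records: Vec<LogRecord>, after: String },
    Update { records: Vec<LogRecord>, after: String },
    Gap { after: String },
}

// Stand-in for a bounded GET /api/v1/logs request scoped to one agent/boot.
fn fetch_logs(_agent_id: &str, _boot_id: &str, _after: &str, _limit: usize) -> Vec<LogRecord> {
    Vec::new()
}

fn handle(event: TailEvent, agent_id: &str, boot_id: &str) -> Vec<LogRecord> {
    match event {
        TailEvent::Snapshot { records, .. } | TailEvent::Update { records, .. } => records,
        // A gap means the bounded stream dropped events; repair it with
        // another bounded tail fetch from the last good cursor.
        TailEvent::Gap { after } => fetch_logs(agent_id, boot_id, &after, 500),
    }
}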

With ui-perspective enabled, add bounded Arrow IPC endpoints for table-oriented UI loads:

GET /api/v1/metrics.arrow
GET /api/v1/logs.arrow
GET /api/v1/alerts.arrow

Do not use Arrow for unbounded live tails. Keep live streams as SSE JSON notices and fetch bounded Arrow deltas when the Perspective UI needs table updates.

Delivery

Delivery is at-least-once.

The agent writes encoded batches to a local append-only queue before upload. A batch is removed only after the collector returns success. If the agent retries after an ambiguous failure, the collector deduplicates by:

(agent_id, boot_id, seq_start, seq_end)
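
A minimal sketch of that dedupe check; the in-memory HashSet stands in for a persistent set that must survive restarts for at least the 0-RTT replay horizon described earlier:

use std::collections::HashSet;

#[derive(Hash, PartialEq, Eq)]
struct BatchKey {
    agent_id: String,
    boot_id: String,
    seq_start: u64,
    seq_end: u64,
}

struct Dedupe {
    seen: HashSet<BatchKey>,
}

impl Dedupe {
    // Returns true if the batch is new. The check must be atomic with every
    // ingest side effect: spool append, live state, raw objects, rollup work.
    fn accept(&mut self, key: BatchKey) -> bool {
        self.seen.insert(key)
    }
}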

Security

The public ingest endpoint must not expose query or admin APIs.

First version:

  • HTTPS
  • one bearer token per host
  • source IP allow-list where stable
  • collector-side token revocation

mTLS can replace bearer tokens later if certificate distribution is worth the added operational surface.

Footprint

Agent dependencies should stay narrow:

  • no async runtime unless measurements justify it
  • no Cap’n Proto RPC
  • no Arrow or Parquet
  • no embedded query engine

Server can spend more footprint on:

  • HTTPS
  • object storage client
  • zstd
  • in-process query/UI API
  • rollup workers

The server still stays one process for the first version.