Design reference

Target architecture and rationale for the Lightmetrics system.

Publication Boundary

Design notes describe direction unless they explicitly say a behavior is implemented; limitations remain the implementation boundary.

Reference

Cross-check this page against the current limitations page before relying on a behavior operationally.

Goals

Build a small, owned telemetry path for hosts where a full collector stack is too heavy or too generic.

  • minimal agent footprint
  • reasonable server footprint
  • minimal maintenance
  • one server process for ingest, live reads, caching, object storage writes, and rollups
  • object storage as the durable data store
  • disk landing buffer and cache for recent queryable data before object-store landing
  • disk buffer/cache for restart recovery and object-store outages
  • built-in rollup support
  • post-MVP metrics/logs ingest and query performance should be benchmarked against a ClickHouse SharedMergeTree-style design over object storage
  • be honest when ClickHouse is the better engine: a ClickHouse server is doing broadly the same storage, merge, cache, and query work with a much more mature implementation
  • keep a ClickHouse ingest compatibility path for the agent’s Cap’n Proto metrics and logs, using a ClickHouse-specific unframed HTTP INSERT ... FORMAT CapnProto stream/schema rather than the lightmetrics framed wire body

The agent must send:

  • host metrics
  • logs
  • alerts

The server must accept public HTTPS ingest from known hosts, validate identity, deduplicate at-least-once batches, make incoming data queryable immediately, and persist immutable objects to object storage.

Non-Goals

  • OpenTelemetry wire compatibility
  • Cap’n Proto RPC
  • Arrow or Parquet in the agent
  • exactly-once delivery
  • public dashboard access
  • a multi-service backend for the first version

Architecture

flowchart LR
  agent[agent] --> http[HTTPS ingest]
  http --> spool[disk ingest spool]
  http --> landing[disk landing buffer]
  spool --> object[object storage writer]
  object --> raw[raw objects]
  raw --> rollup[rollup worker]
  rollup --> rolled[rollup objects]
  landing --> query[query/UI API]
  rolled --> query

Agent:

  • collect host metrics with direct /proc and filesystem reads
  • tail selected log files or journald later
  • batch records into Cap’n Proto messages
  • write each batch to a local disk queue before upload (sketched after this list)
  • upload via HTTPS POST
  • delete local queue entries only after server acknowledgement
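
A minimal sketch of that queue discipline in Rust; the file layout, names, and fsync placement are illustrative, not the shipped agent behavior:

use std::fs;
use std::path::{Path, PathBuf};

// Write the encoded batch under a temporary name, fsync, then rename it
// into the queue so a crash never leaves a half-written entry visible.
fn enqueue(dir: &Path, seq_start: u64, seq_end: u64, frame: &[u8]) -> std::io::Result<PathBuf> {
    let tmp = dir.join(format!(".tmp-{seq_start}-{seq_end}"));
    fs::write(&tmp, frame)?;
    fs::File::open(&tmp)?.sync_all()?;
    let entry = dir.join(format!("{seq_start}-{seq_end}.lmbatch"));
    fs::rename(&tmp, &entry)?;
    Ok(entry)
}

// Remove a queue entry only after the server acknowledged the batch,
// including duplicate=true responses to retried uploads.
fn ack(entry: &Path) -> std::io::Result<()> {
    fs::remove_file(entry)
}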

Server:

  • terminate HTTPS and authenticate per-host tokens
  • decode Cap’n Proto batches
  • append accepted batches to a local disk spool before acknowledging
  • publish local landing-buffer updates for realtime UI subscriptions after spool append
  • flush immutable chunks to object storage asynchronously
  • maintain a bounded disk cache for recent raw chunks and rollups
  • compute rollups in-process
  • serve a small private query/UI API from local landing buffers, cached data, and object storage
  • keep public ingest and private query listeners separate inside the same process

Wire Format

Use Cap’n Proto messages as the framed payload contents. The schema mirrors OTLP concepts without copying OTLP’s nested structure.

Important design choices (a typed sketch follows the list):

  • stable host labels live on the batch or series, not every sample
  • metrics use series definitions plus sample lists
  • histograms use Prometheus-style cumulative le buckets inside each sample plus explicit temporality for the series
  • alerts are state transitions, not free-form log lines
  • logs and alerts share the same attribute model
  • sequence numbers are mandatory for dedupe and replay
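
As a typed sketch, the batch shape these choices imply looks roughly like the Rust below; the real wire schema is Cap’n Proto, and all names here are illustrative rather than the shipped schema:

struct Batch {
    agent_id: String,
    boot_id: String,
    seq_start: u64,
    seq_end: u64,
    host_labels: Vec<(String, String)>, // stable labels live here, not on samples
    series: Vec<Series>,
    logs: Vec<LogRecord>,
    alerts: Vec<AlertTransition>,
}

struct Series {
    name: String,
    labels: Vec<(String, String)>,
    temporality: Temporality, // explicit per series, not per sample
    samples: Vec<Sample>,
}

enum Temporality { Instant, Cumulative, Delta }

struct Sample {
    unix_ns: u64,
    start_unix_ns: u64, // cumulative reset detection
    value: SampleValue,
}

enum SampleValue {
    Gauge(f64),
    Counter(f64),
    // Prometheus-style cumulative `le` buckets inside the sample.
    Histogram { le_bounds: Vec<f64>, cumulative_counts: Vec<u64>, sum: f64, count: u64 },
}

struct LogRecord {
    unix_ns: u64,
    severity: u8,
    attrs: Vec<(String, String)>, // shared attribute model with alerts
    message: String,
}

struct AlertTransition {
    unix_ns: u64,
    name: String,
    state: String, // a state transition, not a free-form line
    attrs: Vec<(String, String)>,
}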

Transport

Agent to collector:

POST /ingest/v1/batch
Content-Type: application/x-capnp
Authorization: Bearer <per-host-token>
X-Lightmetrics-Agent: <agent-id>
X-Lightmetrics-Seq-Start: <seq>
X-Lightmetrics-Seq-End: <seq>

HTTP Content-Encoding is not used in v1. Compression is represented by frame flags so the same bytes can be reused in the agent queue, HTTP body, server spool, and raw object storage.

The HTTP request body is a framed payload, not a bare Cap’n Proto message:

magic: 8 bytes
frame_version: u16 little-endian
flags: u16 little-endian
payload_len: u64 little-endian
payload_crc32: u32 little-endian
payload: packed Cap’n Proto batch, optionally compressed

Initial flags:

  • 0x0001: payload is packed Cap’n Proto
  • 0x0002: payload is zstd-compressed

Unknown frame flags are rejected until the protocol defines an explicit optional vs. required flag split.
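
A minimal sketch of the frame encode/decode path under these rules, assuming the crc32fast crate; the magic constant is a placeholder, not the final on-wire value:

const MAGIC: &[u8; 8] = b"LMBATCH1"; // placeholder magic, 8 bytes
const FLAG_PACKED: u16 = 0x0001;
const FLAG_ZSTD: u16 = 0x0002;
const KNOWN_FLAGS: u16 = FLAG_PACKED | FLAG_ZSTD;
const HEADER_LEN: usize = 8 + 2 + 2 + 8 + 4;

fn encode_frame(payload: &[u8], flags: u16) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len());
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&1u16.to_le_bytes()); // frame_version
    out.extend_from_slice(&flags.to_le_bytes());
    out.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    out.extend_from_slice(&crc32fast::hash(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

fn decode_frame(body: &[u8]) -> Result<(u16, &[u8]), &'static str> {
    if body.len() < HEADER_LEN || &body[..8] != MAGIC {
        return Err("bad magic or truncated header");
    }
    if u16::from_le_bytes([body[8], body[9]]) != 1 {
        return Err("unsupported frame_version");
    }
    let flags = u16::from_le_bytes([body[10], body[11]]);
    if flags & !KNOWN_FLAGS != 0 {
        // Rejected until the optional vs. required flag split exists.
        return Err("unknown frame flags");
    }
    let len = u64::from_le_bytes(body[12..20].try_into().unwrap()) as usize;
    let crc = u32::from_le_bytes(body[20..24].try_into().unwrap());
    let payload = &body[HEADER_LEN..];
    if payload.len() != len {
        return Err("payload length mismatch");
    }
    if crc32fast::hash(payload) != crc {
        return Err("payload crc mismatch");
    }
    Ok((flags, payload))
}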

The framed body is the lightmetrics wire format, not ClickHouse input. The schema should still allow a ClickHouse-specific sender/exporter to write the same logical metrics/logs as an unframed Cap’n Proto stream through ClickHouse HTTP INSERT ... FORMAT CapnProto. That requires a ClickHouse-facing schema/table layout and may require a self-managed ClickHouse target if a given ClickHouse Cloud service does not support Cap’n Proto input. A ClickHouse-backed server mode can be added after MVP if the custom object-store path is not competitive enough for real ingest/query workloads.

Successful ingest response:

{"status":"ok","agent_id":"vpn","boot_id":"...","seq_start":10,"seq_end":20,"duplicate":false}

Duplicate batches return the same successful shape with duplicate=true, so agents can safely delete already-accepted queue entries. Invalid frames return 400. Authentication failures return 401 or 403. Full spool or intentional backpressure returns 503 and must not acknowledge the batch.

Transport must stay HTTP/3-ready even if the MVP listener starts with HTTP/1.1 or HTTP/2. Keep ingest semantics independent of connection ordering, TCP behavior, and client source ports so the same /ingest/v1/batch contract can run over QUIC. The first runtime listener is TLS 1.3 over HTTP/1.1 with rustls; HTTP/3/QUIC should be added as a second listener adapter over the same ingest service.

TLS 1.3 0-RTT may be enabled only for the public ingest endpoint, and only because batch upload is replay-safe by design:

  • the server authenticates every early-data request before accepting it
  • accepted batches are deduped by (agent_id, boot_id, seq_start, seq_end)
  • the dedupe check is atomic with every ingest side effect; replayed batches must not append to spool, update live state, write raw objects, or enqueue rollup work
  • duplicate batches return duplicate=true without creating another ingest side effect
  • dedupe state must be persisted for at least the deployment’s effective early-data replay horizon, including TLS session-ticket lifetime and any TLS anti-replay policy window
  • query, UI, admin, token management, and any non-idempotent endpoint must reject early data

If a deployment cannot maintain that dedupe horizon across restarts, disable 0-RTT for that deployment while still allowing HTTP/3 without early data.

The server must enforce public ingest limits before and during decode:

  • maximum frame/body bytes
  • maximum series per batch
  • maximum samples per series
  • maximum histogram buckets per sample
  • maximum labels per series
  • maximum label key/value bytes
  • maximum log message bytes
  • maximum logs per batch
  • maximum alerts per batch

Requests beyond these limits return 413 or 400 before allocation-heavy decoding or live-state updates.
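
Sketched as a typed struct, the limits might look like the following; field names, and where each check maps to 413 versus 400, are illustrative:

struct IngestLimits {
    max_body_bytes: usize,
    max_series_per_batch: usize,
    max_samples_per_series: usize,
    max_histogram_buckets_per_sample: usize,
    max_labels_per_series: usize,
    max_label_key_value_bytes: usize,
    max_log_message_bytes: usize,
    max_logs_per_batch: usize,
    max_alerts_per_batch: usize,
}

enum IngestReject {
    PayloadTooLarge, // 413, checked before the body is buffered
    BadRequest,      // 400, structural limits hit during decode
}

// Cheap pre-decode check; streaming reads must also stop at max_body_bytes
// in case Content-Length is missing or wrong.
fn precheck(limits: &IngestLimits, content_length: Option<u64>) -> Result<(), IngestReject> {
    match content_length {
        Some(n) if n as usize > limits.max_body_bytes => Err(IngestReject::PayloadTooLarge),
        _ => Ok(()),
    }
}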

Configuration

Use TOML as the runtime configuration format.

Rationale:

  • comments are supported
  • typed enough for simple host/server config
  • less ambiguous than YAML
  • easier to parse from Rust than CUE
  • no generated config artifact is required for the first version

Runtime files:

/etc/lightmetrics/agent.toml
/etc/lightmetrics/server.toml

Secrets stay outside config. The config points at token/certificate files by path.

Parser choice:

  • Start with TOML parsing at process startup.
  • Measure the agent binary before committing to a parser crate.
  • If a general TOML parser is too expensive, keep the file format TOML but support a constrained subset in the agent.
  • Do not use YAML for runtime host config.

CUE can still be used later for fleet-level generation or validation, but it should not be required on hosts and should not be a runtime dependency.

Server config must define separate listeners:

  • public ingest listener for /ingest/v1/batch
  • private query listener for /api/v1/*

The ingest listener must not route query, UI, or admin paths. The query listener binds to localhost or a private/Tailscale address.
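
A sketch of that config shape as typed Rust, assuming serde and the toml crate; key names are illustrative, not the final server.toml schema:

use serde::Deserialize;

#[derive(Deserialize)]
struct ServerConfig {
    ingest: Listener, // public, routes only /ingest/v1/batch
    query: Listener,  // localhost or a private/Tailscale address, /api/v1/*
    token_file: std::path::PathBuf, // secrets stay outside the config file
}

#[derive(Deserialize)]
struct Listener {
    bind: std::net::SocketAddr,
    tls_cert_file: Option<std::path::PathBuf>,
    tls_key_file: Option<std::path::PathBuf>,
}

fn load(path: &std::path::Path) -> Result<ServerConfig, Box<dyn std::error::Error>> {
    Ok(toml::from_str(&std::fs::read_to_string(path)?)?)
}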

Storage

Object storage is the system of record. The server should write immutable compressed objects and small manifests instead of mutating large files.

Initial layout:

raw/v1/date=YYYY-MM-DD/hour=HH/agent=<agent-id>/<boot-id>-<seq-start>-<seq-end>.lmbatch.zst
rollup/v1/window=1m/date=YYYY-MM-DD/hour=HH/metric=<metric-name>/<chunk-id>.lmrollup.zst
rollup/v1/window=5m/date=YYYY-MM-DD/metric=<metric-name>/<chunk-id>.lmrollup.zst
index/v1/date=YYYY-MM-DD/hour=HH/<chunk-id>.lmindex.zst
manifests/v1/date=YYYY-MM-DD/hour=HH/<chunk-id>.json

The raw object body can remain the same Cap’n Proto batch frames used on the wire. Rollup and index objects can use dedicated Cap’n Proto schemas once query patterns are clearer.
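
Object keys can then be plain string builders over this layout, as in the illustrative sketch below; date formatting and callers are elided:

fn raw_object_key(date: &str, hour: u8, agent_id: &str, boot_id: &str, seq_start: u64, seq_end: u64) -> String {
    format!("raw/v1/date={date}/hour={hour:02}/agent={agent_id}/{boot_id}-{seq_start}-{seq_end}.lmbatch.zst")
}

fn rollup_object_key(window: &str, date: &str, hour: Option<u8>, metric: &str, chunk_id: &str) -> String {
    match hour {
        // 1m rollups keep the hour partition; coarser windows drop it.
        Some(h) => format!("rollup/v1/window={window}/date={date}/hour={h:02}/metric={metric}/{chunk_id}.lmrollup.zst"),
        None => format!("rollup/v1/window={window}/date={date}/metric={metric}/{chunk_id}.lmrollup.zst"),
    }
}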

Landing Buffer and Realtime Views

Incoming data is available to private realtime views after durable local acceptance:

  1. append batch to the local disk spool or configured landing buffer
  2. publish a bounded local update notice for connected private UI subscribers

The landing buffer is the pre-object-store durability boundary. It may be the ingest spool itself or a separate disk buffer if that better separates raw acceptance from query/read optimization. Memory is a cache and subscription fanout aid, not the authoritative live query source.

Realtime metric views should update charts when new accepted metric batches arrive, without relying only on periodic dashboard refresh. The server should expose this as bounded private UI event streams over Server-Sent Events. Event payloads identify changed hosts, metric series, rollup windows, and query ranges; clients then fetch bounded deltas from the same query APIs used for normal views. Do not make an unbounded metrics streaming protocol part of MVP query correctness.

Realtime log tail uses the same model: the SSE stream announces newly accepted log records or tail cursors after durable local acceptance, while bounded log fetches read from the landing buffer and object storage.
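
A sketch of the bounded notice payloads this implies, as Rust types; variant and field names are assumptions, and clients always follow a notice with a bounded fetch against the normal query APIs rather than reading data from the stream:

enum UiUpdateNotice {
    // Accepted metric batches: identifies what changed so clients can
    // re-query the affected range.
    MetricsChanged {
        agent_id: String,
        series: Vec<String>,
        rollup_window: Option<String>,
        range_unix_ns: (u64, u64),
    },
    // Newly accepted log records: the cursor scopes the follow-up tail fetch.
    LogTailAdvanced {
        agent_id: String,
        boot_id: String,
        after_cursor: String,
    },
}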

Query correctness comes from merging:

  • local landing-buffer data for the not-yet-landed tail
  • object storage for landed raw chunks and rollups
  • optional disk/memory caches that can be rebuilt or discarded

If object storage is unavailable, queries may succeed only when the requested range is wholly covered by the local landing-buffer horizon. Otherwise return a bounded partial/error response with explicit partial=true or Prometheus-style error metadata, depending on the API.

Bounded behavior:

  • If the agent queue is full, the agent drops the newest non-critical logs first, keeps metrics and alerts as long as possible, and emits a local agent.queue_dropped alert record when it can.
  • If the server spool is full, ingest returns 503 without acknowledgement.
  • If object storage is down, the server keeps accepting until the spool reaches its cap, then applies the same 503 backpressure.
  • Disk cache eviction must not remove spool files that have not been persisted to object storage.
  • Queries must not depend on memory-only data. They must either load requested data from the landing buffer/object storage or return an explicit partial/error response.

Rollups

Rollups are a server responsibility, not an agent responsibility.

First metric rollups:

  • gauges: count, min, max, sum, last
  • counters: first, last, delta, rate
  • histograms: merge compatible bucket boundaries by summing bucket counts, count, and sum

Metric temporality:

  • gauges use instant
  • counters should use cumulative with startUnixNs for reset detection
  • histograms should use delta when possible
  • cumulative histograms must be converted to deltas before rollup aggregation

Histogram aggregation requires identical bucket boundaries for a given series after temporality conversion. The server should reject or quarantine histogram samples that change bucket layout within the same series identity. For cumulative histograms, the server detects resets when startUnixNs changes or bucket/count values decrease, and does not merge across reset boundaries.
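
Counters apply the same reset rule in scalar form. A minimal sketch of cumulative-counter first/last/delta/rate aggregation, assuming samples are already ordered by timestamp within one series identity:

struct CounterSample {
    unix_ns: u64,
    start_unix_ns: u64,
    value: f64,
}

struct CounterRollup {
    first: f64,
    last: f64,
    delta: f64,
    rate_per_sec: f64,
}

fn counter_rollup(samples: &[CounterSample]) -> Option<CounterRollup> {
    let (first, last) = (samples.first()?, samples.last()?);
    let mut delta = 0.0;
    for pair in samples.windows(2) {
        let (a, b) = (&pair[0], &pair[1]);
        // A startUnixNs change or a value decrease marks a reset; after a
        // reset the counter restarted from zero, so the whole new value
        // counts toward the delta instead of b.value - a.value.
        if b.start_unix_ns != a.start_unix_ns || b.value < a.value {
            delta += b.value;
        } else {
            delta += b.value - a.value;
        }
    }
    let secs = last.unix_ns.saturating_sub(first.unix_ns) as f64 / 1e9;
    Some(CounterRollup {
        first: first.value,
        last: last.value,
        delta,
        rate_per_sec: if secs > 0.0 { delta / secs } else { 0.0 },
    })
}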

Approximate percentiles:

  • derive p50/p90/p95/p99 at query time from merged cumulative buckets (see the sketch after this list)
  • expose them in API/UI responses as derived values
  • do not store percentiles as the only rollup representation
  • keep bucket counts, total count, and sum so percentiles can be recomputed for different windows
  • document that percentile accuracy is bounded by bucket width
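
A sketch of that derivation over merged cumulative le buckets, interpolating linearly inside the winning bucket the way Prometheus histogram_quantile does; the +Inf bucket is represented as f64::INFINITY, and accuracy is bounded by bucket width:

// le_bounds and cumulative_counts are parallel, ascending, and end with the
// +Inf bucket, whose count equals the total sample count.
fn bucket_quantile(q: f64, le_bounds: &[f64], cumulative_counts: &[u64]) -> Option<f64> {
    let total = *cumulative_counts.last()? as f64;
    if total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total;
    let mut lower_bound = 0.0; // assumes non-negative observations in the first bucket
    let mut lower_count = 0.0;
    for (&le, &count) in le_bounds.iter().zip(cumulative_counts) {
        let count = count as f64;
        if count >= rank {
            if le.is_infinite() {
                // Quantile falls in the +Inf bucket: clamp to the last finite bound.
                return Some(lower_bound);
            }
            let span = count - lower_count;
            let frac = if span > 0.0 { (rank - lower_count) / span } else { 1.0 };
            return Some(lower_bound + (le - lower_bound) * frac);
        }
        lower_bound = le;
        lower_count = count;
    }
    None
}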

Rollup windows:

  • 1 minute
  • 5 minutes
  • 1 hour

Logs and alerts should get compact time indexes first. Do not build full-text search until the access pattern proves it is needed.

Dashboard Definitions

Dashboard tabs and dashboard panels should be defined by configuration, not hard-coded as separate dashboard-specific UI surfaces. The private UI’s top-level navigation is the dashboard surface: built-in dashboard definitions provide default tabs, and site-specific dashboard definitions can add, hide, reorder, or override those tabs. Built-in operations tabs should also be represented as dashboard definitions that compose fixed private-UI block types. The implementation of those block types, backend APIs, query execution, validation, and route handlers remains ordinary code.

Built-in dashboards and site-specific dashboards use the same definition schema.

Definition loading:

  • ship built-in dashboard definitions with stable IDs
  • load custom dashboard definitions from configured files such as /etc/lightmetrics/dashboards.d/*.toml
  • treat built-ins as fallback/read-only definitions
  • let custom definitions override, hide, or extend built-ins only through explicit matching IDs and override modes
  • allow explicit tab ordering and grouping so custom dashboards can appear as normal navigation entries instead of a separate custom-dashboard area
  • reject ambiguous duplicate custom IDs at startup or config reload

Definition language:

  • author dashboard definitions as TOML files, matching the rest of runtime configuration
  • parse authoring TOML into a typed internal schema; expose that canonical schema to tests, validation, and the private UI as structured data
  • make the language declarative: definitions choose datasets, controls, transforms, component kinds, fields, thresholds, states, and links
  • do not allow arbitrary JavaScript, HTML, CSS, SQL, shell commands, or plugin code in dashboard definitions
  • keep frontend rendering behind a fixed component registry, such as collector_strip, segmented_mode, kpi_strip, table, timeseries, log_table, alert_table, host_identity, recent_uploads, and query_debug_shell

Definition fields should cover:

  • dashboard ID, title, version, tags, source, and default refresh
  • route, tab order, grouping, visibility, disabled state, and override mode
  • navigation overflow behavior so narrow screens or many custom definitions keep the active tab visible and move lower-priority tabs into an overflow menu
  • variables and controls for time range, refresh interval, region, host/agent, boot ID, metric name, labels, rollup window, severity, and alert state
  • time range presets, custom absolute/rolling ranges, and a display-only metric resolution hint derived from the selected range
  • datasets such as hosts, collectors, agent ingest state, live metrics, logs, alerts, rollups, object-store health, and Prometheus-compatible metric queries
  • bounded transforms such as filter, sort, limit, group, count, sum, mean, median, max, p80, p95, p99, current value selection, freshness classification, and derived state
  • blocks with type, title, query target, units, thresholds, links, and empty, loading, stale, partial, and error states
  • mode definitions that swap related controls, KPI strips, state filters, and table columns together, as in the Fleet overview Host health vs Agent ingest status switch
  • table-column display controls that choose two statistics from a bounded set, as in the Fleet overview CPU/memory/disk/net current-vs-rollup selector
  • links into host detail, logs, alerts, ingest/storage health, live metrics, and metrics query/debug views

Expressions inside dashboard definitions should be a small typed subset for field references, arithmetic, comparisons, boolean operations, duration/unit literals, and named aggregations such as count, count_where, sum, mean, median, max, p80, p95, and p99. Anything more complex belongs in server-side code or a new registered transform.
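
One way to keep that subset closed is a fixed AST with one variant per allowed operation; the sketch below is illustrative, not the shipped grammar:

enum Expr {
    Field(String),                 // field references
    Number(f64),
    Bool(bool),
    Duration(std::time::Duration), // duration/unit literals
    Arith(Box<Expr>, ArithOp, Box<Expr>),
    Cmp(Box<Expr>, CmpOp, Box<Expr>),
    Logic(Box<Expr>, LogicOp, Box<Expr>),
    Not(Box<Expr>),
    Agg(AggFn, Box<Expr>),         // named aggregations only, nothing user-defined
}

enum ArithOp { Add, Sub, Mul, Div }
enum CmpOp { Eq, Ne, Lt, Le, Gt, Ge }
enum LogicOp { And, Or }
enum AggFn { Count, CountWhere, Sum, Mean, Median, Max, P80, P95, P99 }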

Top-level built-in tabs should map to the definition model:

  • Fleet overview: header controls, collapsible active-active collector strip, segmented modes, KPI strips, state chips, selected-stat controls, sortable mode-specific tables, and row click-through to Host detail
  • Host detail: parameterized host route with identity, metrics, logs, alerts, and recent-upload blocks
  • Live metrics: landing-buffer-backed controls, realtime chart update notices, freshness states, and latest-series table
  • Logs and alerts: filter controls plus typed log and alert tables
  • Ingest and storage health: collector, object-store, spool, dedupe, conflict, rollup, and partial-query panels
  • Metrics query/debug: a configured shell around the hard-coded query executor and result renderers

Dashboard definitions are the source of truth for the private UI. Grafana dashboard export/provisioning is not a correctness dependency for the MVP; the required Grafana integration is PromQL-compatible query serving from the Lightmetrics server. A low-priority read-only export surface may generate Grafana JSON for metric panels that map to stock Grafana Prometheus panels, but logs, alerts, and custom console-only blocks must be omitted or marked unsupported. Dashboards edited directly in Grafana are outside Lightmetrics configuration in MVP; there is no Grafana-to-Lightmetrics import path in the first version.

MVP dashboard management is file-based. A UI editor for dashboards is post-MVP. Do not accept arbitrary Grafana JSON as a first-class input format. Any Grafana-to-Lightmetrics converter or reuse of Grafana dashboard/frontend components is post-MVP and lower priority than PromQL compatibility.

Grafana Compatibility

Expose a Prometheus HTTP API compatibility target for metrics. Grafana’s built-in Prometheus data source works with systems that implement the Prometheus query API, so this is the required Grafana integration boundary for MVP.

Initial endpoints:

GET /api/v1/query
GET /api/v1/query_range
GET /api/v1/labels
GET /api/v1/label/<name>/values
GET /api/v1/series
GET /api/v1/metadata

Query language should start as a constrained PromQL/MetricsQL subset (a typed sketch follows the list):

  • instant vector selectors: metric_name and metric_name{label="value",label!="value"}
  • range selectors where required by functions: metric_name[5m]
  • exact label matchers first; regex matchers can be added only if Grafana variable workflows require them
  • rate() for counters over range selectors
  • sum, avg, min, max, and count over one expression
  • by (...) grouping for aggregations
  • histogram bucket queries using Prometheus-style _bucket, _sum, and _count virtual series
  • histogram_quantile() over bucket rollups
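
Sketched as closed Rust types, the supported query surface stays enumerable; names are illustrative, and regex matchers are deliberately absent until Grafana variable workflows require them:

enum LabelMatch {
    Eq(String, String), // label="value"
    Ne(String, String), // label!="value"
}

struct VectorSelector {
    metric_name: String,
    matchers: Vec<LabelMatch>,
    range_secs: Option<u64>, // Some(300) for metric_name[5m]
}

enum Query {
    Instant(VectorSelector),
    Rate(VectorSelector), // requires range_secs to be set
    Agg {
        op: AggOp,
        by: Vec<String>, // by (...) grouping
        expr: Box<Query>,
    },
    HistogramQuantile {
        q: f64,
        buckets: VectorSelector, // selects the _bucket virtual series
    },
}

enum AggOp { Sum, Avg, Min, Max, Count }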

Reject unsupported PromQL explicitly with Prometheus-style bad_data responses. Do not silently reinterpret:

  • binary operators and joins
  • subqueries
  • offset
  • logical/set operators: and, or, unless
  • recording-rule semantics
  • arbitrary nested functions outside the supported forms
  • Prometheus staleness semantics beyond “series exists in requested time window”

For query_range, object storage is the canonical source and the local spool/landing buffer overlays the not-yet-landed tail. The query engine must merge by batch identity and series timestamp and dedupe repeated at-least-once uploads. If object storage is unavailable, return a Prometheus-style error unless the requested range is wholly inside the local landing-buffer horizon.

The P0 direct-selector implementation evaluates query_range at start + n * step timestamps with a fixed 5-minute lookback, matching the Prometheus graph-panel shape before broader PromQL execution is available. When only the local accepted spool is configured, successful range responses include a Prometheus warnings entry that marks the result as local-spool-only partial coverage. Full object-store horizon checks remain part of the later object-store query merge work.
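
A sketch of that evaluation shape, with storage access elided behind a lookup closure; the 300-second constant is the fixed lookback described above:

const LOOKBACK_SECS: f64 = 300.0;

// Evaluate one series at start + n * step, taking the newest sample within
// the lookback window at each evaluation timestamp.
fn eval_range(
    start: f64,
    end: f64,
    step: f64,
    mut newest_within: impl FnMut(f64, f64) -> Option<f64>,
) -> Vec<(f64, f64)> {
    let mut out = Vec::new();
    if step <= 0.0 {
        return out;
    }
    let mut ts = start;
    while ts <= end {
        if let Some(value) = newest_within(ts, LOOKBACK_SECS) {
            out.push((ts, value));
        }
        ts += step;
    }
    out
}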

Compatibility target (a typed sketch follows the list):

  • Prometheus JSON response envelope: status, data, errorType, error, warnings
  • Prometheus timestamp/value pair format: [unix_seconds_float, "value"]
  • matrix responses for query_range
  • vector responses for instant query
  • standard label and series response shapes for Grafana variables
  • return Prometheus-style errors for unsupported syntax instead of silently changing semantics
  • smoke-test against a pinned stock Grafana container with a provisioned Prometheus data source and a dashboard fixture that uses variables, query, query_range, and histogram_quantile()
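
Sketched as serde types, the envelope and matrix shapes look like the following; this mirrors the documented Prometheus response format, trimmed to what the subset needs:

use serde::Serialize;
use std::collections::BTreeMap;

#[derive(Serialize)]
struct PromResponse<T> {
    status: &'static str, // "success" or "error"
    #[serde(skip_serializing_if = "Option::is_none")]
    data: Option<T>,
    #[serde(rename = "errorType", skip_serializing_if = "Option::is_none")]
    error_type: Option<&'static str>, // "bad_data" for rejected PromQL
    #[serde(skip_serializing_if = "Option::is_none")]
    error: Option<String>,
    #[serde(skip_serializing_if = "Vec::is_empty")]
    warnings: Vec<String>,
}

#[derive(Serialize)]
struct RangeData {
    #[serde(rename = "resultType")]
    result_type: &'static str, // "matrix" for query_range, "vector" for query
    result: Vec<MatrixSeries>,
}

#[derive(Serialize)]
struct MatrixSeries {
    metric: BTreeMap<String, String>,
    values: Vec<(f64, String)>, // [unix_seconds_float, "value"] pairs
}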

Do not build a custom Grafana data source plugin for v1. A plugin creates a new maintenance surface, while Prometheus-compatible endpoints can be used by stock Grafana. Grafana dashboard export/provisioning is post-MVP and must not distract from PromQL compatibility.

Logs and alerts do not need to be Grafana-native in v1. Keep the server’s own JSON/Arrow API for those. If Grafana log browsing becomes important, add a Loki-compatible query surface later rather than inventing a plugin first.

Arrow/Perspective UI is post-MVP. Keep it behind a build feature such as ui-perspective so the default server binary does not include Arrow IPC or frontend assets. The MVP private UI path is the built-in console plus config-defined dashboard definitions, backed by the Prometheus-compatible metrics API and bounded JSON endpoints. Grafana support in MVP means stock Grafana can query Lightmetrics through the Prometheus-compatible API.

Logs and Alerts API

MVP logs and alerts use bounded JSON APIs.

GET /api/v1/logs
GET /api/v1/logs/tail
GET /api/v1/alerts

Log query filters:

  • from
  • to
  • host
  • severity
  • target
  • contains
  • limit
  • cursor
  • order

Alert query filters:

  • from
  • to
  • host
  • name
  • state
  • limit
  • cursor

Alerts are ingested alert state records in MVP. The server does not evaluate alert rules, send notifications, silence alerts, or expose an Alertmanager-compatible API in the first version.

Use Server-Sent Events for live tailing in MVP. WebSocket and Arrow IPC are optional post-MVP paths.

The current log-tail SSE endpoint is GET /api/v1/logs/tail on the private query listener. It requires the query bearer token and emits bounded JSON log_tail_snapshot, log_tail_update, and log_tail_gap events. Event records are loaded from the same accepted-log storage visibility source as GET /api/v1/logs; clients use the returned opaque after cursor to repair a reconnect or lag gap with another bounded tail request scoped to the matching agent_id and boot_id. The endpoint is not a durable event journal and does not provide text search.
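
A sketch of the client-side contract those events imply; LogRecord, the event shapes, and fetch_logs are illustrative stand-ins, not the real API types:

struct LogRecord {
    unix_ns: u64,
    message: String,
}

enum TailEvent {
    Snapshot { records: Vec<LogRecord>, after: String },
    Update { records: Vec<LogRecord>, after: String },
    Gap { after: String },
}

// Stand-in for a bounded GET /api/v1/logs request scoped to one agent/boot.
fn fetch_logs(_agent_id: &str, _boot_id: &str, _after: &str, _limit: usize) -> Vec<LogRecord> {
    Vec::new()
}

fn handle(event: TailEvent, agent_id: &str, boot_id: &str) -> Vec<LogRecord> {
    match event {
        TailEvent::Snapshot { records, .. } | TailEvent::Update { records, .. } => records,
        // A gap means the bounded stream dropped events; repair it with
        // another bounded tail fetch from the last good cursor.
        TailEvent::Gap { after } => fetch_logs(agent_id, boot_id, &after, 500),
    }
}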

With ui-perspective enabled, add bounded Arrow IPC endpoints for table-oriented UI loads:

GET /api/v1/metrics.arrow
GET /api/v1/logs.arrow
GET /api/v1/alerts.arrow

Do not use Arrow for unbounded live tails. Keep live streams as SSE JSON notices and fetch bounded Arrow deltas when the Perspective UI needs table updates.

Delivery

Delivery is at-least-once.

The agent writes encoded batches to a local append-only queue before upload. A batch is removed only after the collector returns success. If the agent retries after an ambiguous failure, the collector deduplicates by:

(agent_id, boot_id, seq_start, seq_end)
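
A minimal sketch of that dedupe check; the in-memory HashSet stands in for a persistent set that must survive restarts for at least the 0-RTT replay horizon described earlier:

use std::collections::HashSet;

#[derive(Hash, PartialEq, Eq)]
struct BatchKey {
    agent_id: String,
    boot_id: String,
    seq_start: u64,
    seq_end: u64,
}

struct Dedupe {
    seen: HashSet<BatchKey>,
}

impl Dedupe {
    // Returns true if the batch is new. The check must be atomic with every
    // ingest side effect: spool append, live state, raw objects, rollup work.
    fn accept(&mut self, key: BatchKey) -> bool {
        self.seen.insert(key)
    }
}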

Security

The public ingest endpoint must not expose query or admin APIs.

First version:

  • HTTPS
  • one bearer token per host
  • source IP allow-list where stable
  • collector-side token revocation

mTLS can replace bearer tokens later if certificate distribution is worth the added operational surface.

Footprint

Agent dependencies should stay narrow:

  • no async runtime unless measurements justify it
  • no Cap’n Proto RPC
  • no Arrow or Parquet
  • no embedded query engine

Server can spend more footprint on:

  • HTTPS
  • object storage client
  • zstd
  • in-process query/UI API
  • rollup workers

The server still stays one process for the first version.