This page is the canonical observation architecture for the Paperclip control plane. Use it when you need to understand how runtime events become logs, traces, provenance, warehouse facts, and operator-visible health signals. The observation layer is not one backend. It is a bounded set of signal families with different durability, ownership, and failure modes:
  • low-cardinality telemetry for server and runtime lifecycle
  • run-level observability summaries derived from heartbeat execution
  • optional Langfuse traces and scores
  • provenance and output lineage attached to runs and artifacts
  • warehouse mirroring into ClickHouse for analytical and monitoring reads
  • readiness and operator health surfaces exposed through the API and runtime-service monitoring

Source Of Truth Rules

The observation layer follows a strict authority order:
| Surface | Canonical for | Not canonical for |
| --- | --- | --- |
| PostgreSQL via packages/db | first-party run state, costs, finance, evaluation, provenance, file writes, output artifacts, runtime-service health | high-volume warehouse marts, third-party trace storage |
| ClickHouse mirror | derived event streams, warehouse rollups, monitoring marts | first-party business entity truth |
| Langfuse | optional trace/span export and scoring views | control-plane run truth or audit authority |
| Server logger / local runtime logs | local event visibility and debugging | durable external telemetry by default |
The rule is simple: Postgres owns the control-plane facts, ClickHouse mirrors selected analytical facts, and Langfuse is an optional trace sink rather than the system of record.
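The authority order can be made mechanical. The sketch below is illustrative, not real control-plane code; the surface names mirror the table above, and `resolveAuthority` is a hypothetical helper:

```typescript
// Hypothetical sketch: when two observation surfaces disagree, the reading
// from the more authoritative surface wins. Postgres always outranks mirrors.
const AUTHORITY_ORDER = ["postgres", "clickhouse", "langfuse", "local-logs"] as const;
type Surface = (typeof AUTHORITY_ORDER)[number];

// Return the most authoritative surface among those that disagree.
function resolveAuthority(disagreeing: Surface[]): Surface | undefined {
  return AUTHORITY_ORDER.find((s) => disagreeing.includes(s));
}
```

For example, if the ClickHouse mirror and Postgres report different run counts, the Postgres count is the fact and the mirror is stale.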

Observation Families

1. Telemetry

Primary implementation:
  • server/src/services/telemetry.ts
What it does:
  • emits low-cardinality server and runtime telemetry through the server logger
  • installs the DB hardening telemetry sink
  • provides lightweight signal points without pretending to be a full metrics platform
Operational meaning:
  • useful for local diagnosis and runtime identity
  • incomplete unless read alongside runtime mode, DB target, and company dataset identity
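A minimal sketch of what "low-cardinality" means in practice: event names and context fields come from bounded sets, never free-form values. The types and function below are hypothetical, not the `telemetry.ts` API:

```typescript
// Hypothetical sketch: telemetry points carry only bounded-cardinality fields
// (event name, runtime mode, DB target) and route through the server logger.
type TelemetryEvent = {
  name: string;                      // drawn from a bounded set of event names
  runtimeMode: "local" | "hosted";   // runtime identity context
  dbTarget: string;                  // which DB the server believes it targets
};

// Emit one telemetry line through an injected logger; returns the line emitted.
function emitTelemetry(log: (line: string) => void, ev: TelemetryEvent): string {
  const line = `telemetry name=${ev.name} mode=${ev.runtimeMode} db=${ev.dbTarget}`;
  log(line);
  return line;
}
```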

2. Run Observability

Primary implementation:
  • server/src/services/heartbeat-observability.ts
What it does:
  • derives work class, evidence, validation state, promotion state, and TQC-style execution fields from heartbeat runs
  • creates the summary layer that operators actually consume when judging execution health
Operational meaning:
  • this is the semantic observation layer for runs
  • these summaries are only as strong as the runtime evidence written during execution
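The dependency on runtime evidence can be shown with a toy classifier. This is a hypothetical sketch, not the real `heartbeat-observability.ts` derivation; the field and state names are illustrative:

```typescript
// Hypothetical sketch: a run summary derived purely from persisted evidence.
// If the runtime wrote no evidence, the summary cannot claim health.
type RunEvidence = {
  validated: boolean;   // did the run pass validation?
  evidenceRows: number; // rows of runtime evidence persisted for this run
};

type RunSummary = "healthy" | "unvalidated" | "no_evidence";

function classifyRun(e: RunEvidence): RunSummary {
  if (e.evidenceRows === 0) return "no_evidence"; // summary is only as strong as the evidence
  if (!e.validated) return "unvalidated";
  return "healthy";
}
```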

3. Traces

Primary implementation:
  • server/src/services/langfuse-tracing.ts
What it does:
  • exports spans and scores to Langfuse when Langfuse is configured
  • links evaluation flows and execution traces to a trace ID where supported
Operational meaning:
  • traces exist only when config enables them
  • trace code presence is not proof that a trace exists in the current runtime

4. Provenance

Primary implementation:
  • server/src/services/run-provenance.ts
  • packages/db/src/extract-provenance-ledger.ts
What it does:
  • records output artifacts and observed file writes
  • persists lineage in run_file_writes and run_output_artifacts
  • supports extracted provenance ledgers and reconciliation work
Operational meaning:
  • provenance is part of auditability, not just debugging
  • evidence class matters: authoritative_native is stronger than derived_from_runtime_telemetry
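The evidence-class ordering can be encoded so reconciliation prefers stronger lineage. The two class names come from this page; the ranking helper is a hypothetical sketch:

```typescript
// Hypothetical sketch: rank evidence classes so reconciliation can pick the
// strongest lineage observed for an artifact.
type EvidenceClass = "authoritative_native" | "derived_from_runtime_telemetry";

const EVIDENCE_STRENGTH: Record<EvidenceClass, number> = {
  authoritative_native: 2,
  derived_from_runtime_telemetry: 1,
};

// Return the strongest evidence class present, or undefined if none exists.
function strongest(classes: EvidenceClass[]): EvidenceClass | undefined {
  return [...classes].sort((a, b) => EVIDENCE_STRENGTH[b] - EVIDENCE_STRENGTH[a])[0];
}
```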

5. Warehouse Mirroring

Primary implementation:
  • server/src/services/clickhouse.ts
  • server/src/services/intelligence-monitor.ts
What it does:
  • mirrors selected first-party events into ClickHouse
  • currently covers heartbeat_run_events, cost_events, finance_events, and performance_ledger
  • powers warehouse reads, freshness checks, and monitoring summaries
Operational meaning:
  • ClickHouse is a derived analytical surface
  • absence or staleness here does not erase the canonical Postgres record, but it does degrade operator visibility
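The "derived, failure-tolerant" relationship can be sketched in one function. This is illustrative, not the real `clickhouse.ts` write path:

```typescript
// Hypothetical sketch: mirroring is one-way and failure-tolerant. The Postgres
// write is the canonical step; a failed ClickHouse insert only degrades
// operator visibility, it never erases the record.
type MirrorResult = { canonical: true; mirrored: boolean };

function persistAndMirror(
  writePostgres: () => void,    // canonical write; a failure here should propagate
  insertClickhouse: () => void, // derived mirror; a failure here is tolerated
): MirrorResult {
  writePostgres();
  try {
    insertClickhouse();
    return { canonical: true, mirrored: true };
  } catch {
    return { canonical: true, mirrored: false };
  }
}
```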

6. Health And Readiness

Primary implementation:
  • server/src/services/health-checks.ts
  • server/src/routes/health.ts
What it does:
  • checks DB connectivity, migration visibility, and company presence at startup
  • exposes /api/health and /api/health/migration-status
  • uses canonical migration inspection rather than raw probing of the Drizzle journal table
Operational meaning:
  • these routes are readiness surfaces
  • they do not prove correct DB identity, correct tenant selection, or complete observation coverage
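A readiness surface is just an aggregation of cheap checks, which is why it cannot prove semantic correctness. A minimal sketch, with hypothetical check names:

```typescript
// Hypothetical sketch: readiness aggregates boolean checks (DB connectivity,
// migration visibility, company presence) but says nothing about whether the
// DB identity or tenant selection is the intended one.
type Check = { name: string; ok: boolean };

function readiness(checks: Check[]): { status: "ok" | "fail"; failing: string[] } {
  const failing = checks.filter((c) => !c.ok).map((c) => c.name);
  return { status: failing.length === 0 ? "ok" : "fail", failing };
}
```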

End-To-End Signal Flow

The observation path for a heartbeat run reads in this order:
  1. a runtime executes work and emits API-visible evidence
  2. the API persists first-party run, cost, finance, evaluation, and provenance state into Postgres
  3. run observability derives operator-facing execution classifications from those facts
  4. selected events mirror into ClickHouse for analytical and monitoring reads
  5. traces export to Langfuse when configured
  6. health routes and runtime-service records expose readiness and freshness signals to operators
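The six steps above can be sketched as an ordered stage list, with the configuration-conditional stages marked. This is an illustrative model, not runtime code:

```typescript
// Hypothetical sketch: the signal flow as ordered stages. Conditional stages
// may legitimately be absent on a healthy server without breaking canonical truth.
type Stage = { name: string; conditional: boolean };

const SIGNAL_FLOW: Stage[] = [
  { name: "runtime emits API-visible evidence", conditional: false },
  { name: "postgres persists first-party state", conditional: false },
  { name: "run observability derives classifications", conditional: false },
  { name: "clickhouse mirror", conditional: true },
  { name: "langfuse trace export", conditional: true },
  { name: "health and freshness surfaces", conditional: false },
];

// Stages whose absence reduces visibility but not correctness.
const optionalStages = SIGNAL_FLOW.filter((s) => s.conditional).map((s) => s.name);
```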

Operator Surfaces

Primary operator-facing surfaces today:
  • /api/health
  • /api/health/migration-status
  • runtime-service health persisted in workspace runtime-service tables
  • sidebar badges for failures, approvals, and queue pressure
  • Tremor operating wiki pages such as /companies/tremor/wiki/observability
  • intelligence overlay tools when enabled, including Langfuse and ClickHouse-backed monitors
Use them with the right expectations:
  • /api/health answers “is this runtime up enough to respond?”
  • /api/health/migration-status answers “what does canonical migration inspection think?”
  • warehouse tools answer “what does the mirrored analytical surface show?”
  • provenance answers “what evidence exists for this run or output?”
None of these alone answer “is the whole system semantically correct?”

Configuration Gates

Observation coverage is intentionally conditional in several places.
| Surface | Gate | Effect when absent |
| --- | --- | --- |
| Langfuse traces | Langfuse configuration | spans and scores are not exported |
| ClickHouse mirror | ClickHouse configuration and freshness | analytical mirror is absent or degraded |
| External telemetry sinks | environment-specific logging/export wiring | logs stay local to runtime logger output |
| Intelligence overlay tools | overlay services running and reachable | vendor dashboards are unavailable even if Postgres state exists |
This means a healthy server can still have an incomplete observation surface.
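One way to reason about coverage is to compute the active surfaces from the gates. The gate and surface names below are illustrative, not real config keys:

```typescript
// Hypothetical sketch: given the configuration gates, list which observation
// surfaces are actually active. A healthy server may run with only the
// always-present pair.
type Gates = { langfuse: boolean; clickhouse: boolean; externalSinks: boolean; overlay: boolean };

function activeSurfaces(g: Gates): string[] {
  const out = ["postgres", "local-logs"]; // always present
  if (g.langfuse) out.push("langfuse");
  if (g.clickhouse) out.push("clickhouse");
  if (g.externalSinks) out.push("external-telemetry");
  if (g.overlay) out.push("overlay-tools");
  return out;
}
```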

Failure Interpretation Rules

Use these rules before escalating:
  • If /api/health is green but the dataset looks wrong, verify DB identity and company context before debugging features.
  • If ClickHouse is stale, treat warehouse views as degraded but verify whether Postgres still holds the canonical run facts.
  • If Langfuse is empty, check config before assuming trace-generation code failed.
  • If run classifications look incomplete, inspect whether the runtime actually wrote the expected evidence.
  • If provenance is missing, distinguish “nothing was recorded” from “recording path was disabled or bypassed.”
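The triage rules above can be kept as a symptom-to-first-check table so the first diagnostic step is mechanical. The symptom keys are illustrative labels:

```typescript
// Hypothetical sketch: map each observed symptom to its first diagnostic step,
// mirroring the escalation rules above.
const FIRST_CHECK: Record<string, string> = {
  wrong_dataset: "verify DB identity and company context",
  stale_clickhouse: "confirm Postgres still holds the canonical run facts",
  empty_langfuse: "check Langfuse configuration",
  incomplete_classifications: "inspect whether the runtime wrote the expected evidence",
  missing_provenance: "distinguish nothing-recorded from recording-path disabled or bypassed",
};
```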

Current Blind Spots

The current architecture still has known limitations:
  • low-cardinality telemetry defaults to local logger output rather than guaranteed external delivery
  • Langfuse and ClickHouse remain configuration-conditional, so absence silently reduces visibility
  • readiness health is narrower than semantic correctness
  • run-derived summaries can lag or under-express failures when the runtime never wrote the expected evidence

Companion Docs

Use these pages adjacent to this one:
| Need | Start here |
| --- | --- |
| Whole-workspace system map | Architecture |
| Entity ownership and storage boundary | Data Model |
| Health endpoints and route contract | Health API |
| Intelligence overlay services and ingress | Runtime Services |
| Live Tremor operator view | Observability Wiki |