- low-cardinality telemetry for server and runtime lifecycle
- run-level observability summaries derived from heartbeat execution
- optional Langfuse traces and scores
- provenance and output lineage attached to runs and artifacts
- warehouse mirroring into ClickHouse for analytical and monitoring reads
- readiness and operator health surfaces exposed through the API and runtime-service monitoring
Source Of Truth Rules
The observation layer follows a strict authority order:

| Surface | Canonical for | Not canonical for |
|---|---|---|
| PostgreSQL via `packages/db` | first-party run state, costs, finance, evaluation, provenance, file writes, output artifacts, runtime-service health | high-volume warehouse marts, third-party trace storage |
| ClickHouse mirror | derived event streams, warehouse rollups, monitoring marts | first-party business entity truth |
| Langfuse | optional trace/span export and scoring views | control-plane run truth or audit authority |
| Server logger / local runtime logs | local event visibility and debugging | durable external telemetry by default |
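
The authority order above can be read as a lookup from a kind of fact to the one surface that is canonical for it. The following is a minimal sketch of that idea, not repository code: the `Surface` and `FactKind` types and both helper functions are hypothetical names introduced for illustration, and the fact kinds are a small sample drawn from the table.

```typescript
// Hypothetical names throughout; this mirrors the table above, not real repository code.
type Surface = "postgres" | "clickhouse" | "langfuse" | "local_logs";

type FactKind =
  | "run_state"
  | "cost"
  | "provenance"
  | "warehouse_rollup"
  | "trace_export"
  | "debug_event";

// Each kind of fact has exactly one canonical surface.
const CANONICAL_SURFACE: Record<FactKind, Surface> = {
  run_state: "postgres",
  cost: "postgres",
  provenance: "postgres",
  warehouse_rollup: "clickhouse",
  trace_export: "langfuse",
  debug_event: "local_logs",
};

function canonicalSurfaceFor(kind: FactKind): Surface {
  return CANONICAL_SURFACE[kind];
}

// True when a reading taken from `surface` may be treated as authoritative for `kind`.
function isAuthoritative(surface: Surface, kind: FactKind): boolean {
  return canonicalSurfaceFor(kind) === surface;
}
```

With this shape, `isAuthoritative("clickhouse", "cost")` is false: a cost figure read from the mirror is useful for analysis, but the Postgres record wins on disagreement.
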
Observation Families
1. Telemetry
Primary implementation: `server/src/services/telemetry.ts`
- emits low-cardinality server and runtime telemetry through the server logger
- installs the DB hardening telemetry sink
- provides lightweight signal points without pretending to be a full metrics platform
- useful for local diagnosis and runtime identity
- incomplete unless read alongside runtime mode, DB target, and company dataset identity
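
To illustrate why a telemetry line is incomplete without runtime identity, the sketch below attaches runtime mode, DB target, and company dataset to every record before it reaches the logger. It assumes a hypothetical record shape and event names; none of these identifiers are taken from `telemetry.ts`.

```typescript
// Hypothetical shapes and event names; the real telemetry service may differ.
type TelemetryEvent = "server_started" | "runtime_attached" | "runtime_detached";

interface RuntimeIdentity {
  runtimeMode: string;    // which mode this process runs in
  dbTarget: string;       // which database this process points at
  companyDataset: string; // which company dataset is loaded
}

interface TelemetryRecord extends RuntimeIdentity {
  event: TelemetryEvent;
  at: string; // ISO-8601 timestamp
}

// Stand-in for the server logger: anything that accepts one line of text.
type LogSink = (line: string) => void;

function emitTelemetry(
  sink: LogSink,
  event: TelemetryEvent,
  identity: RuntimeIdentity,
  now: Date = new Date(),
): TelemetryRecord {
  // The event name comes from a small fixed union, which keeps cardinality low.
  const record: TelemetryRecord = { event, at: now.toISOString(), ...identity };
  sink(JSON.stringify(record));
  return record;
}
```

Because the identity fields travel with every line, a reader of local logs can tell which runtime and dataset produced an event without consulting other state.
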
2. Run Observability
Primary implementation: `server/src/services/heartbeat-observability.ts`
- derives work class, evidence, validation state, promotion state, and TQC-style execution fields from heartbeat runs
- creates the summary layer that operators actually consume when judging execution health
- this is the semantic observation layer for runs
- these summaries are only as strong as the runtime evidence written during execution
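
The dependence on runtime evidence can be sketched as a derivation that reports "unknown" rather than implying success when evidence was never written. The field names and state values below are hypothetical and much simpler than whatever `heartbeat-observability.ts` derives.

```typescript
// Hypothetical field names and values; the real summary shape may differ.
type ValidationState = "validated" | "failed" | "unknown";

interface RunEvidence {
  validationPassed?: boolean; // left undefined when the runtime wrote no validation evidence
  artifactCount: number;
}

interface RunSummary {
  validationState: ValidationState;
  artifactCount: number;
  evidenceWritten: boolean;
}

function summarizeRun(evidence: RunEvidence | undefined): RunSummary {
  // No evidence record at all: the run cannot be judged, so do not imply success.
  if (evidence === undefined) {
    return { validationState: "unknown", artifactCount: 0, evidenceWritten: false };
  }
  let validationState: ValidationState = "unknown";
  if (evidence.validationPassed === true) validationState = "validated";
  if (evidence.validationPassed === false) validationState = "failed";
  return { validationState, artifactCount: evidence.artifactCount, evidenceWritten: true };
}
```

The design choice worth noting is the third state: collapsing "unknown" into "validated" or "failed" would hide exactly the gap this section warns about.
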
3. Traces
Primary implementation: `server/src/services/langfuse-tracing.ts`
- exports spans and scores to Langfuse when Langfuse is configured
- links evaluation flows and execution traces to a trace ID where supported
- traces exist only when config enables them
- trace code presence is not proof that a trace exists in the current runtime
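
A small gate check makes the last two points concrete: whether traces can exist is a property of configuration, not of the code being present. The sketch below assumes the gate is a pair of Langfuse credentials named `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY`; which variables this project actually reads is an assumption, and `resolveTraceStatus` is a hypothetical helper.

```typescript
// Assumed configuration shape; the variable names this project reads may differ.
interface TracingEnv {
  LANGFUSE_PUBLIC_KEY?: string;
  LANGFUSE_SECRET_KEY?: string;
}

type TraceStatus =
  | { enabled: true }
  | { enabled: false; reason: string };

function resolveTraceStatus(env: TracingEnv): TraceStatus {
  // Collect every required variable that is unset or empty.
  const missing = (["LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"] as const).filter(
    (key) => !env[key],
  );
  if (missing.length > 0) {
    return { enabled: false, reason: `missing ${missing.join(", ")}` };
  }
  return { enabled: true };
}
```

Returning a reason alongside the disabled state lets an operator distinguish "tracing is off by configuration" from "tracing is on but produced nothing".
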
4. Provenance
Primary implementation: `server/src/services/run-provenance.ts`, `packages/db/src/extract-provenance-ledger.ts`
- records output artifacts and observed file writes
- persists lineage in `run_file_writes` and `run_output_artifacts`
- supports extracted provenance ledgers and reconciliation work
- provenance is part of auditability, not just debugging
- evidence class matters: `authoritative_native` is stronger than `derived_from_runtime_telemetry`
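
The evidence-class ordering can be sketched as a ranking used during reconciliation. Only the two class names and their relative order come from the text above; the numeric weights, the `ProvenanceRecord` shape, and `strongestPerPath` are hypothetical, and the real ledger may define further classes.

```typescript
// Only the two class names and their relative order come from the document.
type EvidenceClass = "authoritative_native" | "derived_from_runtime_telemetry";

// Higher number means stronger evidence. The values themselves are arbitrary.
const EVIDENCE_STRENGTH: Record<EvidenceClass, number> = {
  authoritative_native: 2,
  derived_from_runtime_telemetry: 1,
};

// Hypothetical record shape for an observed file write.
interface ProvenanceRecord {
  path: string;
  evidenceClass: EvidenceClass;
}

// When one file has several records, keep the one with the strongest evidence class.
function strongestPerPath(records: ProvenanceRecord[]): Map<string, ProvenanceRecord> {
  const best = new Map<string, ProvenanceRecord>();
  for (const record of records) {
    const current = best.get(record.path);
    if (
      current === undefined ||
      EVIDENCE_STRENGTH[record.evidenceClass] > EVIDENCE_STRENGTH[current.evidenceClass]
    ) {
      best.set(record.path, record);
    }
  }
  return best;
}
```
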
5. Warehouse Mirroring
Primary implementation: `server/src/services/clickhouse.ts`, `server/src/services/intelligence-monitor.ts`
- mirrors selected first-party events into ClickHouse
- currently covers `heartbeat_run_events`, `cost_events`, `finance_events`, and `performance_ledger`
- powers warehouse reads, freshness checks, and monitoring summaries
- ClickHouse is a derived analytical surface
- absence or staleness here does not erase the canonical Postgres record, but it does degrade operator visibility
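
A freshness check of the kind mentioned above can be sketched as a comparison between the newest canonical event and the newest mirrored event. The function name, the three status values, and the lag threshold are assumptions for illustration; the actual check in `intelligence-monitor.ts` is not shown in this document.

```typescript
// Hypothetical status values and threshold; the real freshness check may differ.
type MirrorStatus = "fresh" | "stale" | "absent";

function classifyMirror(
  latestPostgresEvent: Date,
  latestMirroredEvent: Date | null, // null when the mirror holds no rows or is not configured
  maxLagMs: number,
): MirrorStatus {
  if (latestMirroredEvent === null) return "absent";
  // Lag is how far the mirror trails the canonical Postgres record.
  const lagMs = latestPostgresEvent.getTime() - latestMirroredEvent.getTime();
  return lagMs > maxLagMs ? "stale" : "fresh";
}
```

Note that every branch leaves the Postgres record untouched: "stale" and "absent" describe degraded visibility, not lost data.
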
6. Health And Readiness
Primary implementation: `server/src/services/health-checks.ts`, `server/src/routes/health.ts`
- checks DB connectivity, migration visibility, and company presence at startup
- exposes `/api/health` and `/api/health/migration-status`
- uses canonical migration inspection rather than raw probing of the Drizzle journal table
- these routes are readiness surfaces
- they do not prove correct DB identity, correct tenant selection, or complete observation coverage
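
The three startup checks named above can be sketched as a pure readiness evaluation. The `StartupChecks` and `Readiness` shapes are hypothetical and are not the response contract of `/api/health`; the point is only that readiness aggregates a narrow set of checks and says nothing about DB identity or tenant selection.

```typescript
// Hypothetical shapes; not the actual /api/health response contract.
interface StartupChecks {
  dbConnected: boolean;
  migrationsVisible: boolean;
  companyPresent: boolean;
}

interface Readiness {
  ready: boolean;
  failing: string[]; // names of the checks that did not pass
}

function evaluateReadiness(checks: StartupChecks): Readiness {
  const names = Object.keys(checks) as (keyof StartupChecks)[];
  const failing = names.filter((name) => !checks[name]);
  // Ready means only that these three checks passed, nothing more.
  return { ready: failing.length === 0, failing };
}
```
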
End-To-End Signal Flow
The observation path for a heartbeat run has six steps. Read this flow in order:
- a runtime executes work and emits API-visible evidence
- the API persists first-party run, cost, finance, evaluation, and provenance state into Postgres
- run observability derives operator-facing execution classifications from those facts
- selected events mirror into ClickHouse for analytical and monitoring reads
- traces export to Langfuse when configured
- health routes and runtime-service records expose readiness and freshness signals to operators
Operator Surfaces
Primary operator-facing surfaces today:
- `/api/health`
- `/api/health/migration-status`
- runtime-service health persisted in workspace runtime-service tables
- sidebar badges for failures, approvals, and queue pressure
- Tremor operating wiki pages such as `/companies/tremor/wiki/observability`
- intelligence overlay tools when enabled, including Langfuse and ClickHouse-backed monitors

Each surface answers a different question:
- `/api/health` answers “is this runtime up enough to respond?”
- `/api/health/migration-status` answers “what does canonical migration inspection think?”
- warehouse tools answer “what does the mirrored analytical surface show?”
- provenance answers “what evidence exists for this run or output?”
Configuration Gates
Observation coverage is intentionally conditional in several places.

| Surface | Gate | Effect when absent |
|---|---|---|
| Langfuse traces | Langfuse configuration | spans and scores are not exported |
| ClickHouse mirror | ClickHouse configuration and freshness | analytical mirror is absent or degraded |
| External telemetry sinks | environment-specific logging/export wiring | logs stay local to runtime logger output |
| Intelligence overlay tools | overlay services running and reachable | vendor dashboards are unavailable even if Postgres state exists |
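
The gate table can be turned into a coverage report that states, for a given configuration, which surfaces are missing and what that costs. The `ObservationConfig` flags and `coverageGaps` function below are hypothetical; the surface names and effects are copied from the table.

```typescript
// Hypothetical flags; each one stands for a gate in the table above.
interface ObservationConfig {
  langfuseConfigured: boolean;
  clickhouseConfigured: boolean;
  externalSinkConfigured: boolean;
  overlayReachable: boolean;
}

interface GateRule {
  gate: keyof ObservationConfig;
  surface: string;
  effectWhenAbsent: string;
}

const GATE_RULES: GateRule[] = [
  { gate: "langfuseConfigured", surface: "Langfuse traces", effectWhenAbsent: "spans and scores are not exported" },
  { gate: "clickhouseConfigured", surface: "ClickHouse mirror", effectWhenAbsent: "analytical mirror is absent or degraded" },
  { gate: "externalSinkConfigured", surface: "External telemetry sinks", effectWhenAbsent: "logs stay local to runtime logger output" },
  { gate: "overlayReachable", surface: "Intelligence overlay tools", effectWhenAbsent: "vendor dashboards are unavailable even if Postgres state exists" },
];

// Lists every surface whose gate is closed, with the documented effect.
function coverageGaps(config: ObservationConfig): string[] {
  return GATE_RULES.filter((rule) => !config[rule.gate]).map(
    (rule) => `${rule.surface}: ${rule.effectWhenAbsent}`,
  );
}
```

Surfacing this list at startup would be one way to make reduced visibility explicit instead of silent.
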
Failure Interpretation Rules
Use these rules before escalating:
- If `/api/health` is green but the dataset looks wrong, verify DB identity and company context before debugging features.
- If ClickHouse is stale, treat warehouse views as degraded but verify whether Postgres still holds the canonical run facts.
- If Langfuse is empty, check config before assuming trace-generation code failed.
- If run classifications look incomplete, inspect whether the runtime actually wrote the expected evidence.
- If provenance is missing, distinguish “nothing was recorded” from “recording path was disabled or bypassed.”
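
The rules above can be sketched as a triage function that maps observed symptoms to the first check an operator should run. The `Observed` flags and the `firstChecks` name are hypothetical; the returned strings paraphrase the rules and are not output from any existing tool.

```typescript
// Hypothetical symptom flags; each corresponds to one rule above.
interface Observed {
  healthGreen: boolean;
  datasetLooksWrong: boolean;
  clickhouseStale: boolean;
  langfuseEmpty: boolean;
  classificationsIncomplete: boolean;
  provenanceMissing: boolean;
}

// Returns the checks to run, in the same order as the rules above.
function firstChecks(observed: Observed): string[] {
  const checks: string[] = [];
  if (observed.healthGreen && observed.datasetLooksWrong) {
    checks.push("verify DB identity and company context");
  }
  if (observed.clickhouseStale) {
    checks.push("treat warehouse views as degraded and confirm Postgres holds the run facts");
  }
  if (observed.langfuseEmpty) {
    checks.push("check Langfuse configuration");
  }
  if (observed.classificationsIncomplete) {
    checks.push("inspect whether the runtime wrote the expected evidence");
  }
  if (observed.provenanceMissing) {
    checks.push("distinguish nothing recorded from recording path disabled or bypassed");
  }
  return checks;
}
```

An empty result means none of the listed symptoms apply, which is the point at which escalation becomes reasonable.
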
Current Blind Spots
The current architecture still has known limitations:
- low-cardinality telemetry defaults to local logger output rather than guaranteed external delivery
- Langfuse and ClickHouse remain configuration-conditional, so absence silently reduces visibility
- readiness health is narrower than semantic correctness
- run-derived summaries can lag or under-express failures when the runtime never wrote the expected evidence
Companion Docs
Use these pages alongside this one:

| Need | Start here |
|---|---|
| Whole-workspace system map | Architecture |
| Entity ownership and storage boundary | Data Model |
| Health endpoints and route contract | Health API |
| Intelligence overlay services and ingress | Runtime Services |
| Live Tremor operator view | Observability Wiki |