Datadog is remarkable. It is also ten thousand dollars a month for a team the size of ours. New Relic, Splunk, Grafana Cloud — same story, different variation. Meanwhile the open-source stack (Prometheus, Loki, Tempo, Grafana, Alertmanager) is free, but it's five systems that don't talk to each other unless you stitch them by hand. We wanted observability with the coherence of Datadog and the economics of open source, plus the one thing neither gives you: a data plane the rest of the Renkara fleet can plug into directly. That product is Chronicle.

Three Pillars, One Platform

Metrics, logs, and traces live in the same system, correlated by trace_id from the entry span forward. Click a metric spike, see the logs from that window. Click a log line, open the trace it belongs to. Click a span, see the deploy marker that preceded it. Every signal carries service.name, deployment.environment, and service.version — the three OTel resource attributes that actually matter — so you can filter, group, or diff by any dimension without the vendor-specific tagging gymnastics you get elsewhere. Alerts fire on metric expressions. SLOs evaluate over multi-window burn-rate. Incidents link back to every alert that fed them. The three pillars aren't three products duct-taped together; they're one schema.
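To make the multi-window burn-rate idea concrete, here is a minimal sketch. It follows the formulation popularized by the Google SRE Workbook rather than Chronicle's actual evaluator; the `Window` type, the function names, and the 14.4 threshold (a commonly cited value for a 1-hour/5-minute pairing) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if window.total == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (window.errors / window.total) / error_budget


def should_alert(long: Window, short: Window, slo_target: float,
                 threshold: float = 14.4) -> bool:
    """Fire only when both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window shows it
    is still happening, so the alert stops firing soon after recovery.
    """
    return (
        burn_rate(long, slo_target) >= threshold
        and burn_rate(short, slo_target) >= threshold
    )
```

Requiring both windows is the design choice that matters: a long window alone keeps paging after the problem is fixed, and a short window alone pages on brief blips.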

OpenTelemetry-Native

Chronicle accepts OTLP-HTTP JSON on the standard OTLP paths, mounted under /api/v1/otlp: /api/v1/otlp/v1/metrics, /api/v1/otlp/v1/logs, and /api/v1/otlp/v1/traces. There is no vendor-specific agent. Drop in any OTel SDK in any language, point it at the endpoint, and it works. We also ship a Python shim, chronicle-sdk, that wraps OTel with a two-line wire-up:

from chronicle_sdk import init_chronicle
init_chronicle(service_name="my-service")

The shim auto-instruments FastAPI, httpx, SQLAlchemy, asyncpg, and redis — whatever's installed. Token comes from CHRONICLE_TOKEN, endpoint from CHRONICLE_URL. In dev, set CHRONICLE_ENABLED=false and the shim is a no-op. In tests, the SDK runs in pass-through mode if no token is set. It's the kind of ergonomic surface you can adopt fleet-wide in an afternoon — and we did.
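For services that can't use the shim, the ingest path can also be reached with nothing but the standard library. The sketch below builds an OTLP/JSON metrics body with one gauge data point and a POST request for the metrics path named above. The helper names, the `Bearer` authorization scheme, and the localhost fallback URL are assumptions for illustration, not documented Chronicle behavior.

```python
import json
import os
import time
import urllib.request


def build_gauge_payload(service_name: str, metric_name: str, value: float) -> dict:
    """Build an OTLP/JSON metrics body holding a single gauge data point."""
    return {
        "resourceMetrics": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}},
                ]
            },
            "scopeMetrics": [{
                "scope": {"name": "manual-example"},
                "metrics": [{
                    "name": metric_name,
                    "gauge": {
                        "dataPoints": [{
                            "asDouble": value,
                            # OTLP/JSON encodes 64-bit integers as strings.
                            "timeUnixNano": str(time.time_ns()),
                        }]
                    },
                }],
            }],
        }]
    }


def build_request(payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the metrics ingest path."""
    # Assumption: fall back to a local instance when CHRONICLE_URL is unset.
    base_url = os.environ.get("CHRONICLE_URL", "http://localhost:8000")
    token = os.environ.get("CHRONICLE_TOKEN", "")
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the token is sent as a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/otlp/v1/metrics",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Sending it is then a single call: `urllib.request.urlopen(build_request(build_gauge_payload("my-service", "queue.depth", 12.0)))`.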

AI Curation, Not AI Spam

Log volume is a tax. Most platforms charge by ingested gigabyte, so the obvious "keep everything" answer becomes "keep everything you can afford." Chronicle scores every log line 0.0 to 1.0 for "interestingness" using Mercury 2. Sub-threshold lines are sampled down; above-threshold lines retain full fidelity. Similar lines get a shared pattern_id, so the top-500 patterns per service become a queryable library: you stop scrolling and start filtering. When an alert fires, Sonnet 4.6 reads the correlated traces, logs, recent deploys, and similar past incidents and drafts a root-cause hypothesis before the on-call engineer reads the page. The draft isn't authoritative. It's a starting point. On-call reads it, agrees or argues, and gets to mitigation in a few minutes instead of twenty.
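The two mechanical pieces of that pipeline, grouping similar lines and sampling the dull ones, can be sketched in a few lines. This is an illustration under stated assumptions, not Chronicle's implementation: the score is taken as an input (the model call is out of scope), the masking regex, the 0.5 threshold, and the 10% sample rate are all made up, and sampling is done by hashing so the same line always gets the same decision.

```python
import hashlib
import re

# Mask variable-looking tokens (long hex strings, numbers) so that lines
# differing only in IDs or counts collapse to the same template.
_VARIABLE = re.compile(r"\b[0-9a-f]{8,}\b|\d+")


def pattern_id(line: str) -> str:
    """Return a short, stable identifier for the line's template."""
    template = _VARIABLE.sub("<*>", line)
    return hashlib.sha1(template.encode("utf-8")).hexdigest()[:12]


def keep_line(line: str, score: float, threshold: float = 0.5,
              sample_rate: float = 0.1) -> bool:
    """Keep every interesting line; keep a fixed fraction of the rest."""
    if score >= threshold:
        return True
    # Map the line to a bucket in [0, 1) and keep it if the bucket falls
    # under the sample rate. Hash-based, so the decision is repeatable.
    digest = hashlib.sha1(line.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

With this masking, "user 123 logged in" and "user 98765 logged in" share a pattern_id, while "user 123 logged out" gets a different one.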

Fleet Integration

Vigil was already our service-health dashboard. Chronicle becomes its data source — Vigil is now a presentation layer for Chronicle's probe data. Docket auto-files cards for recurring error signatures, with the alert linking to the filed card. Codex runbooks render inline on alerts by runbook_url annotation, so the on-call doesn't tab-switch during a fire. Cadence owns the on-call rotation — paging looks up the current shift. Tribe owns the preferred contact method per engineer, so SMS, email, or push is honored. Courier drafts customer-facing incident comms from templates. Herald publishes the public status-page digest. Slate lands postmortem action items in the owner's daily queue. Narrative turns published postmortems into a blog post draft. Beacon correlates funnel drops to backend error spikes. Pulse stitches browser session replay to backend traces via propagated session_id. Envoy answers "is this customer hitting errors in production" inline on the deal card. Trellis attributes per-service cost from metrics + log volume to journal entries via cost_center. Meridian logs incident response as billable or non-billable time. Fulcrum turns MTTR, alert fatigue, and change-failure-rate into leverage records. Notification-service fans out alert channels per user prefs.

That's not a coincidence. Chronicle was designed to be the data plane the other tools hang off, not a standalone monitoring product. Your Datadog account doesn't know about your CRM. Ours does.

MCP-Native

Every Chronicle query, dashboard, alert, incident, SLO, and runbook is exposed as an MCP tool. A Claude agent can triage a firing alert, query the correlated logs, read the relevant runbook, draft an RCA, open an incident, and assign it — autonomously, on your behalf, with you in the loop. The agents don't need a separate "Chronicle AI product." They just call the tools.
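On the wire, "calling the tools" means sending MCP `tools/call` requests, which are JSON-RPC 2.0 messages. The sketch below serializes two such requests. The tool names and arguments are hypothetical placeholders, since the actual Chronicle tool catalog isn't listed here.

```python
import itertools
import json

_request_ids = itertools.count(1)


def tool_call(name: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Hypothetical tool names and arguments, for illustration only.
triage_steps = [
    tool_call("chronicle_query_logs", {"trace_id": "abc123", "limit": 50}),
    tool_call("chronicle_open_incident", {"title": "Checkout latency", "severity": "sev2"}),
]
```

In practice an MCP client library handles this framing; the point is that an agent needs no Chronicle-specific protocol beyond the tool names and their argument schemas.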

The Stack

FastAPI, SQLAlchemy 2.0 async, PostgreSQL for metadata and hot-tier metrics (TimescaleDB extension in prod, plain Postgres in the small-volume deployment we're on today), ClickHouse for logs and traces at scale. Valkey for the Celery broker and alert state. S3 for the long-term cold tier. OTLP-HTTP receivers for the ingest path. Prometheus remote_write, StatsD, Fluent Bit, and syslog adapters let legacy sources keep working. React 19 + Vite frontend on the AVIAN design system. Celery workers handle alert evaluation, rollups, and anomaly detection. WebSockets fan out live alert state. Mercury 2 and Sonnet 4.6 handle AI curation and RCA. Playwright for E2E. Light and dark mode. 100% built by Claude.

Sixteen Phases to Production

Chronicle went from empty repo to production in sixteen phases, covering metrics, logs, traces, alerts, SLOs, incidents, dashboards, cost attribution, RUM, issues, and the OTLP receivers. The production deploy was routine — S3 + CloudFront for the frontend, ECR + ALB + SSM for the backend, Terraform for the whole thing — but the fleet rollout happened in the same session: eight backbone tools (Docket, Cadence, Courier, Tribe, Codex, Vigil, Envoy, Chronicle itself) and four services (auth, purchase, notification, onboarding) all got the init_chronicle() wire-up in their FastAPI lifespan within a single afternoon. The SDK is guarded (try/except) so tools that aren't ready still boot fine; the ones that are installed start shipping OTel the moment they start.
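The guarded wire-up described above might look roughly like the following. This is a sketch, not the code we shipped: the `init_observability` helper, the logger name, and the `chronicle_enabled` flag are illustrative, and the only thing taken from the SDK is the `init_chronicle(service_name=...)` call shown earlier.

```python
import logging
from contextlib import asynccontextmanager

logger = logging.getLogger("startup")


def init_observability(service_name: str) -> bool:
    """Try to start Chronicle; report failure instead of raising."""
    try:
        # The import lives inside the guard so a missing package is
        # handled the same way as a failing init.
        from chronicle_sdk import init_chronicle
        init_chronicle(service_name=service_name)
        return True
    except Exception as exc:  # broad on purpose: telemetry must never block boot
        logger.warning("Chronicle disabled: %s", exc)
        return False


@asynccontextmanager
async def lifespan(app):
    # With FastAPI, pass this as FastAPI(lifespan=lifespan).
    app.state.chronicle_enabled = init_observability("my-service")
    yield
```

Catching the broad `Exception` is deliberate here: a telemetry failure should degrade to "no telemetry", never to "service won't start".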

What's Next

Chronicle is live at chronicle.renkara.com. The immediate next passes bring the remaining services online (avian-engine, the client backends), publish the browser RUM beacon to the static marketing sites (so web vitals and JS errors land in Chronicle), and migrate the metrics hot tier from Postgres to TimescaleDB once volume forces the issue. But the hard part — one schema, one token, one endpoint, one SDK — is already done.