Vigil: Service Health Monitoring with Audible Alarms

When you operate a platform with a dozen services, four client applications, and multiple static websites, you need to know when something goes down. Not five minutes from now when a user emails you. Now. Vigil is our service health monitoring dashboard. It polls every endpoint in the Renkara platform, tracks latency over time, manages incidents automatically, and plays an audible alarm through your browser speakers when a service drops.

The Problem with Monitoring SaaS

We looked at Datadog, Uptime Robot, Better Stack, and Pingdom. They all share the same limitation: they are external services that poll your endpoints from their infrastructure, and they charge per check, per site, or per seat. For a platform our size, the monthly cost adds up fast. More importantly, none of them integrates with our internal tools fleet. We wanted monitoring data to feed into the same ecosystem as our issue tracker (Docket), our analytics platform (Pulse), and our leverage metrics system (Fulcrum). Building Vigil in-house gave us that integration for free.

Health Check Engine

At its core, Vigil runs an APScheduler-based polling engine. Each monitored site has a configurable check interval (default 60 seconds), a timeout threshold, an expected HTTP status code, and optional request headers for authenticated endpoints. The engine uses httpx with async connection pooling, so checks run concurrently without blocking each other.

Every check result records the HTTP status code, response time in milliseconds, and whether the check passed or failed. These results accumulate in PostgreSQL and drive the latency sparklines, uptime calculations, and incident detection logic.
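
To make the shape of a check concrete, here is a minimal sketch of a concurrent check pass, not Vigil's actual code. The names SiteConfig, CheckResult, check_site, and check_all are hypothetical, and the 10-second timeout default is an assumption (the text only states the 60-second interval default). The HTTP call is injected as a plain async function so the sketch runs offline with the standard library alone; in the real engine that role would be played by the async HTTP client.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Optional


@dataclass
class SiteConfig:
    name: str
    url: str
    interval_s: int = 60        # default check interval mentioned in the text
    timeout_s: float = 10.0     # assumed default; the text does not give one
    expected_status: int = 200
    headers: dict = field(default_factory=dict)  # optional auth headers


@dataclass
class CheckResult:
    site: str
    status_code: Optional[int]  # None when the request timed out or errored
    response_ms: float
    passed: bool


# Hypothetical fetcher type: takes a site, returns the HTTP status code.
Fetcher = Callable[[SiteConfig], Awaitable[int]]


async def check_site(site: SiteConfig, fetch: Fetcher) -> CheckResult:
    """Run one check and record status, elapsed time, and pass/fail."""
    start = time.perf_counter()
    try:
        status = await asyncio.wait_for(fetch(site), timeout=site.timeout_s)
    except (asyncio.TimeoutError, OSError):
        status = None
    elapsed_ms = (time.perf_counter() - start) * 1000
    return CheckResult(site.name, status, elapsed_ms, status == site.expected_status)


async def check_all(sites: List[SiteConfig], fetch: Fetcher) -> List[CheckResult]:
    """Check every site concurrently; results come back in input order."""
    return list(await asyncio.gather(*(check_site(s, fetch) for s in sites)))
```

Because each check is its own coroutine, one slow or timed-out endpoint delays only its own result, not the rest of the pass.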

Latency Sparklines

Each site card on the dashboard includes a sparkline showing response time over the last several hours. These are rendered with Recharts and update in real time via WebSocket. The sparkline makes latency degradation visible before it becomes an outage. If your API normally responds in 50ms and starts creeping toward 500ms, you see the trend immediately and can investigate before users notice.

Uptime Summaries

Vigil aggregates raw check results into daily uptime percentages stored in a separate summary table. Querying 90-day uptime takes milliseconds because the computation is pre-aggregated. The dashboard displays uptime for 24 hours, 7 days, 30 days, and 90 days. A configurable retention policy prunes raw check results after N days while preserving uptime summaries indefinitely.
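
The pre-aggregation idea can be sketched in a few lines. This is an illustration under stated assumptions, not the production schema: DailySummary, summarize_by_day, and uptime_percent are hypothetical names, and the sketch keeps per-day check counts (from which a daily percentage follows directly) so that multi-day windows are weighted by how many checks actually ran each day.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Iterable, List, Optional, Tuple


@dataclass
class DailySummary:
    day: date
    total_checks: int
    passed_checks: int


def summarize_by_day(checks: Iterable[Tuple[datetime, bool]]) -> List[DailySummary]:
    """Collapse raw (timestamp, passed) check results into one row per day."""
    counts = defaultdict(lambda: [0, 0])  # day -> [total, passed]
    for checked_at, passed in checks:
        bucket = counts[checked_at.date()]
        bucket[0] += 1
        bucket[1] += int(passed)
    return [DailySummary(day, total, ok) for day, (total, ok) in sorted(counts.items())]


def uptime_percent(summaries: List[DailySummary], today: date, window_days: int) -> Optional[float]:
    """Uptime over the last `window_days` days ending at `today`, or None if no data."""
    cutoff = today - timedelta(days=window_days - 1)
    in_window = [s for s in summaries if cutoff <= s.day <= today]
    total = sum(s.total_checks for s in in_window)
    passed = sum(s.passed_checks for s in in_window)
    return None if total == 0 else 100.0 * passed / total
```

A 90-day query then touches at most 90 small rows per site, no matter how many raw checks were recorded in that period.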

Incident Management

Incidents open automatically on the first failed health check and close automatically when the service recovers. Each incident records the site, the failure start time, the recovery time, and the duration. Engineers can add manual notes and acknowledge incidents from the dashboard. The incident timeline provides a complete history of every outage across all monitored services.

When an incident opens, Vigil broadcasts the event over WebSocket to all connected dashboard clients. This triggers the audible alarm.
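
The open-on-first-failure, close-on-recovery rule amounts to a small state machine per site. The sketch below is one way to express it, with hypothetical names (Incident, IncidentTracker, record); the returned event string stands in for whatever message would be broadcast to dashboard clients, which is an assumption about the wire format rather than a description of it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Incident:
    site: str
    started_at: datetime
    resolved_at: Optional[datetime] = None

    @property
    def duration(self) -> Optional[timedelta]:
        """Outage length, or None while the incident is still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.started_at


class IncidentTracker:
    def __init__(self) -> None:
        self.open: Dict[str, Incident] = {}   # site name -> currently open incident
        self.history: List[Incident] = []     # every incident ever opened

    def record(self, site: str, passed: bool, at: datetime) -> Optional[str]:
        """Feed in one check result; return 'opened', 'resolved', or None."""
        incident = self.open.get(site)
        if not passed and incident is None:
            incident = Incident(site, at)
            self.open[site] = incident
            self.history.append(incident)
            return "opened"
        if passed and incident is not None:
            incident.resolved_at = at
            del self.open[site]
            return "resolved"
        return None  # still healthy, or still down: no state change
```

Only the two transitions produce an event, so repeated failures during an ongoing outage do not retrigger the alarm.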

Audible Alarms

This is the feature that makes Vigil feel different from a status page. When a service goes down, your browser plays an alarm tone using the Web Audio API and Howler.js. There are distinct tones for new incidents and for recoveries. The alarm can be muted with a toggle in the dashboard header, so you can silence it during maintenance windows or after you have acknowledged the issue.

The alarm exists because monitoring dashboards are only useful if someone is looking at them. We leave Vigil open in a browser tab during work hours. If the auth service drops at 2:15 PM while we are focused on a code review, the alarm cuts through. No missed Slack notification, no email buried under a hundred others. Just an unmistakable sound that says: something is down, go look.

Real-Time Dashboard

The dashboard displays an aggregated status summary: how many sites are healthy, degraded, or down. Below that, individual site cards show the current status, last check time, response time, uptime percentage, and the latency sparkline. Sites can be sorted and filtered by status, so the ones that need attention bubble to the top.
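
The summary counts and the "needs attention first" ordering are simple to express. This sketch assumes each site already carries one of the three status strings; how a site is classified as degraded is not covered here, and the names SEVERITY, summarize, and sort_by_attention are hypothetical. It is written in Python for consistency with the backend, although the same logic could equally live in the frontend.

```python
from collections import Counter
from typing import Dict, List

# Lower number = more urgent, so "down" sorts first.
SEVERITY = {"down": 0, "degraded": 1, "healthy": 2}


def summarize(sites: List[dict]) -> Dict[str, int]:
    """Count sites per status, always reporting all three statuses."""
    counts = Counter(site["status"] for site in sites)
    return {status: counts.get(status, 0) for status in SEVERITY}


def sort_by_attention(sites: List[dict]) -> List[dict]:
    """Order sites by severity, then by name for a stable display."""
    return sorted(sites, key=lambda site: (SEVERITY[site["status"]], site["name"]))
```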

Every status change propagates over WebSocket. If you have the dashboard open and a site recovers from an outage, the card flips from red to green without a page refresh. The Framer Motion animation library handles the state transitions so they feel smooth rather than jarring.

What Makes It Better Than Off-the-Shelf

Three things. First, cost: Vigil runs on our existing PostgreSQL instance and AWS infrastructure. There is no per-site fee, no per-seat fee, no usage tier. We monitor as many endpoints as we want at whatever interval we want.

Second, integration. Vigil shares the same auth system (RS256 JWT via auth-service), the same diagnostics library (avian-diagnostics), and the same deployment pipeline as every other tool in our fleet. When we add a new service, adding it to Vigil is a single API call.

Third, the audible alarm. No SaaS monitoring tool we evaluated plays sound through the browser. They send emails, Slack messages, PagerDuty alerts. All of those require checking another app. The browser alarm is immediate, ambient, and impossible to ignore if you are at your desk.

Data Retention

Storage for raw health check results grows linearly with the number of sites and the check frequency. Vigil handles this with a configurable retention service that prunes old check results while preserving the aggregated uptime summaries. You get detailed per-check data for recent history and efficient summary data for long-term trend analysis. The retention service runs on a schedule and requires no manual intervention.
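
A pruning pass of this kind reduces to one DELETE against the raw table. The sketch below uses an in-memory SQLite database purely so it is self-contained; Vigil itself runs on PostgreSQL, and the table names (check_results, daily_uptime), column names, and function names here are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timedelta


def create_schema(conn: sqlite3.Connection) -> None:
    """Create hypothetical raw and summary tables for the demo."""
    conn.execute("CREATE TABLE check_results (site TEXT, checked_at TEXT, passed INTEGER)")
    conn.execute("CREATE TABLE daily_uptime (site TEXT, day TEXT, uptime_pct REAL)")


def prune_raw_checks(conn: sqlite3.Connection, now: datetime, retention_days: int) -> int:
    """Delete raw checks older than the retention window; return rows removed.

    Only check_results is touched, so daily_uptime rows survive indefinitely.
    Timestamps are stored as ISO 8601 strings, which compare chronologically.
    """
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    cursor = conn.execute("DELETE FROM check_results WHERE checked_at < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount
```

Running this from a scheduled job with the configured retention window keeps the raw table bounded while the summary table continues to grow by only one row per site per day.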

Key Specs

Spec	Detail
Frontend	React 19, TypeScript 5.6+, Vite 6, Framer Motion, Recharts
Backend	FastAPI, SQLAlchemy 2.0 async, PostgreSQL, APScheduler
Auth	RS256 JWT + API key + SHA-256 service tokens
Real-time	WebSocket for status changes and incident broadcasts
Sound	Web Audio API + Howler.js (distinct tones for down/recovery)
Scheduling	APScheduler (async), configurable per-site intervals
HTTP Client	httpx (async, connection pooling)
Ports	3423 (frontend), 3433 (backend)
Theme	Light and dark mode

Integration Points

Vigil authenticates through the shared auth-service like every tool in the fleet. It exposes a /diagnostics endpoint via the avian-diagnostics library, so other monitoring tools (or Vigil itself, recursively) can check its health. The WebSocket infrastructure follows the same pattern used in Docket, keeping the real-time architecture consistent across tools. Deployment runs through the standard CodePipeline and CloudFront setup, with the frontend served at vigil.renkara.com and the backend at vigil-api.renkara.com.
