Beat Liveness & Durability
Every asynchronous job in TruePPM — CPM recalculation drains, webhook delivery, MS Project imports, retention purges, notification email — is driven by periodic Celery Beat tasks. In a single-pod deployment (the common self-hosted shape) there is exactly one Beat process. If it dies, every drain stops and the outbox tables accumulate indefinitely, with no signal until a downstream consumer notices missing work.
To make that failure visible, the API records a heartbeat and exposes it for monitoring.
How it works
- A `beat.heartbeat` task runs every 30 s and writes the current time to a single `BeatHeartbeat` row.
- `GET /api/v1/health/beat/` reads that row and reports whether the heartbeat is stale, meaning older than `TRUEPPM_BEAT_STALE_SECONDS` (default 120 s, i.e. four missed beats). Staleness is computed on read, so the endpoint reports the truth even when Beat and the workers are completely down. It is the one detector that survives total task-infrastructure failure.
- A `beat.check_stale_heartbeat` task runs every 60 s and logs a `WARNING` when the heartbeat is stale. This is a secondary, in-cluster signal for deployments with no external monitoring.
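Computing staleness on read can be sketched as a pure function of the stored timestamp and the current time. This is a minimal illustration rather than TruePPM's actual code: `is_stale` and `STALE_AFTER` are hypothetical names, and treating "older than" as a strict comparison is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: mirrors the documented TRUEPPM_BEAT_STALE_SECONDS default (120 s).
STALE_AFTER = timedelta(seconds=120)


def is_stale(
    last_heartbeat: Optional[datetime],
    now: datetime,
    stale_after: timedelta = STALE_AFTER,
) -> bool:
    """Return True when no heartbeat exists or the last one is older than the threshold."""
    if last_heartbeat is None:
        # No heartbeat recorded yet is reported as stale.
        return True
    # The age is derived at read time, so no running task is needed to detect staleness.
    return now - last_heartbeat > stale_after


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = is_stale(now - timedelta(seconds=30), now)    # one beat interval old
missed = is_stale(now - timedelta(seconds=150), now)  # past the four missed beats
```

Because the check depends only on a stored timestamp and the clock of whoever reads it, it keeps working when nothing is left to schedule or execute tasks.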
The /api/v1/health/beat/ endpoint
Requires a staff (admin) account: the endpoint exposes operational state, so it is gated with `IsAdminUser`. Responses:
| Condition | Status | Body |
|---|---|---|
| Heartbeat fresh | 200 OK | {"last_heartbeat": "<iso8601>", "stale": false} |
| Heartbeat stale | 503 Service Unavailable | {"last_heartbeat": "<iso8601>", "stale": true} |
| No heartbeat recorded yet | 503 Service Unavailable | {"last_heartbeat": null, "stale": true} |
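The three rows above can be read as a single mapping from heartbeat state to a status code and body. The following is a framework-free sketch of that mapping; `beat_health_response` is a hypothetical helper, not the actual view code.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def beat_health_response(
    last_heartbeat: Optional[datetime], stale: bool
) -> Tuple[int, dict]:
    """Map heartbeat state to the (status code, JSON body) pairs in the table."""
    # A missing heartbeat is always reported as stale.
    stale = stale or last_heartbeat is None
    body = {
        "last_heartbeat": (
            last_heartbeat.isoformat() if last_heartbeat is not None else None
        ),
        "stale": stale,
    }
    return (503 if stale else 200), body


ts = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
status_fresh, body_fresh = beat_health_response(ts, stale=False)
status_none, body_none = beat_health_response(None, stale=False)
```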
The 200 / 503 split lets status-code-driven monitoring alert without parsing the
body.
```sh
curl -fsS -H "Authorization: Bearer $ADMIN_JWT" \
  https://trueppm.example.com/api/v1/health/beat/
# exits non-zero (curl -f) when Beat is stale (HTTP 503)
```

Configuration
| Setting | Default | Purpose |
|---|---|---|
| TRUEPPM_BEAT_STALE_SECONDS | 120 | Age in seconds past which the heartbeat is considered stale, for both the endpoint flag and the WARNING log |
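As an illustration of how such a setting might be resolved, the sketch below reads it from an environment-style mapping and falls back to the documented default. Whether TruePPM reads the value from the environment or from another settings source is an assumption here, and `stale_threshold_seconds` is a hypothetical helper.

```python
import os
from typing import Mapping

DEFAULT_STALE_SECONDS = 120  # documented default


def stale_threshold_seconds(env: Mapping[str, str] = os.environ) -> int:
    """Resolve the staleness threshold, falling back to the default when unset."""
    # Assumption: the setting arrives as a string, as environment variables do.
    raw = env.get("TRUEPPM_BEAT_STALE_SECONDS")
    if raw is None or raw.strip() == "":
        return DEFAULT_STALE_SECONDS
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError("TRUEPPM_BEAT_STALE_SECONDS must be a positive integer")
    return value
```

Since the heartbeat is written every 30 s, a threshold below that interval would report a healthy Beat as stale between beats, so values comfortably above 30 are the sensible range.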
Wiring it into Kubernetes / monitoring
`/api/v1/health/beat/` is authenticated, so it is not a drop-in `httpGet` liveness probe. Use it as follows:
- Basic API liveness/readiness probes → keep using the unauthenticated `GET /api/v1/health/`, which returns `200 {"status": "ok"}` while the API process is up.
- Beat liveness alerting → scrape `GET /api/v1/health/beat/` from Prometheus (or any monitor) with a bearer token, and alert on a non-`200` status code. This is the recommended external detector for the single-Beat SPOF.
- No external monitoring? → the `beat.check_stale_heartbeat` `WARNING` in the worker logs is your fallback signal; forward worker logs to your aggregator and alert on the `check_stale_heartbeat` message.
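For monitors that run custom scripts, the "alert on a non-200 status code" rule can be sketched with the Python standard library alone. `fetch_status` and `should_alert` are hypothetical names, and treating an unreachable API as an alert condition is an assumption about the desired policy.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, token: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for the URL, or None if the API is unreachable."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; the status code (e.g. 503) is on the exception.
        return exc.code
    except OSError:
        # Connection refused, DNS failure, or timeout.
        return None


def should_alert(status: Optional[int]) -> bool:
    """Alert on anything other than 200, including an unreachable API."""
    return status != 200
```

A monitor would call `should_alert(fetch_status(url, token))` on a schedule, with `url` pointing at the deployment's `/api/v1/health/beat/` endpoint and `token` belonging to a staff account. Note that a 401 or 403 from an expired or non-admin token also counts as non-200 under this rule, so the alert fires on credential problems as well.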