
Beat Liveness & Durability

Every asynchronous job in TruePPM — CPM recalculation drains, webhook delivery, MS Project imports, retention purges, notification email — is driven by periodic Celery Beat tasks. In a single-pod deployment (the common self-hosted shape) there is exactly one Beat process. If it dies, every drain stops and the outbox tables accumulate indefinitely, with no signal until a downstream consumer notices missing work.

To make that failure visible, the API records a heartbeat and exposes it for monitoring.

  • A beat.heartbeat task runs every 30 s and writes the current time to a single BeatHeartbeat row.
  • GET /api/v1/health/beat/ reads that row and reports whether the heartbeat is stale — older than TRUEPPM_BEAT_STALE_SECONDS (default 120 s, i.e. four missed beats). Staleness is computed on read, so the endpoint reports the truth even when Beat and the workers are completely down — the one detector that survives total task-infrastructure failure.
  • A beat.check_stale_heartbeat task runs every 60 s and logs a WARNING when the heartbeat is stale — a secondary, in-cluster signal for deployments with no external monitoring.
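The compute-on-read staleness check described above can be illustrated with a minimal sketch. This is not TruePPM's actual implementation: the function name is_stale and the constant STALE_SECONDS are hypothetical, and the only values taken from this page are the 30 s beat interval and the 120 s default threshold.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: mirrors the documented TRUEPPM_BEAT_STALE_SECONDS default.
STALE_SECONDS = 120
# Documented interval of the beat.heartbeat task.
BEAT_INTERVAL_SECONDS = 30


def is_stale(last_heartbeat: Optional[datetime], now: datetime,
             stale_seconds: int = STALE_SECONDS) -> bool:
    """Return True when no heartbeat exists or the last one is older than the threshold."""
    if last_heartbeat is None:
        return True
    # Evaluated at read time, so no task has to be running for this to work.
    return (now - last_heartbeat) > timedelta(seconds=stale_seconds)


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(STALE_SECONDS // BEAT_INTERVAL_SECONDS)        # beats missed before the threshold is reached
print(is_stale(now - timedelta(seconds=30), now))    # heartbeat one interval old
print(is_stale(now - timedelta(seconds=121), now))   # heartbeat just past the threshold
print(is_stale(None, now))                           # no heartbeat recorded yet
```

Because the comparison uses only the stored timestamp and the current time, the result stays correct even if neither Beat nor any worker is running.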

The endpoint requires a staff (admin) account: it exposes operational state, so it is gated with IsAdminUser. Responses:

| Condition | Status | Body |
| --- | --- | --- |
| Heartbeat fresh | 200 OK | {"last_heartbeat": "<iso8601>", "stale": false} |
| Heartbeat stale | 503 Service Unavailable | {"last_heartbeat": "<iso8601>", "stale": true} |
| No heartbeat recorded yet | 503 Service Unavailable | {"last_heartbeat": null, "stale": true} |

The 200 / 503 split lets status-code-driven monitoring alert without parsing the body.
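The response mapping above can be sketched as a small pure function. The name health_response is hypothetical and the real view may be structured differently; only the status codes and body fields are taken from this page.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def health_response(last_heartbeat: Optional[datetime], stale: bool) -> Tuple[int, dict]:
    """Map heartbeat state to the documented (status code, JSON body) pair."""
    # 200 only when the heartbeat is fresh; stale or missing both yield 503.
    status = 503 if stale else 200
    body = {
        # ISO 8601 timestamp, or None (JSON null) when nothing was recorded.
        "last_heartbeat": last_heartbeat.isoformat() if last_heartbeat else None,
        "stale": stale,
    }
    return status, body


recorded = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(health_response(recorded, stale=False))  # fresh heartbeat
print(health_response(recorded, stale=True))   # stale heartbeat
print(health_response(None, stale=True))       # no heartbeat recorded yet
```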

curl -fsS -H "Authorization: Bearer $ADMIN_JWT" \
https://trueppm.example.com/api/v1/health/beat/
# exits non-zero (curl -f) when Beat is stale (HTTP 503)

| Setting | Default | Purpose |
| --- | --- | --- |
| TRUEPPM_BEAT_STALE_SECONDS | 120 | Age past which the heartbeat is considered stale, for both the endpoint flag and the WARNING log |

/api/v1/health/beat/ is authenticated, so it is not a drop-in httpGet liveness probe. Use it as follows:

  • Basic API liveness/readiness probes → keep using the unauthenticated GET /api/v1/health/, which returns 200 {"status": "ok"} while the API process is up.
  • Beat liveness alerting → scrape GET /api/v1/health/beat/ from Prometheus (or any monitor) with a bearer token, and alert on a non-200 status code. This is the recommended external detector for the single-Beat single point of failure (SPOF).
  • No external monitoring? → the beat.check_stale_heartbeat WARNING in the worker logs is your fallback signal; forward worker logs to your aggregator and alert on the check_stale_heartbeat message.
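For monitors that run a custom script rather than a scrape job, the alert-on-non-200 rule can be sketched with the Python standard library alone. The function name beat_is_healthy and its opener parameter are hypothetical (the latter exists only so the check can be exercised without a network). Treating an unreachable API as unhealthy is an assumption of this sketch, not documented behavior.

```python
import urllib.error
import urllib.request


def beat_is_healthy(url: str, token: str, timeout: float = 5.0,
                    opener=urllib.request.urlopen) -> bool:
    """Return True only when the Beat health endpoint answers HTTP 200."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # urlopen raises for error statuses: 503 (stale) and 401/403 (bad token) land here.
        return False
    except OSError:
        # Connection refused, DNS failure, or timeout: assume unhealthy so the alert fires.
        return False
```

A cron job or monitoring hook could call beat_is_healthy("https://trueppm.example.com/api/v1/health/beat/", token) and raise an alert whenever it returns False. Only the status code is inspected, so the body never needs to be parsed.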