Platform SRE · Live status

Are all services healthy and within SLA right now?

Operational
API Gateway · p99 142ms · 99.99% · Healthy
Auth Service · p99 68ms · 100% · Healthy
Database Cluster · p99 380ms · 99.82% · Degraded
CDN Edge · p99 18ms · 100% · Healthy
Payments · p99 96ms · 99.98% · Healthy
API p99 latency · Last 60 min
142 ms
12 ms vs 1h ago
SLA ceiling: 200ms · 29% headroom
Source: Datadog APM · 15s resolution
Error rate · Last 60 min
0.08%
0.02 pp vs 1h ago
SLA budget: ≤ 0.10%
Source: Datadog · 5xx only
Throughput · Last 60 min avg
8,240 RPS
320 RPS vs 1h ago
Peak today: 9,140 RPS at 08:12
Source: load balancer metrics
Uptime · Trailing 30 days
99.94%
0.02 pp vs last 30d
SLA target: 99.9% · 43.8 min budget/mo
Source: uptime monitors · 1 min checks
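
The headroom and budget figures on the cards above follow from simple arithmetic. A minimal sketch of that arithmetic, assuming an average month of 730 hours for the downtime budget; the function names are hypothetical and not part of any dashboard or Datadog API:

```python
def latency_headroom(p99_ms: float, ceiling_ms: float) -> float:
    """Fraction of the SLA ceiling not yet used by the current p99."""
    return (ceiling_ms - p99_ms) / ceiling_ms


def budget_used(observed: float, budget: float) -> float:
    """Fraction of a budget (e.g. error-rate budget) already consumed."""
    return observed / budget


def downtime_budget_minutes(target: float, period_hours: float = 730.0) -> float:
    """Allowed downtime in minutes for an availability target over a period.

    period_hours defaults to 730, an assumed average month (365 * 24 / 12).
    """
    return (1.0 - target) * period_hours * 60.0


print(f"{latency_headroom(142, 200):.0%}")        # 29%
print(f"{budget_used(0.08, 0.10):.0%}")           # 80%
print(round(downtime_budget_minutes(0.999), 1))   # 43.8
```

With the card values, a p99 of 142 ms against a 200 ms ceiling leaves 29% headroom, and a 0.08% error rate uses about 80% of the 0.10% budget.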

Is API latency trending toward the 200ms SLA ceiling?

ms · p50 / p95 / p99 · last 60 min · red dashed = 200ms SLA

Source: Datadog APM

Is the error rate within the 0.10% SLA budget?

% 5xx errors · last 60 min

Source: Datadog logs

How is request throughput trending vs the 60-min average?

RPS · last 60 min · dashed line = 60-min avg

Source: load balancer

Active incidents requiring attention

Open incidents sorted by severity · assigned on-call

Incident             Service      Severity  Duration  On-call
DB latency spike     Database     SEV-2     24 min    A. García
Memory pressure      Worker pool  SEV-3     1h 12m    J. Kowalski
Slow query backlog   Database     SEV-3     38 min    A. García
CDN cache miss rate  CDN Edge     SEV-4     2h 04m    M. Chen
Source: PagerDuty
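
The ordering in the incident table can be reproduced with a short sort. This is only a sketch: the dictionary layout is hypothetical (it is not the PagerDuty data model), and breaking ties by longest duration first is an assumption inferred from the rows shown, not a documented rule.

```python
import re


def severity_rank(severity: str) -> int:
    """Turn 'SEV-2' into 2 so that lower numbers sort first."""
    return int(severity.split("-")[1])


def duration_minutes(text: str) -> int:
    """Parse durations such as '24 min', '1h 12m' or '2h 04m' into minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


def sort_incidents(incidents: list[dict]) -> list[dict]:
    """Most severe first; within a severity, longest-running first (assumed)."""
    return sorted(
        incidents,
        key=lambda i: (severity_rank(i["severity"]), -duration_minutes(i["duration"])),
    )


incidents = [
    {"incident": "CDN cache miss rate", "severity": "SEV-4", "duration": "2h 04m"},
    {"incident": "Slow query backlog", "severity": "SEV-3", "duration": "38 min"},
    {"incident": "DB latency spike", "severity": "SEV-2", "duration": "24 min"},
    {"incident": "Memory pressure", "severity": "SEV-3", "duration": "1h 12m"},
]

for row in sort_incidents(incidents):
    print(row["severity"], row["duration"], row["incident"])
```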

Latency percentiles come from Datadog APM (p50/p95/p99 across all API gateway requests). The 200ms SLA ceiling applies to p99. Error rate = HTTP 5xx responses ÷ total requests. Uptime = availability of the API gateway endpoint as measured by external monitors. All figures are synthetic and for illustration only.
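
The definitions above can be illustrated on raw samples. The sketch below computes an error rate as 5xx responses divided by total requests, and a percentile using the nearest-rank method. The nearest-rank method is an assumption chosen for simplicity; Datadog APM may aggregate percentiles differently, and the function names are hypothetical.

```python
import math


def error_rate(status_codes: list[int]) -> float:
    """HTTP 5xx responses divided by total requests (0.0 if there are none)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def percentile(samples: list[float], q: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def within_sla(latencies_ms: list[float], ceiling_ms: float = 200.0) -> bool:
    """True when the p99 of the samples is at or below the SLA ceiling."""
    return percentile(latencies_ms, 99) <= ceiling_ms


codes = [200] * 9 + [503]
print(error_rate(codes))                      # 0.1
print(percentile([10, 20, 30, 40, 50], 50))   # 30
```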