TL;DR Operational weight = the fraction of engineering time consumed by manual, repetitive, reactive infrastructure work. Google’s benchmark is <50%; teams average 30% in 2026, up from 25% in 2024 — the first year-over-year rise in five years. The backup “story” most teams tell is aspirational: 71% do no failover testing, and 62% skip restoration exercises entirely. The backup floor is 3-2-1-1-0 (three copies, two media types, one off-site, one immutable, zero untested restores). Notable 2026 change: pgBackRest is now unmaintained — migrate to Barman or WAL-G.
Operational Weight
What Counts as Toil
Google’s SRE workbook defines toil as operational work that is manual, repetitive, automatable, reactive, devoid of lasting value, and linear in how it scales with the system [1]. Six attributes identify it:
| Attribute | Diagnostic question |
|---|---|
| Manual | Could software do this without human judgement? |
| Repetitive | Is this the third occurrence this month? |
| Automatable | Has the fix been written down in a runbook? |
| Reactive | Did a page cause this, not a calendar? |
| No lasting value | Will the same symptom recur next week? |
| Linear scaling | Does doubling services double this work? |
Work passing all six tests is pure toil. Work passing three or four still deserves a runbook and an automation ticket.
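The six-question test can be sketched as a small scoring function. This is a minimal illustration, not part of Google's guidance: the `ToilCheck` and `classify` names are hypothetical, and treating five "yes" answers the same as three or four, and anything under three as "below threshold", is an assumption the text does not spell out.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToilCheck:
    """Answers to the six diagnostic questions for one task (True = yes)."""
    manual: bool
    repetitive: bool
    automatable: bool
    reactive: bool
    no_lasting_value: bool
    linear_scaling: bool


def classify(check: ToilCheck) -> str:
    """Map the count of 'yes' answers to a triage label."""
    score = sum(getattr(check, f.name) for f in fields(check))
    if score == 6:
        return "pure toil"
    if score >= 3:
        # Assumption: a score of 5 is handled like 3 or 4.
        return "runbook + automation ticket"
    # Assumption: fewer than three 'yes' answers needs no action yet.
    return "below threshold"


# Example: a weekly manual disk cleanup triggered by a page.
disk_cleanup = ToilCheck(True, True, True, True, True, True)
print(classify(disk_cleanup))
```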
The 50% Ceiling
Google limits SRE toil to 50% of team capacity; the remaining half must go to engineering that compounds [1]. Quarterly surveys of Google’s own SREs show an actual average of 33%, with individual outliers from 0% (pure project work) to 80% [2]. A useful target for most teams is to stay below 50% and work toward 30%.
In 2026 the trend reversed. Average team toil rose to ~30% (up from ~25% in 2024) [3], the first increase in five years, despite widespread adoption of AI operations tooling [4]. Teams that layered automation on top of manual processes they had never mapped ended up automating the wrong things, or created new toil validating AI output.
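Measuring against the ceiling needs only three numbers per week: hours spent on incidents, hours spent on manual tasks, and total engineering hours. The sketch below assumes you already track those figures; the function names and the label strings are illustrative only.

```python
def toil_percent(incident_hours: float, manual_task_hours: float,
                 total_eng_hours: float) -> float:
    """Toil as a percentage of total engineering hours for one week."""
    if total_eng_hours <= 0:
        raise ValueError("total_eng_hours must be positive")
    return 100.0 * (incident_hours + manual_task_hours) / total_eng_hours


def assess(pct: float) -> str:
    """Compare a toil percentage with the 50% ceiling and the 30% target."""
    if pct > 50:
        return "above ceiling: escalate"
    if pct > 30:
        return "under ceiling, above 30% target"
    return "at or below target"


# Example: five engineers at 40 h/week, 40 h on incidents, 30 h on manual tasks.
pct = toil_percent(40, 30, 200)
print(pct, assess(pct))  # 35.0 under ceiling, above 30% target
```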
On-Call as a Lagging Indicator
On-call incident volume is a direct measure of accumulated operational debt. Google’s guidance treats 2–3 actionable incidents per on-call shift as the sustainable maximum [5]. Above that threshold, the rotation is absorbing toil that should be automated away.
Healthy on-call hygiene [6]:
- Primary + secondary on every shift — redundancy prevents heroics
- Follow-the-sun scheduling for 24/7 global coverage
- Quarterly alert audit: any alert firing >5×/week without human action gets automated or suppressed
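The quarterly alert audit can be approximated with a short script over one week of alert history. This is a sketch under assumptions: `AlertEvent` is a hypothetical record you would populate from your paging tool's export, and "without human action" is read here as "no firing of that alert led to any human action".

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    name: str                 # alert rule name
    human_action_taken: bool  # did a responder actually do something?


def audit_candidates(week_of_events: list[AlertEvent],
                     max_per_week: int = 5) -> list[str]:
    """Alerts that fired more than max_per_week times with no human action."""
    fired = Counter(e.name for e in week_of_events)
    acted_on = {e.name for e in week_of_events if e.human_action_taken}
    return sorted(
        name for name, count in fired.items()
        if count > max_per_week and name not in acted_on
    )


events = (
    [AlertEvent("disk_usage_warn", False)] * 6   # noisy, never acted on
    + [AlertEvent("replica_lag", False)] * 5
    + [AlertEvent("replica_lag", True)]          # acted on once, so kept
    + [AlertEvent("cert_expiry", False)] * 3     # below the threshold
)
print(audit_candidates(events))  # ['disk_usage_warn']
```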
DORA MTTR as Operational Health Proxy
Mean Time to Restore (MTTR) is the DORA metric most sensitive to operational maturity [7]. Poor runbooks, absent automation, and untested recovery all surface here before they surface as an outage:
| Tier | MTTR | Deployment frequency |
|---|---|---|
| Elite | < 1 hour | Multiple per day |
| High | < 1 day | Daily – weekly |
| Medium | < 1 day | Weekly – monthly |
| Low | 1 day – 1 week | Less than monthly |
Elite and high performers reach comparable change failure rates; MTTR is what separates them.
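MTTR itself is simple to compute once incidents carry a start and a restore timestamp. The sketch below assumes each incident is a `(started, restored)` pair; because the table gives the high and medium tiers the same MTTR bound, the helper only reports which MTTR band a value falls into rather than a full tier.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (restored - started) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [restored - started for started, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


def mttr_band(mttr: timedelta) -> str:
    """Band boundaries taken from the MTTR column of the table above."""
    if mttr < timedelta(hours=1):
        return "< 1 hour"
    if mttr < timedelta(days=1):
        return "< 1 day"
    return ">= 1 day"


t0 = datetime(2026, 1, 5, 9, 0)
history = [
    (t0, t0 + timedelta(minutes=30)),
    (t0, t0 + timedelta(minutes=90)),
]
mttr = mean_time_to_restore(history)
print(mttr, mttr_band(mttr))
```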
Where the Weight Accumulates
Self-hosted infrastructure carries operational overhead that managed services abstract away [8]. The five heaviest toil sinks:
| Category | Examples | Automatable? |
|---|---|---|
| Production interrupts | Disk cleanups, memory restarts, cert renewals | ✓ |
| Release shepherding | Manual deploy steps, config changes at cutover | ✓ |
| Migrations | One-time technology transitions, mass refactoring | Partial |
| Security patches | OS upgrades, CVE triage, access reviews | Partial |
| Capacity planning | Reserved instance sizing, scaling events | Partial |
The highest-ROI automation target: production interrupts and release shepherding — fully automatable, highest recurrence, lowest risk to automate incorrectly.
Runbook Automation Tools
Runbook automation converts documented recovery procedures into executable workflows triggered from your incident tooling [21]. Typical P1 MTTR without automation: 45–60 min; 12 min of that is coordination overhead alone.
| Tool | Model | Strength |
|---|---|---|
| incident.io | SaaS | Slack-native, automatic audit trail, 37% MTTR reduction (Favor case study) [21] |
| PagerDuty Process Automation | SaaS | AI-suggested runbooks from past incident history, enterprise RBAC [22] |
| Rundeck ⭐ 6.1k | Self-hosted | Open-source script executor, web console, API service |
The Backup Story
3-2-1-1-0: The Modern Floor
The classic 3-2-1 rule — three copies, two media types, one off-site — remains the starting baseline [9]. The expanded 3-2-1-1-0 framework adds two requirements that ransomware made non-negotiable [10]:
| Element | What it means | Why it matters |
|---|---|---|
| 3 copies | Production + two independent backups | Single copy is not a backup |
| 2 media types | e.g. block storage + object storage | One failure mode cannot hit both |
| 1 off-site | Cloud region or physical distance from primary | Fire, flood, or rack failure protection |
| 1 immutable | S3 Object Lock / WORM policy, min 14–30 day retention | Ransomware cannot encrypt what it cannot write [9] |
| 0 errors | Every backup tested before a crisis, not during | An untested backup is an assumption |
Only 58% of organizations use immutable storage across all their data [9]. The other 42% leave at least some data with no defence against admin credential compromise or ransomware that escalates to backup infrastructure.
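The five elements of 3-2-1-1-0 can be checked mechanically against an inventory of copies. The sketch below assumes a hand-maintained inventory; the `DataCopy` fields and the example entries are hypothetical, and it counts production as one of the three copies, as the table does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str                  # e.g. "block", "object"
    offsite: bool = False
    immutable: bool = False
    restore_tested: bool = False
    is_production: bool = False


def check_32110(copies: list[DataCopy]) -> dict[str, bool]:
    """Evaluate each 3-2-1-1-0 element for one dataset."""
    backups = [c for c in copies if not c.is_production]
    return {
        "3 copies": len(copies) >= 3,
        "2 media types": len({c.media for c in copies}) >= 2,
        "1 off-site": any(c.offsite for c in backups),
        "1 immutable": any(c.immutable for c in backups),
        # Zero untested restores: every backup must have passed a restore test.
        "0 errors": bool(backups) and all(c.restore_tested for c in backups),
    }


inventory = [
    DataCopy("primary-db", media="block", is_production=True),
    DataCopy("local-snapshot", media="block", restore_tested=True),
    DataCopy("remote-bucket", media="object", offsite=True, immutable=True),
]
for element, ok in check_32110(inventory).items():
    print(f"{element}: {'ok' if ok else 'MISSING'}")
```

In this example every element passes except the last, because the off-site copy has never been restored.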
RTO and RPO
Write these down before choosing backup cadence. They are constraints, not aspirations:
- RTO (Recovery Time Objective): maximum acceptable downtime — dictates how fast you must restore
- RPO (Recovery Point Objective): maximum acceptable data loss measured in time — dictates how often you must back up
If your RTO is 4 hours and your last restore drill took 6 hours, you have an operational gap, not a backup.
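Both objectives can be checked against measured numbers. In the sketch below, the names are illustrative, and the worst-case data loss is approximated by the interval between backups, which is an assumption that ignores how long a backup takes to complete.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, measured in time


def find_gaps(objectives: RecoveryObjectives,
              measured_restore: timedelta,
              backup_interval: timedelta) -> list[str]:
    """Compare the last drill and the backup schedule with the objectives."""
    gaps = []
    if measured_restore > objectives.rto:
        gaps.append(
            f"RTO gap: restore took {measured_restore}, "
            f"objective is {objectives.rto}"
        )
    if backup_interval > objectives.rpo:
        gaps.append(
            f"RPO gap: backups run every {backup_interval}, "
            f"objective is {objectives.rpo}"
        )
    return gaps


# The example from the text: a 4-hour RTO and a 6-hour restore drill.
targets = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
for gap in find_gaps(targets, timedelta(hours=6), timedelta(minutes=15)):
    print(gap)
```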
The Reality Gap
The gap between “we have backups” and “we can recover” is where most teams live:
- 62% of organizations fail to do regular backup and restoration exercises [11]
- 71% do no failover testing at all [11]
- 63% risk reintroducing dormant malware during restoration because they skip validation [14]
Most backup plans pass a documentation review and fail the first restore drill.
Testing Cadence
Industry best practice uses a tiered cadence [13] — the key is that the quarterly full DR exercise uses actual timers and produces an actual measured RTO [12]:
| Cadence | Scope |
|---|---|
| Monthly | Granular restore of critical systems — spot-check specific files or DB rows |
| Quarterly | Full DR exercise — take a service to zero, restore from backup, measure real RTO |
| After every major change | Ad-hoc validation after cloud migrations, upgrades, or security incidents |
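A scheduled job can flag drills that have slipped past their cadence. The sketch below covers the two calendar-based tiers only; the drill names and the 31-day and 92-day limits are assumptions standing in for "monthly" and "quarterly".

```python
from datetime import date

# Assumed maximum age in days for each calendar-based drill.
CADENCE_DAYS = {
    "monthly_granular_restore": 31,
    "quarterly_full_dr": 92,
}


def overdue_drills(last_run: dict[str, date], today: date) -> list[str]:
    """Drills that have never run or ran longer ago than the cadence allows."""
    return sorted(
        name for name, max_age in CADENCE_DAYS.items()
        if name not in last_run or (today - last_run[name]).days > max_age
    )


drill_log = {"monthly_granular_restore": date(2026, 3, 1)}
print(overdue_drills(drill_log, today=date(2026, 3, 20)))
```

Here the full DR exercise is reported as overdue because it has no recorded run at all.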
Backup Tools by Layer
General-Purpose (Files, Object Storage)
| Tool | Stars | UI | Recommendation |
|---|---|---|---|
| Restic | ⭐ 34k | CLI only | Broadest backend support, largest ecosystem [16] |
| Kopia | ⭐ 13.4k | Web + CLI | Faster parallel uploads, built-in repository server, granular retention [15] |
Both use content-defined chunking for deduplication and AES-256 encryption. Choose Kopia for web-based management or multi-machine central repos; choose Restic for the broadest backend and integration support.
PostgreSQL
pgBackRest is unmaintained as of April 2026 [17] — do not use for new deployments; plan migration for existing ones. The maintainer (David Steele) has stopped work; existing installations continue to function but receive no bug fixes or security updates.
| Tool | Stars | Status | Best for |
|---|---|---|---|
| pgBackRest | — | ⚠ Unmaintained | Existing installs only — plan migration |
| Barman | ⭐ 3.2k | Active (EDB) | On-premises; closest functional pgBackRest replacement [18] |
| WAL-G | ⭐ 4.1k | Active | Cloud / object storage + Kubernetes environments [19] |
For migrations off pgBackRest: test the replacement tool in parallel before cutover, benchmark your actual restore times, and only switch once your measured RTO is verified [17].
Kubernetes
Velero ⭐ 10k backs up cluster state and persistent volumes, and defines its operations as GitOps-friendly CRDs (Backup, Restore, Schedule), so backup policies are version-controlled objects alongside application manifests [20]. Kopia arrived as a file-system backup uploader in Velero 1.10 and became the default in 1.12; Restic remains available but is slower for large data volumes.
Action Checklist
- Calculate your current toil %: (hours on incidents + manual tasks) ÷ total eng hours per week. Above 50% requires management escalation [1].
- Count actionable on-call alerts per shift this week. Above 3 means unresolved runbook debt [5].
- Write RTO and RPO for each critical service before touching backup cadence.
- Verify at least one backup copy is immutable (S3 Object Lock, MinIO WORM, or equivalent).
- Schedule a restore drill for next month — one critical service, take it to zero, time the actual recovery.
- Migrate away from pgBackRest if it is anywhere in your stack [17].