Atlas survey

Operational Weight & Backup Story: Measuring the Hidden Tax on Your Stack

Operational weight is the share of engineering time lost to manual, reactive work: the ceiling is 50%, and the reported average is 30% and rising. The backup story most teams tell is fiction: 71% do no failover testing. The floor is 3-2-1-1-0.

22 sources ~9 min read #198 devops · sre · backup · disaster-recovery · toil · infrastructure · on-call · runbook-automation

TL;DR Operational weight is the fraction of engineering time consumed by manual, repetitive, reactive infrastructure work. Google’s ceiling is 50%; teams average about 30% in 2026, up from about 25% in 2024, the first increase in five years. The backup “story” most teams tell is aspirational: 71% do no failover testing, and 62% do not run regular backup and restoration exercises. The backup floor is 3-2-1-1-0 (three copies, two media types, one off-site, one immutable, zero untested restores). Notable 2026 change: pgBackRest is now unmaintained; migrate to Barman or WAL-G.


Operational Weight

What Counts as Toil

Google’s SRE workbook defines toil as operational work that is manual, repetitive, automatable, reactive, without lasting value, and that scales linearly with the system [1]. Six attributes identify it:

Attribute | Diagnostic question
Manual | Could software do this without human judgement?
Repetitive | Is this the third occurrence this month?
Automatable | Has the fix been written down in a runbook?
Reactive | Did a page cause this, not a calendar?
No lasting value | Will the same symptom recur next week?
Linear scaling | Does doubling services double this work?

Work passing all six tests is pure toil. Work passing three or four still deserves a runbook and an automation ticket.
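The six-question test can be sketched as a small scoring helper. This is a minimal illustration, not part of the SRE workbook: the class and function names are hypothetical, and treating a score of five like a score of three or four is an assumption, since the text only names the 3-4 and 6 cases.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToilCheck:
    """Answers to the six diagnostic questions for one recurring task."""
    manual: bool
    repetitive: bool
    automatable: bool
    reactive: bool
    no_lasting_value: bool
    linear_scaling: bool

    def score(self) -> int:
        # Count how many of the six attributes apply to this task.
        return sum(bool(getattr(self, f.name)) for f in fields(self))


def classify(check: ToilCheck) -> str:
    """Map a score to the action suggested in the text.

    Assumption: five is handled like three or four.
    """
    score = check.score()
    if score == 6:
        return "pure toil"
    if score >= 3:
        return "runbook + automation ticket"
    return "not toil"  # hypothetical label for scores below three


# Hypothetical examples: a paged disk cleanup and a one-off schema migration.
disk_cleanup = ToilCheck(True, True, True, True, True, True)
schema_migration = ToilCheck(True, False, True, False, False, False)
print(classify(disk_cleanup))
print(classify(schema_migration))
```

Scoring tasks this way is mainly useful for ranking an automation backlog; the thresholds are the ones stated above, not a calibrated model.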

The 50% Ceiling

Google limits SRE toil to 50% of team capacity; the remaining half must go to engineering that compounds [1]. Quarterly surveys of Google’s own SREs show an actual average of 33%, with individual outliers from 0% (pure project work) to 80% [2]. Useful target for most teams: below 50%, aiming toward 30%.

In 2026 the trend reversed. Average team toil rose to ~30% (up from ~25% in 2024) [3] — the first increase in five years, despite widespread adoption of AI operations tooling [4]. Teams that added automation on top of uncharted manual processes automated the wrong things, or created new toil validating AI output.

On-Call as a Lagging Indicator

On-call incident volume is a direct measure of accumulated operational debt. Google’s guidance: no more than 2–3 actionable incidents per on-call shift is a sustainable baseline [5]. Above that threshold, the rotation is absorbing toil that should be automated away.

Healthy on-call hygiene [6]:

  • Primary + secondary on every shift — redundancy prevents heroics
  • Follow-the-sun scheduling for 24/7 global coverage
  • Quarterly alert audit: any alert firing >5×/week without human action gets automated or suppressed
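The quarterly alert audit in the last bullet can be approximated with a short script. This is a sketch under assumptions: the `AlertEvent` record and its fields are hypothetical, the input is taken to cover exactly one week, and "without human action" is read as counting only the firings nobody acted on.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class AlertEvent(NamedTuple):
    name: str       # alert rule name (hypothetical field)
    acted_on: bool  # True if a human took action for this firing


def audit_candidates(week_events: Iterable[AlertEvent], threshold: int = 5) -> list[str]:
    """Return alerts with more than `threshold` unacted firings in one week."""
    unacted = Counter(e.name for e in week_events if not e.acted_on)
    return sorted(name for name, count in unacted.items() if count > threshold)


# Hypothetical week of alert history.
events = (
    [AlertEvent("disk_80pct", False)] * 7
    + [AlertEvent("api_5xx", False)] * 4
    + [AlertEvent("api_5xx", True)] * 2
)
print(audit_candidates(events))
```

Alerts returned by `audit_candidates` are the ones to automate or suppress; alerts that humans regularly act on stay out of the list even if they are noisy.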

DORA MTTR as Operational Health Proxy

Mean Time to Restore (MTTR) is the DORA metric most sensitive to operational maturity [7]. Poor runbooks, absent automation, and untested recovery all surface here before they surface as an outage:

Tier | MTTR | Deployment frequency
Elite | < 1 hour | Multiple per day
High | < 1 day | Daily – weekly
Medium | < 1 day | Weekly – monthly
Low | 1 day – 1 week | Less than monthly

Elite and high performers reach comparable change failure rates; MTTR is what separates them.


Where the Weight Accumulates

Self-hosted infrastructure carries operational overhead that managed services abstract away [8]. The five heaviest toil sinks:

Category | Examples | Automatable?
Production interrupts | Disk cleanups, memory restarts, cert renewals | Yes
Release shepherding | Manual deploy steps, config changes at cutover | Yes
Migrations | One-time technology transitions, mass refactoring | Partial
Security patches | OS upgrades, CVE triage, access reviews | Partial
Capacity planning | Reserved instance sizing, scaling events | Partial

The highest-ROI automation targets are production interrupts and release shepherding: both are fully automatable, recur most often, and carry the lowest risk if the automation is wrong.

Runbook Automation Tools

Runbook automation converts documented recovery procedures into executable workflows triggered from your incident tooling [21]. Typical P1 MTTR without automation: 45–60 min; 12 min of that is coordination overhead alone.
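To make "executable workflow" concrete, here is a minimal sketch of a runbook expressed as ordered steps with an audit trail. It is not how incident.io, PagerDuty, or Rundeck implement runbooks; `run_runbook` and the example steps are hypothetical, and real steps would call your own tooling instead of returning constants.

```python
import time
from typing import Callable

# A step is a description plus an action that returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list[Step]) -> list[dict]:
    """Run steps in order, stop at the first failure, return an audit trail."""
    trail = []
    for description, action in steps:
        started = time.monotonic()
        try:
            ok = bool(action())
        except Exception as exc:  # a crashing step counts as a failure
            ok = False
            description = f"{description} (raised {type(exc).__name__})"
        trail.append({
            "step": description,
            "ok": ok,
            "seconds": time.monotonic() - started,
        })
        if not ok:
            break  # hand over to a human rather than continuing blindly
    return trail


# Hypothetical disk-pressure runbook; the lambdas stand in for real commands.
trail = run_runbook([
    ("rotate application logs", lambda: True),
    ("clear package cache", lambda: True),
    ("confirm disk usage below 80%", lambda: False),
    ("resolve incident", lambda: True),
])
for entry in trail:
    print(entry["step"], "ok" if entry["ok"] else "FAILED")
```

Stopping at the first failed step keeps the automation from resolving an incident it did not actually fix, and the trail gives the responder the same record a manual run would have produced by hand.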

Tool | Model | Strength
incident.io | SaaS | Slack-native, automatic audit trail, 37% MTTR reduction (Favor case study) [21]
PagerDuty Process Automation | SaaS | AI-suggested runbooks from past incident history, enterprise RBAC [22]
Rundeck ⭐ 6.1k | Self-hosted | Open-source script executor, web console, API service

The Backup Story

3-2-1-1-0: The Modern Floor

The classic 3-2-1 rule — three copies, two media types, one off-site — remains the starting baseline [9]. The expanded 3-2-1-1-0 framework adds two requirements that ransomware made non-negotiable [10]:

Element | What it means | Why it matters
3 copies | Production + two independent backups | Single copy is not a backup
2 media types | e.g. block storage + object storage | One failure mode cannot hit both
1 off-site | Cloud region or physical distance from primary | Fire, flood, or rack failure protection
1 immutable | S3 Object Lock / WORM policy, min 14–30 day retention | Ransomware cannot encrypt what it cannot write [9]
0 errors | Every backup tested before a crisis, not during | An untested backup is an assumption

Only 58% of organizations use immutable storage across all their data [9]. The other 42% have at least some data with no immutable copy to fall back on if admin credentials are compromised or ransomware escalates to the backup infrastructure.
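The five elements of 3-2-1-1-0 can be checked mechanically against a backup inventory. The following is a sketch with a hypothetical `BackupCopy` record; how you populate it (and what counts as a distinct media type) depends on your environment and is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str            # e.g. "block", "object", "tape"
    offsite: bool
    immutable: bool
    restore_tested: bool  # has a restore from this copy been verified?
    is_production: bool = False


def check_32110(copies: list[BackupCopy]) -> dict[str, bool]:
    """Evaluate each element of the 3-2-1-1-0 rule for one dataset."""
    backups = [c for c in copies if not c.is_production]
    return {
        "3 copies": len(copies) >= 3 and len(backups) >= 2,
        "2 media types": len({c.media for c in copies}) >= 2,
        "1 off-site": any(c.offsite for c in backups),
        "1 immutable": any(c.immutable for c in backups),
        "0 untested": bool(backups) and all(c.restore_tested for c in backups),
    }


# Hypothetical inventory for one database.
inventory = [
    BackupCopy("prod-db", "block", False, False, False, is_production=True),
    BackupCopy("nightly-snapshot", "block", False, False, True),
    BackupCopy("object-lock-bucket", "object", True, True, True),
]
for element, ok in check_32110(inventory).items():
    print(f"{element}: {'pass' if ok else 'FAIL'}")
```

A check like this only validates what the inventory claims; the "0 untested" element is meaningful only if `restore_tested` is set by an actual restore drill.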

RTO and RPO

Write these down before choosing backup cadence. They are constraints, not aspirations:

  • RTO (Recovery Time Objective): maximum acceptable downtime — dictates how fast you must restore
  • RPO (Recovery Point Objective): maximum acceptable data loss measured in time — dictates how often you must back up

If your RTO is 4 hours and your last restore drill took 6 hours, you have an operational gap, not a backup.
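The comparison in the example above is simple arithmetic, sketched here for both objectives. The function name is hypothetical, and using the backup interval as the worst-case data loss is an assumption that holds only if every scheduled backup succeeds.

```python
from datetime import timedelta


def recovery_gaps(
    rto: timedelta,
    rpo: timedelta,
    measured_restore: timedelta,
    backup_interval: timedelta,
) -> list[str]:
    """Compare measured values against the written objectives."""
    gaps = []
    if measured_restore > rto:
        gaps.append(f"RTO missed by {measured_restore - rto}")
    # Assumption: worst-case data loss equals the time between backups.
    if backup_interval > rpo:
        gaps.append(f"RPO missed by {backup_interval - rpo}")
    return gaps


# The case from the text: 4-hour RTO, 6-hour restore drill.
print(recovery_gaps(
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
    measured_restore=timedelta(hours=6),
    backup_interval=timedelta(minutes=15),
))
```

An empty list means the last drill met both objectives; anything else is a gap to close before relying on the backup.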

The Reality Gap

The gap between “we have backups” and “we can recover” is where most teams live:

  • 62% of organizations fail to do regular backup and restoration exercises [11]
  • 71% do no failover testing at all [11]
  • 63% risk reintroducing dormant malware during restoration because they skip validation [14]

Most backup plans pass a documentation review and fail the first restore drill.

Testing Cadence

Industry best practice uses a tiered cadence [13] — the key is that the quarterly full DR exercise uses actual timers and produces an actual measured RTO [12]:

Cadence | Scope
Monthly | Granular restore of critical systems: spot-check specific files or DB rows
Quarterly | Full DR exercise: take a service to zero, restore from backup, measure real RTO
After every major change | Ad-hoc validation after cloud migrations, upgrades, or security incidents
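"Actual timers" for the quarterly exercise can be as simple as wrapping the restore in a monotonic clock. This is a sketch: `timed_drill` is a hypothetical helper, and the `restore` and `verify` callables stand in for whatever your backup tool and health checks actually do.

```python
import time
from datetime import timedelta
from typing import Callable


def timed_drill(
    restore: Callable[[], None],
    verify: Callable[[], bool],
    rto: timedelta,
) -> dict:
    """Time a restore plus verification and compare it with the RTO."""
    started = time.monotonic()
    restore()           # e.g. restore the service from backup
    healthy = verify()  # e.g. row counts, checksums, a smoke test
    elapsed = timedelta(seconds=time.monotonic() - started)
    return {
        "measured_rto": elapsed,
        "verified": bool(healthy),
        "within_rto": bool(healthy) and elapsed <= rto,
    }


# Placeholder callables; a real drill would invoke the restore tooling here.
report = timed_drill(
    restore=lambda: time.sleep(0.01),
    verify=lambda: True,
    rto=timedelta(hours=4),
)
print(report["within_rto"], report["measured_rto"])
```

Timing verification together with the restore matters: a restore that finishes quickly but has not been validated does not count toward the measured RTO.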

Backup Tools by Layer

General-Purpose (Files, Object Storage)

Tool | Stars | UI | Recommendation
Restic | ⭐ 34k | CLI only | Broadest backend support, largest ecosystem [16]
Kopia | ⭐ 13.4k | Web + CLI | Faster parallel uploads, built-in repository server, granular retention [15]

Both use content-defined chunking (for deduplication) and AES-256 encryption. Choose Kopia for web-based management or multi-machine central repositories; choose Restic for the broadest backend and integration support.

PostgreSQL

pgBackRest is unmaintained as of April 2026 [17] — do not use for new deployments; plan migration for existing ones. The maintainer (David Steele) has stopped work; existing installations continue to function but receive no bug fixes or security updates.

Tool | Stars | Status | Best for
pgBackRest | - | ⚠ Unmaintained | Existing installs only; plan migration
Barman | ⭐ 3.2k | Active (EDB) | On-premises; closest functional pgBackRest replacement [18]
WAL-G | ⭐ 4.1k | Active | Cloud / object storage + Kubernetes environments [19]

For migrations off pgBackRest: test the replacement tool in parallel before cutover, benchmark your actual restore times, and only switch once your measured RTO is verified [17].

Kubernetes

Velero ⭐ 10k backs up cluster state and persistent volumes and exposes its operations as CRDs (Backup, Restore, Schedule), so backup policies can be version-controlled alongside application manifests [20]. Velero has supported Kopia for file-system backups since 1.10, and later releases make it the default uploader; Restic remains available but is slower for large data volumes.


Action Checklist

  • Calculate your current toil %: (hours on incidents + manual tasks) ÷ total eng hours per week. Above 50% requires management escalation [1].
  • Count actionable on-call alerts per shift this week. Above 3 means unresolved runbook debt [5].
  • Write RTO and RPO for each critical service before touching backup cadence.
  • Verify at least one backup copy is immutable (S3 Object Lock, MinIO WORM, or equivalent).
  • Schedule a restore drill for next month — one critical service, take it to zero, time the actual recovery.
  • Migrate away from pgBackRest if it is anywhere in your stack [17].
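The first two checklist items are arithmetic and can be scripted. A minimal sketch follows; the function names are hypothetical, the thresholds are the ones quoted in this article, and the example hours are invented for illustration.

```python
def toil_percent(incident_hours: float, manual_task_hours: float,
                 total_eng_hours: float) -> float:
    """(hours on incidents + manual tasks) / total engineering hours, as a percentage."""
    if total_eng_hours <= 0:
        raise ValueError("total_eng_hours must be positive")
    return 100.0 * (incident_hours + manual_task_hours) / total_eng_hours


def toil_status(percent: float) -> str:
    if percent > 50:
        return "above the 50% ceiling: escalate to management"
    if percent > 30:
        return "under the ceiling, above the 30% goal"
    return "at or below the 30% goal"


def oncall_status(actionable_alerts_per_shift: int) -> str:
    # More than 3 actionable alerts per shift signals runbook debt.
    return "runbook debt" if actionable_alerts_per_shift > 3 else "sustainable"


# Invented example: five engineers at 40 h/week, 38 h of incidents, 30 h of manual tasks.
pct = toil_percent(incident_hours=38, manual_task_hours=30, total_eng_hours=200)
print(f"{pct:.0f}%: {toil_status(pct)}")
print(oncall_status(actionable_alerts_per_shift=5))
```

Tracking the same numbers every week matters more than the exact formula, since the trend is what shows whether automation work is paying off.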
