Your backups exist. They run every night. The job completes successfully — you’ve seen the green checkmark. Now tell me: when was the last time you restored from one? Under pressure? With half your team on vacation? With the ransomware note still on screen? If the answer isn’t a specific date, you don’t have a disaster recovery plan. You have a comfort blanket.
The TLDR
Disaster recovery is the discipline of getting back on your feet after the worst day. It defines how much data you can afford to lose (RPO), how fast you need to recover (RTO), and the procedures to make both happen under real-world conditions. The plan only works if it’s been tested. Untested backups are Schrödinger’s backups — they exist in a superposition of working and corrupted until you try to restore from them. Ransomware operators know this. They’re counting on it.
The Reality
Organizations that haven’t tested their DR plan don’t have a DR plan. They have a document. Probably a PDF sitting in SharePoint that was last updated when someone changed the cover page date in January.
Ransomware changed the game. The old DR scenarios — fire, flood, hardware failure — assumed the disaster was dumb. It wasn’t trying to outsmart you. Modern ransomware groups are specifically, deliberately targeting backup infrastructure before they encrypt production systems. Veeam backup server exploits, Volume Shadow Copy deletion, wiped DR replication targets — the usual suspects studied your recovery playbook and designed their attacks to break it.
The Verizon DBIR consistently shows ransomware as a top threat action. And the FBI’s IC3 annual reports track the financial carnage — billions in losses, and that’s only the reported incidents. The unreported ones? Just folks quietly paying ransoms and hoping nobody finds out.
The 3-2-1 backup rule exists because people kept learning the same lesson the hard way. Three copies of your data. Two different media types. One offsite. It’s not complicated. It’s just routinely ignored until the day it matters.
How It Works
Business Impact Analysis — Know What Matters
Before you can recover, you need to know what to recover first. A Business Impact Analysis (BIA) identifies your critical systems and quantifies the cost of downtime. Not everything is equally important. Your customer-facing payment system has a different criticality than your internal wiki.
The BIA forces uncomfortable conversations: What’s the dollar-per-hour cost of this system being down? At what point does downtime threaten the survival of the business? Which systems have dependencies that cascade into other failures? NIST SP 800-34 (Contingency Planning Guide) walks through this process in detail. It’s not light reading, but it’s the standard for a reason.
The Recovery Objectives
Three numbers define your DR posture:
RPO — Recovery Point Objective. How much data loss is acceptable. If your last backup was 24 hours ago, your RPO is 24 hours. Everything created or modified since that backup is gone. For a blog, 24 hours might be fine. For a financial trading platform, 24 seconds might be too much. Your RPO determines your backup frequency.
RTO — Recovery Time Objective. How fast you need to be back online. This isn’t how fast you want to be back — it’s the maximum tolerable downtime before the business impact becomes unacceptable. Four hours? Twelve hours? Three days? Your RTO determines your DR architecture.
MTPD — Maximum Tolerable Period of Disruption. The hard ceiling. Beyond this point, the organization starts suffering irreversible damage — regulatory penalties, permanent customer loss, contractual breaches, existential threat. MTPD is the “this is no longer a bad day, this is a going-out-of-business event” threshold. Your RTO must be shorter than your MTPD, or your DR plan is architecturally incapable of saving you.
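The relationships between these three numbers reduce to simple constraints you can check mechanically. Here’s a minimal sketch — the system and all numbers are hypothetical, and the field names are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class RecoveryObjectives:
    """DR posture for one system. All values in hours; numbers are illustrative."""
    rpo_hours: float                # max acceptable data loss
    rto_hours: float                # max tolerable downtime
    mtpd_hours: float               # hard ceiling before irreversible damage
    backup_interval_hours: float    # how often backups actually run
    measured_restore_hours: float   # from the last real restore test

    def violations(self):
        problems = []
        # Backup frequency drives RPO: you can lose up to one full interval.
        if self.backup_interval_hours > self.rpo_hours:
            problems.append("backup interval exceeds RPO")
        # A measured restore slower than the RTO means the objective is fiction.
        if self.measured_restore_hours > self.rto_hours:
            problems.append("measured restore time exceeds RTO")
        # RTO must sit below MTPD or the plan cannot save the business.
        if self.rto_hours >= self.mtpd_hours:
            problems.append("RTO is not below MTPD")
        return problems

# Hypothetical system: nightly backups, 4-hour RTO, 24-hour MTPD,
# but the last restore test took 6 hours.
trading = RecoveryObjectives(rpo_hours=24, rto_hours=4, mtpd_hours=24,
                             backup_interval_hours=24,
                             measured_restore_hours=6)
print(trading.violations())  # ['measured restore time exceeds RTO']
```

Note the third input: the *measured* restore time, not the estimated one. The check is only meaningful if that number comes from an actual test.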
Backup Strategies
Not all backups are created equal.
- Full backup — Everything, every time. Simple, slow, storage-hungry. Your baseline.
- Incremental backup — Only changes since the last backup (full or incremental). Fast, efficient, but restoration requires the full backup plus every incremental in sequence. One corrupted increment breaks the chain.
- Differential backup — All changes since the last full backup. Larger than incremental, but restoration only needs the full plus the latest differential. Simpler recovery at the cost of storage.
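The practical difference shows up at restore time. A rough sketch of how the restore chain is assembled under each strategy — the data layout and names are invented for illustration:

```python
def restore_sequence(backups, strategy):
    """Return the ordered list of backup sets needed for a restore.

    `backups` is a chronological list of (name, kind) tuples, where kind is
    "full", "incremental", or "differential". Illustrative sketch only.
    """
    # Find the most recent full backup: every restore starts there.
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full][0]]
    if strategy == "incremental":
        # Need EVERY incremental after the full, in order.
        # One corrupted link breaks the whole chain.
        chain += [name for name, kind in backups[last_full + 1:]
                  if kind == "incremental"]
    elif strategy == "differential":
        # Only the LATEST differential is needed on top of the full.
        diffs = [name for name, kind in backups[last_full + 1:]
                 if kind == "differential"]
        if diffs:
            chain.append(diffs[-1])
    return chain

week_inc = [("sun", "full"), ("mon", "incremental"), ("tue", "incremental"),
            ("wed", "incremental")]
week_diff = [("sun", "full"), ("mon", "differential"), ("tue", "differential")]
print(restore_sequence(week_inc, "incremental"))    # ['sun', 'mon', 'tue', 'wed']
print(restore_sequence(week_diff, "differential"))  # ['sun', 'tue']
```

Four restore sets versus two, for the same week of changes — that’s the storage-versus-recovery-complexity trade in one picture.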
The 3-2-1 rule is the floor, not the ceiling:
- 3 copies of your data (production + 2 backups)
- 2 different media types (disk + tape, disk + cloud, NAS + external)
- 1 copy offsite (geographically separated)
Modern ransomware has added a critical amendment: 3-2-1-1 — one copy must be air-gapped or immutable. If your backup server is domain-joined and the attacker has domain admin, your backups are already encrypted. An air-gapped backup — physically disconnected from the network — or an immutable backup — write-once storage that can’t be modified or deleted — is the last line of defense.
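The 3-2-1-1 rule is mechanical enough to audit in code. A minimal sketch — the inventory format and field names are made up for this example, not from any backup product:

```python
def audit_3_2_1_1(copies):
    """Check a backup inventory against the 3-2-1-1 rule.

    `copies` is a list of dicts with 'media', 'offsite' (bool), and
    'immutable_or_airgapped' (bool). Returns each check's pass/fail status.
    """
    return {
        "3 copies": len(copies) >= 3,
        "2 media types": len({c["media"] for c in copies}) >= 2,
        "1 offsite": any(c["offsite"] for c in copies),
        "1 immutable/air-gapped": any(c["immutable_or_airgapped"] for c in copies),
    }

# Hypothetical inventory: production disk, local NAS, immutable cloud copy.
inventory = [
    {"media": "disk",  "offsite": False, "immutable_or_airgapped": False},
    {"media": "disk",  "offsite": False, "immutable_or_airgapped": False},
    {"media": "cloud", "offsite": True,  "immutable_or_airgapped": True},
]
print(audit_3_2_1_1(inventory))  # all four checks pass
```

A check like this only tells you the inventory *claims* to satisfy the rule — whether the “immutable” copy actually resists a domain admin is a question for testing, not bookkeeping.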
DR Site Types
Your recovery speed is architecturally constrained by your DR site.
- Hot site — Fully operational mirror of production. Data replication is continuous or near-continuous. Failover in minutes. Expensive. Worth it for systems where RTO is measured in minutes.
- Warm site — Infrastructure in place, but not fully synchronized. Data may be hours behind. Failover takes hours — you need to restore recent data and bring systems online. The middle ground.
- Cold site — Empty facility with power, network, and space. No running systems. Recovery takes days because you’re building the environment from scratch. Cheap. Appropriate for non-critical systems with multi-day RTOs.
- Cloud DR — The modern answer. Spin up infrastructure on demand. AWS, Azure, and GCP all offer DR-specific services. The economics are compelling — you pay for standby capacity, not idle hardware. But cloud DR introduces its own dependencies. If your cloud provider has a regional outage during your disaster, you’re rolling dice on two catastrophes at once.
Testing — The Part Everyone Skips
An untested DR plan is a fiction. You wrote a story about how recovery would go. That’s nice. Here’s what actually happens: the runbook references a server that was decommissioned last year, the backup restore takes four times longer than estimated, and the one person who knows the database recovery sequence is unreachable.
Testing types, from least painful to most honest:
- Tabletop exercise — Walk through the plan verbally. “If X happens, we do Y.” Low cost, low disruption, catches obvious gaps.
- Walkthrough/simulation — Execute the steps without actually failing over production. Verify that the runbook is accurate and the team knows their roles.
- Parallel test — Bring up the DR environment alongside production. Verify it works without disrupting live systems.
- Full interruption test — Shut down production. Recover from DR. This is the real test. It’s expensive, it’s scary, and it’s the only one that tells you the truth.
NIST SP 800-34 recommends testing at least annually. Quarterly is better. Testing after any significant infrastructure change is non-negotiable.
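A cadence only helps if someone tracks it. A trivial sketch of flagging systems whose last restore test is overdue — the systems and dates are hypothetical:

```python
from datetime import date

def overdue_tests(last_tested, today, cadence_days=90):
    """Flag systems whose DR restore test is past the cadence.

    `last_tested` maps system name -> date of last successful restore test.
    A 90-day default matches the quarterly recommendation; tighten per BIA.
    """
    return sorted(name for name, tested in last_tested.items()
                  if (today - tested).days > cadence_days)

history = {
    "payments-db": date(2024, 1, 10),   # tested recently
    "internal-wiki": date(2023, 6, 1),  # not tested in months
}
print(overdue_tests(history, today=date(2024, 3, 1)))  # ['internal-wiki']
```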
How It Gets Exploited
Ransomware operators have read your DR playbook. They’re specifically engineering their attacks to defeat it.
Backup server targeting. Before encrypting production, attackers compromise backup infrastructure. CVE-2023-27532 — a Veeam Backup & Replication vulnerability — was exploited in the wild specifically to extract credentials from backup servers. Once they own the backup server, your recovery plan is dead before the ransom note appears.
Volume Shadow Copy deletion. One of the first things ransomware does on Windows: vssadmin delete shadows /all /quiet. Your local restore points? Gone in one command. T1490 (Inhibit System Recovery) is a well-documented MITRE ATT&CK technique because it’s standard operating procedure.
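Because these commands are so stereotyped, they make good detection signals. A sketch of matching process command lines against common T1490 indicators — the pattern list is illustrative, nowhere near exhaustive, and not a substitute for a real detection rule:

```python
import re

# Command-line patterns commonly associated with T1490 (Inhibit System
# Recovery). Illustrative subset only.
T1490_PATTERNS = [
    re.compile(r"vssadmin(\.exe)?\s+delete\s+shadows", re.IGNORECASE),
    re.compile(r"wmic\s+shadowcopy\s+delete", re.IGNORECASE),
    re.compile(r"bcdedit\s+.*recoveryenabled\s+no", re.IGNORECASE),
    re.compile(r"wbadmin\s+delete\s+catalog", re.IGNORECASE),
]

def flag_t1490(command_lines):
    """Return the command lines matching a known recovery-inhibition pattern."""
    return [cmd for cmd in command_lines
            if any(p.search(cmd) for p in T1490_PATTERNS)]

events = [
    "vssadmin delete shadows /all /quiet",
    "vssadmin list shadows",           # benign: listing, not deleting
    "wbadmin delete catalog -quiet",
]
print(flag_t1490(events))
```

By the time this fires, the attacker is already executing on the box — it’s a tripwire for the response team, not a prevention control.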
Attacking DR replication. If your DR site uses continuous replication, the encryption replicates too. Your hot site faithfully mirrors your ransomware infection in near-real-time. Congratulations — you now have two encrypted environments instead of one.
Timing the attack. Friday night. Holiday weekend. 2 AM. Ransomware groups deploy during off-hours specifically because response times are slower, key personnel are unavailable, and the dwell time before detection is longer. Your DR plan assumes a fully staffed response team. The attackers assume the opposite.
Destroying the printed runbook. Okay, this one’s metaphorical — mostly. But when your DR documentation lives on the same infrastructure that’s been encrypted, you’re recovering blind. The irony is structural: the plan to recover from the disaster is a casualty of the disaster.
What You Can Do
DR planning isn’t glamorous. It’s the thing you invest in hoping you’ll never need. But when you need it, nothing else matters.
- Implement 3-2-1-1. Three copies, two media types, one offsite, one air-gapped or immutable. If your backup infrastructure shares credentials with production Active Directory, it’s not isolated — it’s one credential away from compromise.
- Separate backup credentials. Your backup admin accounts should not be in the same AD forest as production. Different credentials, different authentication, different blast radius. NIST CSF Protect function covers access control segmentation.
- Test quarterly. At minimum. Full interruption testing annually. Tabletop after every significant change. Document the results. Fix what breaks. Test again.
- Define your RPO and RTO in writing. Get business leadership to sign off. When the disaster hits, there’s no time for a debate about priorities. The decisions were made in advance, or they weren’t made at all.
- Have a printed runbook. Physical paper. In a binder. With the DR team lead, at the DR site, and in a fireproof safe. Your digital documentation is encrypted along with everything else. The printed copy survives.
- Practice the communication plan. Who gets called first? How do you reach the team when email and Slack are down? Out-of-band communication — personal cell phones, a pre-arranged Signal group, a physical phone tree — is part of the plan, not an afterthought.
- Assume the backups are compromised until proven otherwise. After a ransomware incident, validate backup integrity before restoring. Restoring from a compromised backup reinfects the environment. Scan backups with updated signatures before you touch production.
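For the integrity half of that last point, one simple layer is comparing each backup against a hash recorded at backup time and stored outside the compromised environment. A minimal sketch with a throwaway file standing in for a backup image — this checks tampering and corruption only; it is not the malware scan, which still has to happen separately:

```python
import hashlib
from pathlib import Path

def verify_backup(path, expected_sha256):
    """Compare a backup file's SHA-256 against a hash recorded at backup time.

    The expected hash must live OUTSIDE the environment being restored
    (offline store, printed runbook appendix), or an attacker with the
    backup server can rewrite both the file and its hash.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            h.update(chunk)
    return h.hexdigest() == expected_sha256

# Usage sketch: a stand-in file plays the role of the backup image.
backup = Path("db-backup.img")
backup.write_bytes(b"nightly dump contents")
known_good = hashlib.sha256(b"nightly dump contents").hexdigest()
print(verify_backup(backup, known_good))  # True
```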
Sources & Further Reading
- NIST SP 800-34 Rev. 1: Contingency Planning Guide — The foundational document for IT contingency and disaster recovery planning
- NIST Cybersecurity Framework — The Recover function defines organizational recovery planning outcomes
- ISO 22301: Business Continuity Management — International standard for business continuity management systems
- CISA Ransomware Guide — Practical guidance including backup and recovery best practices
- MITRE ATT&CK — T1490 (Inhibit System Recovery) — How attackers specifically target recovery capabilities
- ISC2 Resources — Professional resources covering BCP/DR domains
- FBI IC3 Annual Reports — Cybercrime statistics and ransomware impact data