Cloud Reliability: Disaster Recovery Strategies You Need to Know

By Jose Neto Wednesday, June 30, 2021

IT disasters and attacks can be very expensive. Preparedness is the key to recovering with minimal (or no) lost revenue, business disruption, reputational damage, data loss and downtime. Understanding high availability, fault tolerance, recovery point objective, recovery time objective and disaster recovery is the first step in safeguarding your business from an IT perspective.

Key Definitions

Before we explore the main strategies for disaster recovery, let’s first define a few fundamental concepts.

Disaster

Disasters include events like fires, floods, earthquakes, power outages, network outages, human error, security breaches and hacker attacks.

High availability

A highly available architecture is a collection of hardware and software designed to recover quickly in case of a failure. High availability is like dealing with a flat tire: you pull over, replace the damaged tire and you’re back on the road. You, the user, experience some downtime, but it stays within the tolerable range for the application. In an IT system this would typically be implemented by provisioning failover resources.
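As a rough illustration of failover, the sketch below tries a primary endpoint and falls back to a standby when the primary is unreachable. The endpoint names and the `fake_request` function are hypothetical stand-ins for a real network call; this is a minimal sketch, not a production failover mechanism.

```python
from typing import Callable, Sequence


def call_with_failover(endpoints: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each endpoint in order and return the first successful response."""
    last_error = None
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except ConnectionError as error:
            last_error = error  # this endpoint is down, so try the next one
    raise RuntimeError("all endpoints failed") from last_error


def fake_request(endpoint: str) -> str:
    # Hypothetical stand-in for a real network call: the primary is down.
    if endpoint == "primary":
        raise ConnectionError("primary unreachable")
    return f"served by {endpoint}"


result = call_with_failover(["primary", "standby"], fake_request)
print(result)  # served by standby
```

The user still waits while the failed attempt times out, which is the "some downtime" that separates high availability from fault tolerance.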

Fault tolerance

A fault tolerant system is designed to operate through a failure with no loss of functionality. The end users would experience nothing, or only minor, transient performance degradation. Many aircraft systems, for example, are designed for fault tolerance. If a twin-engine jetliner has an engine flameout midair and that engine won’t restart, the remaining engine is fully capable of continuing the flight safely, with only a mild increase in pilot workload. The passengers would experience only minor annoyances.

In IT systems, a fault tolerant architecture would typically be implemented by means of segregated, fully functional, warm redundancies. Fault tolerant systems are more expensive and more complex to implement than highly available ones. The choice between the two boils down to the consequences of a failure. At the end of the day, it’s a judgement call that each business has to make for itself.

Recovery point objective (RPO)

This is the maximum amount of data a system can afford to lose, expressed in time. It is the maximum tolerable time between the last backup and a failure, so the window during which data could be lost is your maximum exposure interval. The RPO could be anything from near real-time to 24 hours, and again it’s a business decision.
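To make the RPO definition concrete, here is a small sketch that measures the data loss window for one failure and checks whether a backup schedule can satisfy an RPO. It assumes the worst case is a failure just before the next backup, so the worst-case loss equals the backup interval. The function names, timestamps and intervals are illustrative assumptions.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Time between the last backup and the failure: the data that is lost."""
    return failure - last_backup


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, a failure hits just before the next backup is taken."""
    return backup_interval <= rpo


# Illustrative values: nightly backup at 02:00, failure at 13:30 the same day.
lost = data_loss_window(datetime(2021, 6, 30, 2, 0), datetime(2021, 6, 30, 13, 30))
print(lost)  # 11:30:00

print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=4)))  # False
print(meets_rpo(timedelta(hours=1), rpo=timedelta(hours=4)))   # True
```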

Recovery time objective (RTO)

The RTO is the maximum amount of time a system can afford to be inoperative: your maximum downtime. It expresses the permissible time your system can take to recover and typically ranges from a few seconds up to a few hours. The longer the RTO, the more downtime, and with it reputational damage, a business accepts in the event of a failure.
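One way to sanity-check an RTO is to add up the steps of a recovery runbook and compare the total against the objective. The sketch below does that; the step names and durations are made-up assumptions for illustration, not measurements.

```python
from datetime import timedelta

# Hypothetical recovery runbook with estimated durations per step.
recovery_steps = {
    "detect failure": timedelta(minutes=5),
    "provision servers": timedelta(minutes=20),
    "restore database from backup": timedelta(minutes=60),
    "redirect traffic": timedelta(minutes=5),
}


def estimated_recovery_time(steps: dict) -> timedelta:
    """Total downtime if the steps run one after another."""
    return sum(steps.values(), timedelta())


def meets_rto(steps: dict, rto: timedelta) -> bool:
    return estimated_recovery_time(steps) <= rto


total = estimated_recovery_time(recovery_steps)
print(total)  # 1:30:00
print(meets_rto(recovery_steps, rto=timedelta(hours=1)))  # False
print(meets_rto(recovery_steps, rto=timedelta(hours=2)))  # True
```

In this example the database restore dominates the total, which is the kind of bottleneck that keeping a component always running is meant to remove.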

Disaster Recovery Strategies

Having defined the key terms, let’s now turn our attention to disaster recovery strategies and techniques. If your data lives on-premises, you have two high-level options for disaster recovery:

1. You can back up and maintain your data and servers at a separate location. This is the traditional and most expensive way of doing things.

2. You can have your redundancies stored and run in the cloud, which is the more cost-effective option for most applications.

If your data already lives in the cloud, the way to go is to back up data and servers in another data center or even another region of your cloud provider. The important thing to remember is that disaster recovery means having redundancies, and these redundancies must be segregated.

Here’s an overview of the main disaster recovery strategies you need to understand:

Backup and restore

This is what is commonly called a rollback. As the name implies, this strategy involves taking periodic backups and restoring both databases and servers in case of a disaster. While this strategy is the cheapest and easiest to implement, it delivers neither high availability nor fault tolerance, because users will experience significant downtime. In this case, RTO will be in the hours range. RPO will depend on how often you elect to take backups, and typically it will be in the two-digit hour range.

Pilot light

Here the term pilot light refers to the bottleneck: the slowest part of an application to boot up. This strategy involves keeping the pilot light always running, on-premises or in the cloud. The pilot light could be, for instance, a database server. If disaster strikes, you would fail over to the pilot light and fire up the remaining necessary parts of the system.

This strategy is similar to backup and restore, only faster and more expensive. It may lead to high availability, depending on your service-level agreement. With the pilot light, RTO will be in the tens-of-minutes range. RPO will typically be in the single-digit minute range, since you have a standby database instance running at all times.

Warm standby

In this strategy you would have a full backup system always up and running, though at minimum size. If there’s ever a disaster, the warm standby can be scaled to production size. This strategy can be achieved, for instance, with a minimum number of virtual machines and database servers running in the cloud, ready to scale out should the need arise.

The warm standby will lead to high availability, since users will as a rule experience only a small amount of downtime, and it’s more costly than the previous two options. With this strategy the RTO would be in the single-digit minutes: only the time necessary to fail over to an already warm system. RPO would also be in single-digit minutes.

Hot site

This strategy involves maintaining a full-scale system on standby, always up, always running. It is the most expensive disaster recovery strategy but will deliver fault tolerance, zero downtime and real-time RPO. This strategy is reserved strictly for critical services.
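The trade-off between the four strategies can be sketched as a lookup: walk the strategies from cheapest to most expensive and pick the first one whose typical RTO and RPO fit your objectives. The specific figures below are assumptions chosen to sit inside the ranges described above (hours, tens of minutes, single-digit minutes, zero). They are not guarantees from any provider.

```python
from datetime import timedelta

# (name, typical RTO, typical RPO), ordered from cheapest to most expensive.
# The numbers are illustrative assumptions within the ranges given in the text.
STRATEGIES = [
    ("backup and restore", timedelta(hours=4), timedelta(hours=24)),
    ("pilot light", timedelta(minutes=30), timedelta(minutes=5)),
    ("warm standby", timedelta(minutes=5), timedelta(minutes=5)),
    ("hot site", timedelta(0), timedelta(0)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Return the cheapest strategy whose typical RTO and RPO meet the objectives."""
    for name, typical_rto, typical_rpo in STRATEGIES:
        if typical_rto <= rto and typical_rpo <= rpo:
            return name
    return "hot site"  # unreachable for non-negative objectives; kept as a safe default


print(cheapest_strategy(rto=timedelta(hours=8), rpo=timedelta(hours=24)))       # backup and restore
print(cheapest_strategy(rto=timedelta(hours=1), rpo=timedelta(minutes=15)))     # pilot light
print(cheapest_strategy(rto=timedelta(minutes=10), rpo=timedelta(minutes=10)))  # warm standby
print(cheapest_strategy(rto=timedelta(seconds=30), rpo=timedelta(0)))           # hot site
```

Real selection also weighs cost, compliance and operational complexity, so treat this as a starting point for the conversation rather than a decision rule.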

The Takeaway

Disasters are unpredictable, so being prepared is paramount; as with many things in life, after the fact it’s often too late. To learn more about disaster recovery, read AWS' whitepaper. If you liked this blog, stay tuned for the next installment in our four-part series Securing the Cloud with AWS. If you're looking for assistance in this area, explore ICS’ cloud and device cybersecurity services.