Alexandria Leary | May 17, 2024 1:34:29 PM | 5 min read

Weathering the Storm - A Guide to Cloud Recovery and Resilience

Imagine an on-premises server outage. Replacing hardware can be a logistical nightmare, and it can take days or weeks to get back up and running. Outages affecting a cloud account aren’t that dissimilar, but if your environment is designed correctly, it’s massively faster to re-provision your resources than to acquire new hardware. As an added bonus, you don’t need to pay to maintain expensive backup facilities. While cloud service providers make this easier in a lot of ways, you cannot rely solely on them to recover your environment if something goes wrong.

So, what should you do to make sure you can quickly bounce back from hardware malfunctions, software bugs, or even ‘unprecedented misconfigurations’? 

Create (and practice) a contingency plan

Don't be caught off guard. Develop a comprehensive contingency plan that outlines the steps to take in case of a cloud outage or security incident. This plan should include clearly defined roles, responsibilities, and steps for recovery. All teams that may be involved should be trained on how the plan will be executed.

Contingency plans are only as good as the last time they were tested. Schedule regular exercises to confirm your plan is effective and to identify any gaps in your procedures. Aim for annual testing as a minimum, but quarterly exercises are even better to keep your team sharp. As your cloud environment evolves, so should your plan. Regularly review and update the contingency plan to reflect changes in infrastructure, personnel, or security protocols. As your team learns the ropes, advance from tabletop exercises to ‘real’ or simulated exercises. Include restoring your environment from backups in these tests, to validate that the same steps can be followed as part of your contingency plan.
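One way to make a restore test measurable is to compare checksums of the restored files against digests recorded at backup time. The sketch below is a minimal illustration, not a prescribed method: the manifest format (relative path mapped to a SHA-256 digest) and the function names are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest: dict[str, str], restore_root: Path) -> list[str]:
    """Return the relative paths that are missing or whose digest differs.

    `manifest` maps relative file paths to the SHA-256 digest recorded
    when the backup was taken (a hypothetical format for this sketch).
    An empty result means every file in the manifest restored intact.
    """
    failures = []
    for relative_path, expected in sorted(manifest.items()):
        restored = restore_root / relative_path
        if not restored.is_file() or sha256_of(restored) != expected:
            failures.append(relative_path)
    return failures
```

Running a check like this at the end of each exercise gives you a concrete pass or fail result to record alongside the gaps your team identifies.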

Design and test multiple layers of data backups

To ensure a smooth and quick recovery, it’s imperative that you follow the 3-2-1 backup plan. This industry standard dictates having at least 3 copies of your data, in 2 separate storage services, with 1 copy stored offsite (meaning in another cloud, on-premises, or in a third-party solution). This multi-layered approach protects against various failure scenarios, like hardware malfunctions, software bugs, human error, and natural disasters.
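The 3-2-1 plan is simple enough to check automatically against an inventory of your backups. The following is a rough sketch under a few assumptions: the `BackupCopy` record and its fields are hypothetical, and the inventory is assumed to list every copy of the data, including the primary one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    storage_service: str  # hypothetical label, e.g. "s3", "gcs", "on-prem-nas"
    offsite: bool         # True if stored outside the primary cloud environment


def check_3_2_1(copies: list[BackupCopy]) -> list[str]:
    """Return a list of gaps; an empty list means the 3-2-1 plan is met."""
    gaps = []
    if len(copies) < 3:
        gaps.append(f"need at least 3 copies, found {len(copies)}")
    services = {copy.storage_service for copy in copies}
    if len(services) < 2:
        gaps.append(f"need at least 2 storage services, found {len(services)}")
    if not any(copy.offsite for copy in copies):
        gaps.append("need at least 1 offsite copy, found 0")
    return gaps
```

A check like this could run on a schedule so that a retired bucket or a decommissioned offsite target is noticed before a recovery depends on it.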

Each major cloud platform provides backup solutions:

  • AWS Backup: Automates backups for databases, virtual machines (VMs), and storage solutions, ensuring your critical data is protected within the AWS ecosystem.
  • Google Cloud Backup and DR Service: Provides similar functionality for GCP services, offering a centralized platform to manage and monitor backups across your GCP environment.
  • Azure Backup: Centralizes backup management for all your Azure resources, streamlining your backup process and ensuring consistent protection.

Additionally, archival solutions should be considered to provide cost-effective storage for data or backups that may not be actively used but need to be retained. 
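As an illustration of what an archival policy can look like, the sketch below builds a lifecycle rule that moves older backups to an archival storage class and later expires them. It assumes Amazon S3 as the example platform; the dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` call accepts. The prefix, day counts, bucket name, and helper function name are hypothetical.

```python
def archive_lifecycle_rule(prefix: str, archive_after_days: int, expire_after_days: int) -> dict:
    """Build an S3 lifecycle configuration that archives, then expires, objects.

    Objects under `prefix` transition to the GLACIER storage class after
    `archive_after_days` and are expired after `expire_after_days`.
    """
    if not 0 < archive_after_days < expire_after_days:
        raise ValueError("archive_after_days must be positive and less than expire_after_days")
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/') or 'all'}",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = archive_lifecycle_rule("db-dumps/", archive_after_days=30, expire_after_days=365)

# Applying it would look roughly like this (requires boto3 and AWS credentials;
# "example-backups" is a placeholder bucket name):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-backups", LifecycleConfiguration=config)
```

Choose the retention periods to match your own compliance and recovery requirements; the values here are placeholders.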

For all backups, take a WORM (Write Once, Read Many) approach. Combined with object locking and lifecycle policies, WORM storage can prevent unauthorized alterations to your backups and protect the integrity of your data during a recovery operation.

You can add another layer of resilience by replicating your data across cloud providers or even storing a copy on-premises. This can mitigate the impact of regional failures such as the “GCP Paris Incident.” Although such events are uncommon, a backup held outside your main cloud environment is what will truly save the day if something goes terribly wrong. To be safest, keep a backup in a different account or even with a different cloud service provider.

When backing up to another cloud provider, or even to a different region within the same provider, be sure to understand any egress charges you may incur. Egress charges vary between cloud providers, and they can add up quickly depending on how much data you plan to back up. Research and compare rates to find the most cost-effective option for your multi-cloud backup strategy.
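A quick back-of-the-envelope calculation can help when comparing egress costs. The sketch below is illustrative only: the provider names, per-GB rates, and free allowance are made-up placeholders, not real pricing, so substitute the current published rates for your providers and regions.

```python
def monthly_egress_cost(gb_per_month: float, rate_per_gb: float, free_gb: float = 0.0) -> float:
    """Estimate monthly egress cost: billable GB times the per-GB rate."""
    billable_gb = max(gb_per_month - free_gb, 0.0)
    return round(billable_gb * rate_per_gb, 2)


def compare_providers(gb_per_month: float, rates_per_gb: dict[str, float]) -> list[tuple[str, float]]:
    """Return (provider, estimated cost) pairs sorted from cheapest to most expensive."""
    costs = [(name, monthly_egress_cost(gb_per_month, rate)) for name, rate in rates_per_gb.items()]
    return sorted(costs, key=lambda pair: pair[1])


# Placeholder rates in USD per GB; NOT real provider pricing.
HYPOTHETICAL_RATES = {"provider_a": 0.09, "provider_b": 0.12}

ranking = compare_providers(500, HYPOTHETICAL_RATES)
```

Real pricing is usually tiered and differs by destination, so treat an estimate like this as a starting point rather than a quote.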

Design your cloud infrastructure to be portable and recoverable

Manual configuration is tedious and error-prone. Infrastructure as Code (IaC) can be used to define your infrastructure and automate your production deployments. Using tools like Pulumi or Terraform ensures consistent configurations and minimizes the risk of errors during the recovery phase. IaC can also be used to streamline your backup processes and enforce best practices throughout your environment.

A templatized environment using IaC should be built in a development (dev) environment and used the same way in production (prod). If your team is not familiar with IaC, make sure that you educate them on the importance of avoiding manual changes wherever possible, but especially in prod. This can be a hard habit to break, but it is in your best interest. If you are just starting out, it can be a good idea to ease into it by progressively adding restrictions to prod: first provide notice and monitor for manual changes, then enforce the rules.
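The "monitor first, then enforce" rollout can be pictured as a drift check between what your IaC declares and what is actually deployed. This is a simplified sketch, not how Pulumi or Terraform implement drift detection; the flat settings dictionaries, the `Mode` names, and the returned outcome strings are all assumptions for illustration.

```python
from enum import Enum


class Mode(Enum):
    MONITOR = "monitor"  # notify the team about manual changes
    ENFORCE = "enforce"  # treat manual changes as a hard failure


def find_drift(declared: dict, actual: dict) -> dict:
    """Return {setting: (declared_value, actual_value)} for every mismatch.

    None marks a setting that is absent on one side.
    """
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want, have = declared.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


def handle_drift(drift: dict, mode: Mode) -> str:
    """Map a drift result to an outcome: "ok", "warn", or "block"."""
    if not drift:
        return "ok"
    return "warn" if mode is Mode.MONITOR else "block"


declared = {"instance_type": "m5.large", "public": False}
actual = {"instance_type": "m5.xlarge", "public": False, "debug_port": 8080}
drift = find_drift(declared, actual)
```

Starting in `MONITOR` mode gives the team time to move their manual changes into code before `ENFORCE` mode begins rejecting them.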

Prevent disasters in the cloud

Your cloud environment extends beyond just the data itself. The accounts, projects, and subscriptions that manage your resources are critical assets that require robust security measures. Each cloud provider offers solutions to prevent unauthorized access or accidental deletion:

  • GCP Project Lien: This feature prevents deletion of your GCP projects, adding an extra layer of security against unintended changes.
  • AWS Service Control Policies (SCPs): A feature of AWS Organizations that sets permission guardrails across the accounts in your organization, which can be used to prevent unauthorized account deletion or removal from a governed AWS Organization.
  • Azure Resource Manager Lock: Places a lock on an Azure subscription, resource group, or resource to protect them from accidental user deletions and modifications. 
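To make the AWS item above more concrete, here is a minimal sketch of an SCP that stops member accounts from leaving the organization. It covers only that one action and uses AWS as the example; GCP liens and Azure locks are configured through their own tooling and are not shown. The function name and statement ID are arbitrary choices for this example.

```python
import json


def deny_leave_organization_scp() -> str:
    """Return a JSON SCP document that denies organizations:LeaveOrganization."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLeaveOrganization",
                "Effect": "Deny",
                "Action": ["organizations:LeaveOrganization"],
                "Resource": "*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(deny_leave_organization_scp())
```

A policy like this only takes effect once it is attached to an organizational unit or account in AWS Organizations, so test it on a non-production OU first.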

By implementing these key strategies, you can build a cloud environment that is not only scalable but also resilient and secure. Following the 3-2-1 backup philosophy for your data, IaC, and any other information pertinent to your environment can make all the difference when something goes wrong. If you’d like a second opinion or assistance in making sure your environment can weather any storm, contact us!


The information presented in this article is accurate as of 5/16/24. Follow the ScaleSec blog for new articles and updates.