Why Backup and Disaster Recovery Strategies Must Change at Petascale

Andy Watson. May 6, 2019

Traditional Backup/Recovery and Disaster Recovery strategies are no longer feasible at Petascale; they must give way to new cloud-based approaches that enable faster, more efficient, and more cost-effective data and operational recovery.

Traditional Backup no longer makes much sense in the Petascale Era, because performing a full Restore at such scale has become effectively impossible.  Instead, immediate Selective File Restore is the service users now expect after they have accidentally deleted or made undesirable changes to one or more of their files.  This is best accomplished by leveraging Snapshots.  That is one of the reasons for deploying only filesystems that provide a robust Snapshot feature that doesn’t incur any performance or other penalties (e.g., severe capacity overhead, filename character restrictions, network protocol or application software incompatibilities, and so on).  WekaIO’s Matrix™ storage software offers up to 4,096 zero-overhead instantaneous Snapshots per logical filesystem, and up to 1,024 logical filesystems per physical storage cluster.

Self-service user file restoration has never been easier.  (And, of course, WekaIO Snapshots are also used to checkpoint the progress milestones of long-running applications.)

This concept of a Snapshot-based approach to Backup isn’t new, but it is becoming irresistible now that data capacities have outgrown alternative methods.  To remove dependencies on older traditional backup methods, a Snapshot-based approach can simply employ a Snapshot retention schedule.  For example: daily Snapshots are retained for a week; weekly Snapshots are retained for a month; and monthly Snapshots are retained for 3 months.  Not only would this provide all users with multiple choices from immediately-available file images across many points in time, but each of these instances can also be inspected beforehand.  Indeed, the contents of any and all Snapshots are available for review and verification 24×7.  This is unlike backup tapes, for which the contents are not known with certainty until the tapes are mounted.  In other words:  not only does this approach scale better, it also adds an element of quality to the process.
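The retention schedule in the example above can be sketched in a few lines of Python. This is only an illustration of the pruning logic, not WekaIO's scheduler: the function names, the choice of Sunday for weekly Snapshots and the first of the month for monthly Snapshots, and the exact day counts used for "a week", "a month", and "3 months" are all assumptions.

```python
from datetime import date, timedelta

# Retention windows from the example schedule; the day counts are assumptions.
DAILY_KEEP = timedelta(days=7)     # daily Snapshots kept for a week
WEEKLY_KEEP = timedelta(days=31)   # weekly Snapshots kept for a month
MONTHLY_KEEP = timedelta(days=92)  # monthly Snapshots kept for 3 months


def snapshot_kinds(taken):
    """Classify a Snapshot by the day it was taken (hypothetical convention)."""
    kinds = {"daily"}              # every Snapshot counts as a daily
    if taken.weekday() == 6:       # Sunday doubles as the weekly Snapshot
        kinds.add("weekly")
    if taken.day == 1:             # first of the month doubles as the monthly
        kinds.add("monthly")
    return kinds


def should_retain(taken, today):
    """Return True if the Snapshot taken on `taken` is still within a window."""
    age = today - taken
    kinds = snapshot_kinds(taken)
    if age <= DAILY_KEEP:
        return True
    if "weekly" in kinds and age <= WEEKLY_KEEP:
        return True
    if "monthly" in kinds and age <= MONTHLY_KEEP:
        return True
    return False


if __name__ == "__main__":
    today = date(2019, 5, 6)
    for taken in (date(2019, 5, 1), date(2019, 4, 28), date(2019, 4, 26)):
        print(taken, should_retain(taken, today))
```

A pruning job would run a check like `should_retain` over the list of existing Snapshots once a day and delete those that return False.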

Some Backup Admins are quick to point out what is missing in this approach.  With backup tapes, there is a listing of the contents which can be indexed and searched.  In fact, as the data is being written to tape, in some backup solutions it can also be indexed to enable eDiscovery.  It should be obvious, however, that the same tape data that has traditionally been indexed for eDiscovery can also be indexed from Snapshots.  And like tape, the Snapshots are also read-only images.  (It is also possible to convert a read-only Snapshot into a writeable Clone, though, which is one of the reasons why WekaIO will soon be introducing a “software WORM” feature which will lock the file images in a designated Snapshot for a designated time period.)

And there’s more.  (Of course, there’s more!)  In the Petascale Era, for Disaster Recovery (DR), many data centers are leveraging a concept known as “DR-to-the-Cloud” to extend their modernized Backup and avoid the otherwise impossible undertaking of “restoring” their whole environment.  This has many pragmatic cost-saving advantages for DR protection, which is otherwise possible only by mirroring all storage, networking, and compute infrastructure.  Mirroring would incur the significant cost of installing, configuring, and operationally maintaining a mirrored data center at an offsite location, and all data updates occurring in the primary location must be replicated to the remote location.  All costs associated with this old-school approach are incurred on an ongoing, recurring basis even if a DR event never occurs.  This is an expensive premium to pay continuously for an insurance policy, but it was unavoidable in many situations prior to the advent of robust public cloud alternatives.

Consider the DR-to-the-Cloud alternative.  Wherever possible, it makes sense to create a situation where the cloud-based DR infrastructure can be elastically spun up on demand only when needed.  Almost all the associated costs of this approach occur only in the event of a disaster instead of continuously.  This opens a new world of Disaster Recovery strategies.  To enable these strategies, however, you do need to have a mechanism for making the data associated with your production environment available in the Cloud.
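The difference between a continuously paid premium and costs incurred mostly on demand can be made concrete with a small cost model. The sketch below is only a way to plug in your own figures: the function names and parameters are hypothetical, the numbers in the usage example are placeholders in arbitrary cost units rather than real prices, and the model ignores items such as data-transfer fees.

```python
def mirrored_dr_cost(monthly_site_cost, months):
    """Total cost of a mirrored DR site: paid every month, disaster or not."""
    return monthly_site_cost * months


def cloud_dr_cost(monthly_storage_cost, months,
                  dr_events, days_per_event, daily_compute_cost):
    """Total cost of DR-to-the-Cloud under the simplifying assumptions above.

    Storage for the data copy is the only recurring cost; compute is paid
    only for the days a DR event (or a DR test) is actually running.
    """
    recurring = monthly_storage_cost * months
    on_demand = dr_events * days_per_event * daily_compute_cost
    return recurring + on_demand


if __name__ == "__main__":
    # Placeholder figures in arbitrary cost units, not real prices.
    months = 36
    print("mirrored:", mirrored_dr_cost(monthly_site_cost=100, months=months))
    print("cloud:   ", cloud_dr_cost(monthly_storage_cost=10, months=months,
                                     dr_events=1, days_per_event=14,
                                     daily_compute_cost=50))
```

The shape of the result matters more than the placeholder values: the mirrored-site total grows with every month, while the cloud total grows mainly with the number and length of DR events.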

Using the WekaIO “Snap-to-Object” feature, the data and metadata together are captured for each filesystem, and can be echoed to the Cloud.  Any applications relying on WekaIO-based filesystem infrastructure can then be immediately and elastically spun up in the Cloud.  In this way, only Cloud storage costs associated with the Snap-to-Object image are incurred on a recurring basis.  If AWS is used, then the initial landing point would be S3.  If it meets your DR responsiveness requirements, you could also allow it to be subsequently migrated to less-expensive AWS storage tiers such as Glacier or other related archive-storage services provided by AWS.  In the unlikely event of a disaster at the primary location, the Snap-to-Object image stored by AWS could be rehydrated (perhaps after a delay of a few hours if it must first be re-staged to S3 from Glacier) into a new WekaIO storage cluster resident in the AWS Cloud.  Subsequently, the data would be wholly or partially loaded from S3 into local flash on AWS compute instances to meet your objectives for primary-storage performance for applications running on compute instances also spun up for your DR response.  This Snap-to-Object feature allows for Disaster Recovery strategies that are effective at Petascale.
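The optional migration from S3 to Glacier described above is the kind of thing an S3 lifecycle rule expresses. The following is a minimal sketch of such a rule, not a WekaIO-documented procedure: the bucket name, the prefix, the rule ID, and the 30-day threshold are hypothetical, and whether archiving these objects suits your recovery-time requirements is something to verify before applying anything like it.

```python
def glacier_lifecycle_config(prefix, days_before_archive):
    """Build an S3 lifecycle configuration that moves objects under `prefix`
    to the GLACIER storage class after `days_before_archive` days."""
    return {
        "Rules": [
            {
                "ID": "archive-snap-to-object",   # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_before_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


if __name__ == "__main__":
    config = glacier_lifecycle_config(prefix="dr-snapshots/",
                                      days_before_archive=30)
    print(config)
    # Applying it needs boto3 and AWS credentials, so it is left as a comment:
    #   import boto3
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="example-dr-bucket",           # hypothetical bucket name
    #       LifecycleConfiguration=config,
    #   )
```

Keep in mind the trade-off stated above: objects that have moved to Glacier must be re-staged to S3 before rehydration, which can add hours to the DR response.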

I should point out that an internet network link would be required for daily Snap-to-Object data transfers to AWS.  To establish an initial data copy in the Cloud, a large amount of data might have to be copied.  Either a high-bandwidth link would be needed, or an AWS Snowball device could be used to capture that initial “level zero” data set locally and then be shipped physically to AWS.  Thereafter, unless the rate of data change is very high, it is likely that daily delta transfers could be done in the background over existing ordinary internet network links.
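A back-of-the-envelope calculation shows why the initial copy and the daily deltas are such different problems. The sketch below assumes a steady link and decimal units (1 PB = 10^15 bytes); the function names, the `efficiency` factor for protocol overhead, and the example sizes are assumptions for illustration only.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_bytes, link_bits_per_sec, efficiency=1.0):
    """Days needed to move `data_bytes` over a link of the given speed.

    `efficiency` is the usable fraction of nominal bandwidth (an assumption;
    real links rarely sustain 100%).
    """
    seconds = (data_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / SECONDS_PER_DAY


def delta_fits_daily(delta_bytes, link_bits_per_sec, efficiency=1.0):
    """True if one day's changed data can be shipped within 24 hours."""
    return transfer_days(delta_bytes, link_bits_per_sec, efficiency) <= 1.0


if __name__ == "__main__":
    PB = 10**15
    TB = 10**12
    GBIT = 10**9
    # Initial "level zero" copy of 1 PB over a 1 Gbit/s link: roughly 93 days.
    print(round(transfer_days(1 * PB, 1 * GBIT), 1))
    # The same copy over 10 Gbit/s: roughly 9 days.
    print(round(transfer_days(1 * PB, 10 * GBIT), 1))
    # A hypothetical 5 TB daily delta over 1 Gbit/s fits within a day.
    print(delta_fits_daily(5 * TB, 1 * GBIT))
```

With these assumed figures, the initial copy takes months on an ordinary link, which is the case for Snowball, while a modest daily delta fits comfortably in the background.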

This DR-to-the-Cloud approach minimizes recurring costs while facilitating a rapid response to disasters that can be non-destructively tested at any time.  In the event of a genuine disaster, the Cloud could host operations for only the time it takes to correct the on-premises outage.  After that is done, a DR-to-On-Premises recapturing of the data from the Cloud to the revived primary environment would restore the original (or possibly completely new) operational data center.  This is very simply done by performing the same Snap-to-Object process in the reverse direction.

The DR-to-the-Cloud solution leverages Cloud resources and the WekaIO Matrix filesystem to provide an effective way to implement Disaster Recovery strategies while eliminating the prohibitive costs of mirroring entire data center environments.  And like the Snapshot-based alternative to traditional Backup/Restore, it is similarly well-suited to the Petascale era by intentional design.

