Backup Strategies
Overview
Arcus supports a hybrid approach to data backup that provides maximum coverage, greater flexibility, improved resiliency, and lower cost. It utilizes existing cloud automation and resources as the foundation for the solution.
Architecture
Separating the stateful and stateless data is core to the recommended approach. The basic concept is:
- Create assets that can deploy the servers (Stateless)
- Back up the data at regular intervals in an automated way (Stateful)
- Restore the data from the backups to the new servers
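The three steps above can be sketched as a small orchestration routine. The function names (`deploy_from_assets`, `restore_latest_backup`) are hypothetical stand-ins for a team's own IaC and backup tooling, not part of Arcus; the point is only the ordering: rebuild the stateless server first, then restore the stateful data onto it.

```python
# Illustrative sketch of the rebuild-and-restore flow. Both helper
# functions are hypothetical placeholders for real automation.

def deploy_from_assets(system, log):
    # Stateless: rebuild the server from its infrastructure-as-code design.
    log.append(f"deploy:{system}")

def restore_latest_backup(system, log):
    # Stateful: pull the most recent data backup onto the fresh server.
    log.append(f"restore:{system}")

def recover(system):
    """Recover a system: rebuild it first, then restore its data."""
    log = []
    deploy_from_assets(system, log)
    restore_latest_backup(system, log)
    return log
```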
Stateless Data
This encompasses the tools, applications, and configurations for each element on a system, for each system in the environment. Utilizing the principles of Infrastructure as Code, each system has a design that defines how it is built and configured. Using the available framework and industry-standard tools, any system can be rebuilt from scratch in any connected cloud provider, at any time. This allows for rapid cold, warm, and/or hot replacement as necessary. The Infrastructure as Code designs are stored in the Arcus library. Backups of the library are included in the cost of the service and are performed on a standard hourly/daily/weekly rotation (see below), with weekly backups also stored offsite in a secured data storage location.
Stateful Data
This includes the user-generated data on the systems – models, data sets, presentations, etc. The data lives in several locations – on user workstations, on the servers in the environment, and/or on storage services accessible to users. User workstations should be discouraged as a storage location. Workstations should be treated as disposable and replaceable at any time. Not only does that simplify the backup requirements, it makes maintenance, patching, and securing the systems easier. As such, backup of user-deployed systems in Arcus is not part of the service. Users should be using the tools offered – repositories, shared storage, etc. – to store their data.
Storage Service
Users can back up stateful data to other VMs within their infrastructure and/or onto storage buckets in the Arcus Storage Service. If enabled for a Team, the Storage Service provides access to object storage resources from the Arcus and/or commercial cloud providers. Users can mix and match their compute cloud with their storage cloud for offsite backups.
Advantages
The approach reduces the time and cost of backups by focusing on the useful, unique data rather than all of the surrounding “noise.” Most backup strategies rely on capturing the entire system, when only a small percentage of each one is needed. For example, a classic approach to backing up ten servers with 500 GB drives each would initially consume ~5 TB of space (before any compression) per backup event. This adds up very quickly, especially after just a few days. Even a more sophisticated tool that only backs up deltas will start with a broad base and continue to grow quickly. Either way, restore actions can be challenging. They are also slow and disruptive, so testing a restore is often not practical or is just skipped outright. Only during an actual failure does an organization learn that its backups are insufficient or corrupt.
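The arithmetic in the full-image example above, spelled out (this assumes one full backup event per day and no compression or deduplication, as stated):

```python
# Full-image backup growth for the scenario described: ten servers,
# 500 GB drives, one full backup event per day, no compression.
servers = 10
drive_gb = 500

per_event_gb = servers * drive_gb        # 5,000 GB ~= 5 TB per backup event
after_one_week_gb = per_event_gb * 7     # 35,000 GB ~= 35 TB after a week

print(per_event_gb, after_one_week_gb)
```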
Conversely, using an automated rebuild-and-restore approach greatly reduces the footprint of the data preserved. In the same scenario, the ten servers would use a fraction of the space to preserve the useful data. This is especially true for something like a user workstation, which is mostly just a repeat of the same operating system files over and over. Furthermore, having the ability to deploy a system automatically provides other related benefits. Patches and updates can be tested on a new system (with real data as needed), ensuring a smoother application and transition and reducing the possibility of a failure that requires a restore from backup. Major upgrades can become a deploy-and-replace, rather than a live update. Cold, warm, and/or hot spares can be deployed and synced across multiple locations, wherever and whenever needed.
One common mistake is to rely on snapshots as a backup strategy. It is tempting because they are quick and easy. Some reasons why organizations should not rely on snapshots as a primary backup mechanism:
- Large numbers of snapshots are difficult to manage and track
- They consume large amounts of disk space
- They are generally not protected in the case of hardware failure
- They can negatively affect performance
VMware itself states: “Do not run production virtual machines with snapshots on a permanent basis.”
Examples
ALL Systems:
- Automate process of building new system using assets
- Iterate through the initial development of the assets in a sandbox space; promote regularly as ready
- Incorporate minor updates/patches to assets as they are released
- Periodically redeploy a system as a test to ensure deployment and restore continue to work
A SQL Database System:
- Automate export of the database contents and schema using a script or SQL tool
- Push/store encrypted backups into a Storage Bucket
- Redeploy the system using assets
- Use an asset to automatically restore the SQL database schema and data
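The SQL export-and-restore steps above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for a production database and its export tool (a real deployment would use something like pg_dump or mysqldump); encrypting the dump and pushing it to a Storage Bucket are out of scope and noted only as comments.

```python
# Sketch: export a database's schema and data to a SQL dump, then replay
# the dump onto a freshly redeployed system. sqlite3 is a stand-in here.
import sqlite3

def export_database(db_path, dump_path):
    # "Automate export of the database contents and schema using a script"
    con = sqlite3.connect(db_path)
    with open(dump_path, "w") as f:
        for line in con.iterdump():   # emits schema + data as SQL statements
            f.write(line + "\n")
    con.close()
    # Next steps (not shown): encrypt the dump, push it to a Storage Bucket.

def restore_database(dump_path, db_path):
    # On the redeployed system, replay the dump to rebuild schema and data.
    con = sqlite3.connect(db_path)
    with open(dump_path) as f:
        con.executescript(f.read())
    con.commit()
    con.close()
```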
An Active Directory System:
- Use Windows Server Backup to export the configuration
- Push/store encrypted backups into a Storage Bucket
- Redeploy the system using assets
- Add an asset to automatically restore the backup
A WSUS System:
- Redeploy the system using assets
- Pull content fresh from the original source (Arcus, DoD, or Microsoft). It is not worth the time or cost to back up content that is readily available elsewhere.
A user Workstation:
- Redeploy the system using assets
- Pull data from/connect to a remote home directory
Arcus Infrastructure Backups
Backups of the data on the Arcus infrastructure, services, CollabTools, and shared storage run on the following schedule:
- Hourly (saved for last 24 hours)
- Daily (saved for last 7 days)
- Weekly (saved for last 52 weeks)
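The rotation above can be read as a simple retention predicate: given the age of a backup copy, which tiers of the schedule still hold one. The sketch below is illustrative only, not Arcus code; the tier boundaries are the ones the schedule states.

```python
# Which retention tiers still hold a backup copy of a given age,
# under the hourly (24 h) / daily (7 d) / weekly (52 w) rotation.
from datetime import timedelta

def retained_tiers(age: timedelta):
    """Return the schedule tiers that still retain a copy of this age."""
    tiers = []
    if age <= timedelta(hours=24):
        tiers.append("hourly")
    if age <= timedelta(days=7):
        tiers.append("daily")
    if age <= timedelta(weeks=52):
        tiers.append("weekly")
    return tiers
```

For example, a copy from three days ago survives only in the daily and weekly tiers; one from sixty weeks ago has aged out entirely.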
All backup storage is local to the cloud where the data is generated. In addition, a second copy of the daily and weekly backups is stored off-site in an alternate cloud provider and retained for the same 7-day and 52-week cycles, respectively. All data is encrypted in transit and at rest, with program-specific encryption keys.