There is nothing more frustrating for a customer than being unable to restore data from an otherwise perfect backup system that has consistently backed up their data for years. The system had worked flawlessly so far. It had achieved 100% on the Service Level Agreement (SLA) by creating multiple copies of the customer’s data daily and reported close to 100% uptime over the last month.
And yet, the moment the customer lost a set of files and wanted to restore them, they couldn’t. This was not an example of data loss or a durability problem. The data was there, replicated many times. Moreover, it wasn't an instance of downtime; the system was up and running, available to service requests. This is an example of “everything looks right, and yet, restores are failing”.
The above example is unfortunately not fictional, and we hear it all the time from customers of legacy backup systems, including SaaS backup providers. When we designed Alcion, we wanted to raise the bar on addressing this main, simple, and rightful question a customer has: “When I most need the system to work well (during data loss), will it do so (will it restore)?”
Yes, it will, and this blog post provides some context on how we achieved that.
The problem of a backup system being unable to restore data when a customer needs it the most is an old and fundamental issue. But one root cause of this problem has mostly gone away as of this writing (2024), so let’s start with some positive news.
In the pre-cloud world, when folks maintained their own hardware, a main reason for restores not working was that backups were somehow corrupted. This was usually due to bit rot: contents of the backup files were damaged by faulty hardware that flipped bits from 0 to 1 semi-randomly. Sometimes bad software was to blame too, where the code that copied the files did not maintain checksums over the data and wrote incorrect bits. The solution to this problem was to continuously scrub and scan the data to check for bit rot and correct it when possible.
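As a rough illustration of the scrubbing idea described above, the Python sketch below records a SHA-256 digest for each backup file and later rescans the files to flag any whose contents no longer match. The names `scrub` and `flip_bit` and the in-memory dictionaries are illustrative assumptions, not taken from any real backup product.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def scrub(blobs: dict, manifest: dict) -> list:
    """Return the sorted names of blobs that are missing or whose
    current digest differs from the digest recorded in the manifest."""
    corrupted = []
    for name, expected in manifest.items():
        data = blobs.get(name)
        if data is None or sha256_hex(data) != expected:
            corrupted.append(name)
    return sorted(corrupted)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate bit rot by flipping a single bit (hypothetical helper)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)
```

A scrubber like this can only detect damage. Correcting it requires a second healthy copy or some form of redundancy to repair from.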
Fortunately, this is not the problem this blog post addresses. Alcion is a next-generation cloud-native data protection system that uses cloud storage (AWS S3) to store backups. AWS S3 in turn internally provides extremely robust protection against low-level data corruption.
So why is this still a hard problem? There are a couple of reasons.
First, code evolves and new features are added all the time. It is common at Alcion to have multiple deployments to production each week, and we strive for a Continuous Deployment (CD) model. For example, the schema used to encode a backup may become more efficient and change over the years. As another example, the source of the backup (e.g., Microsoft 365) could change its APIs over time, deprecate certain resource types while adding new ones, change the semantics of the APIs, or modify the encoding of data. As a result, we might inadvertently introduce backward-incompatible changes on the restore path, which means that a restore that would succeed today might fail a couple of weeks from now. While we have backward-compatibility tests, we wanted something stronger to validate the resiliency of our data formats.
Second, no amount of unit, integration, or API tests can fully represent the rich diversity of how real customer data is organized. While we have great unit and integration testing coverage, it isn’t always possible to cover (or even know of) all the corner cases in the cross product of the extremely large and constantly evolving Microsoft 365 API surface and the diversity of how customers use the Microsoft 365 platform. These corner cases could lead to bugs such as an inability to restore from deeply nested folders, restore to calendars, or bring data back to Microsoft Teams.
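To make the compatibility risk concrete, here is a minimal sketch, assuming a hypothetical item format that carries a `schema_version` field. The field names, the two versions, and the decoder registry are all invented for illustration and do not describe Alcion's actual backup format. The point is only that every schema version ever written must stay decodable for old backups to remain restorable.

```python
import json

# Hypothetical registry mapping schema version -> decoder function.
DECODERS = {}


def decoder(version):
    """Decorator that registers a decoder for one schema version."""
    def register(fn):
        DECODERS[version] = fn
        return fn
    return register


@decoder(1)
def decode_v1(payload):
    # Version 1 (assumed) used the abbreviated key "subj".
    return {"subject": payload["subj"], "body": payload["body"]}


@decoder(2)
def decode_v2(payload):
    # Version 2 (assumed) renamed the key to "subject".
    return {"subject": payload["subject"], "body": payload["body"]}


def encode_item(version, payload):
    """Wrap a payload in an envelope that records its schema version."""
    return json.dumps({"schema_version": version, "payload": payload}).encode()


def decode_item(raw):
    """Decode an item with the decoder matching its recorded version."""
    envelope = json.loads(raw)
    version = envelope["schema_version"]
    if version not in DECODERS:
        raise ValueError(f"no decoder for schema version {version}")
    return DECODERS[version](envelope["payload"])
```

If a later release dropped `decode_v1`, new backups would still round-trip, and only a restore of an older backup would reveal the break.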
At Alcion we work hard to earn and maintain our customers’ trust with their data protection. We have found that a culture built around raising the bar on testing is key to achieving that trust. The table below shows the types of tests we have in our system. While the focus of this blog is on the last row, continuous verification, we briefly describe the rest to set the context (and show why they are not sufficient).
Most software products have unit, integration, and API tests, and so does Alcion. As we operate a cloud-native service in regions around the world, we also implement periodic canaries to ensure our service is running smoothly. These canaries catch and report any availability drops before our customers do. Typically, these tests and canaries use synthetic data to test the system, e.g., they might create a test user with 1 GB of auto-generated data and attempt to back up and subsequently restore and verify this data. Longer-term canaries typically test a chain of operations over time, e.g., a series of backups taken 3 times per day over a year using long-lived production systems owned and operated by Alcion.
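The canary pattern described above can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical `InMemoryBackupService` in place of a real backup service. It generates synthetic files, backs them up, restores them, and compares content digests.

```python
import hashlib
import random


class InMemoryBackupService:
    """Hypothetical stand-in for a real backup service."""

    def __init__(self):
        self._snapshots = {}

    def backup(self, user, files):
        snapshot_id = f"{user}/snapshot-{len(self._snapshots)}"
        self._snapshots[snapshot_id] = dict(files)
        return snapshot_id

    def restore(self, snapshot_id):
        return dict(self._snapshots[snapshot_id])


def generate_synthetic_files(count, size_bytes, seed):
    """Create deterministic pseudo-random test files."""
    rng = random.Random(seed)
    return {
        f"file-{i}.bin": bytes(rng.getrandbits(8) for _ in range(size_bytes))
        for i in range(count)
    }


def digests(files):
    """Map each file name to the SHA-256 digest of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def run_canary(service, user="canary-user", count=4, size_bytes=1024, seed=0):
    """Back up synthetic data, restore it, and report whether it matches."""
    original = generate_synthetic_files(count, size_bytes, seed)
    snapshot_id = service.backup(user, original)
    restored = service.restore(snapshot_id)
    return digests(restored) == digests(original)
```

A scheduler would call `run_canary` periodically and raise an alert whenever it returns `False`.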
Code deployment also needs testing. For Continuous Deployment (CD), we adhere to best practices, employing cell-based architectures. A small percentage of the real customer traffic is directed to a OneBox cell, where any new code or feature is initially deployed and closely monitored.
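One common way to direct a small, stable share of traffic to a cell like this is to hash a tenant identifier into a bucket. The sketch below assumes routing is decided per tenant and that the function is called `pick_cell`. Both are illustrative assumptions and not a description of how Alcion routes traffic.

```python
import hashlib


def pick_cell(tenant_id: str, onebox_percent: float) -> str:
    """Route a tenant to "onebox" or "production".

    The tenant ID is hashed into one of 10,000 buckets, so the same
    tenant always lands in the same cell for a given percentage.
    """
    if not 0 <= onebox_percent <= 100:
        raise ValueError("onebox_percent must be between 0 and 100")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # Each percentage point corresponds to 100 buckets.
    return "onebox" if bucket < onebox_percent * 100 else "production"
```

Hashing keeps the assignment deterministic, so a tenant does not bounce between old and new code from one request to the next.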
Finally, we have Continuous Verification, which addresses the motivation of this blog post. This feature is not directly visible to customers but runs continuously in the background, randomly selecting data to exercise and verify end-to-end code paths, including code paths that are responsible for restoring data when the customer needs to. Let’s dive deeper into this feature next.
Alcion continuously and automatically verifies customer backups. Several considerations led to our design. First, the security and privacy of customer data are the number one priority at Alcion. All the functionality we have built for continuous verification builds on the table-stakes features Alcion already has, such as per-tenant encryption keys and tenant isolation of data and metadata. The verification functions have a limited set of permissions. They never alter customer data or copy it to any external system for verification.
Second, the verification functions make use of the “export” functionality in Alcion’s data plane, which takes a backup, including its items, and prepares them for exporting. Customers use exports to create a local copy of their backup data. Exporting complements restoring and uses much of the same code path. Verification uses export rather than restore because we did not want to create customer-visible state as part of backup verification. Restoring would overwrite the customer’s data or create a new folder with the restored data, making backup verification visible and potentially confusing to customers (because they did not initiate it). Alcion uses the export functionality to verify that items in the backup can be reconstructed, but ultimately exports them to a `/dev/null`-style sink, a destination where data is immediately discarded.
Third, Alcion’s backup verification needs to be scalable. Customers have petabytes of data backed up, and verifying all of it would be costly and unnecessary. Instead, Alcion samples items so that only a percentage of each customer’s backups is tested. This sampling accounts for the diversity of data types to verify, such as emails, calendar entries, OneDrive files, and SharePoint pages. The testing also spans time, covering backups from years ago as well as recent ones from hours ago.
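The idea of exporting to a discarding sink can be sketched as follows. This is a minimal illustration under stated assumptions: items are stored zlib-compressed alongside a SHA-256 digest of their content, and `StoredItem`, `NullSink`, and `verify_backup` are hypothetical names rather than Alcion's actual data plane API.

```python
import hashlib
import zlib
from dataclasses import dataclass


@dataclass
class StoredItem:
    item_id: str
    blob: bytes    # compressed content as held in the backup (assumed format)
    digest: str    # SHA-256 of the original, uncompressed content


def store(item_id: str, content: bytes) -> StoredItem:
    """Build a stored item from raw content (test helper)."""
    return StoredItem(item_id, zlib.compress(content),
                      hashlib.sha256(content).hexdigest())


class NullSink:
    """Discards everything written to it, keeping only a byte count."""

    def __init__(self):
        self.bytes_discarded = 0

    def write(self, data: bytes) -> None:
        self.bytes_discarded += len(data)


def export_item(item: StoredItem, sink) -> None:
    """Reconstruct one item and write it to the sink.

    Raises zlib.error or ValueError if the item cannot be reconstructed.
    """
    content = zlib.decompress(item.blob)
    if hashlib.sha256(content).hexdigest() != item.digest:
        raise ValueError(f"digest mismatch for {item.item_id}")
    sink.write(content)


def verify_backup(items, sink) -> list:
    """Export every item to the sink; return IDs of items that failed."""
    failed = []
    for item in items:
        try:
            export_item(item, sink)
        except (zlib.error, ValueError):
            failed.append(item.item_id)
    return failed
```

Because the sink only counts bytes, the full reconstruction path runs without leaving any customer-visible state behind.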
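One way to sample while preserving diversity is stratified sampling: group items by data type and age, then draw from every group. The sketch below is an assumption about how such sampling could look, not Alcion's actual algorithm. The `age_bucket` thresholds and the item dictionary shape are invented for illustration.

```python
import random
from collections import defaultdict


def age_bucket(age_days: int) -> str:
    """Coarse, hypothetical age classes for a backed-up item."""
    if age_days < 30:
        return "recent"
    if age_days < 365:
        return "this-year"
    return "older"


def sample_for_verification(items, fraction, rng):
    """Pick roughly `fraction` of the items from every (type, age) group.

    At least one item is taken from each non-empty group, so rare data
    types and very old backups are never skipped entirely.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    strata = defaultdict(list)
    for item in items:
        strata[(item["type"], age_bucket(item["age_days"]))].append(item)
    picked = []
    for key in sorted(strata):
        group = strata[key]
        k = max(1, round(len(group) * fraction))
        picked.extend(rng.sample(group, k))
    return picked
```

Passing a seeded `random.Random` makes a verification run reproducible, which helps when investigating a failure.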
Continuous backup verification has played a crucial role in enabling Alcion to promptly identify and fix code changes that caused unexpected changes in functionality, issues that might have gone undetected by other testing methods. Detecting these problems early, well before they affect the customer, has allowed us to deliver an excellent restore experience. This ensures data correctness for our customers precisely when they need it the most: during the restoration of lost data. If you want to learn more, you can join our Discord community. Haven't tried Alcion yet? It is easy to get started with a 14-day free trial (no credit card required).