Interview. Our UniSuper Google Cloud disaster story spurred a discussion with Ricardo Mendes, CEO and co-founder of Portugal-based Vawlt, a software startup focused on moving data securely to the cloud. Its software orchestrates multiple cloud providers for disaster recovery, backup, and data agility, and his views raised interesting points.
Blocks & Files: What do you think of UniSuper’s Google Cloud disaster?
Ricardo Mendes: This situation illustrates a critical point we’ve been advocating for years: All suppliers, including cloud giants, can fail, and cloud-managed geo-replication isn’t synonymous with comprehensive fault tolerance or resiliency. Our customers often ask about the difference between spreading data across multiple clouds and regions versus relying on a single provider’s geo-replication.
The analogy we frequently use is that clouds are very good and reliable “supercomputers,” with all the challenges such an approach brings – single points of failure and dependency, to name just those related to the incident you mentioned. So geo-replication provides resiliency against events like natural disasters, but it doesn’t safeguard against cloud-level incidents.
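The distinction Mendes draws – spreading data across independent providers rather than relying on one provider’s geo-replication – can be illustrated with a toy sketch. Everything here is hypothetical: the `MultiCloudStore` class and the dictionary “providers” are stand-ins for real cloud object stores, not Vawlt’s actual implementation.

```python
class MultiCloudStore:
    """Toy client-side replication layer across independent providers."""

    def __init__(self, providers):
        # providers: name -> backend; plain dicts stand in for cloud stores
        self.providers = providers

    def put(self, key, value):
        # Write the object to every provider independently.
        for backend in self.providers.values():
            backend[key] = value

    def get(self, key):
        # Read from the first provider that still holds the object, so a
        # cloud-level incident at one provider is survivable.
        for backend in self.providers.values():
            if key in backend:
                return backend[key]
        raise KeyError(key)

    def fail_provider(self, name):
        # Simulate a provider-wide incident: all data at that CSP is gone.
        self.providers[name] = {}

store = MultiCloudStore({"cloud_a": {}, "cloud_b": {}, "cloud_c": {}})
store.put("backup.tar", b"payload")
store.fail_provider("cloud_a")   # entire provider lost, not just a region
print(store.get("backup.tar"))   # still recoverable from the other clouds
```

The point of the sketch is that the replication logic lives in the client’s layer, so no single provider’s failure mode – or geo-replication design – is a single point of failure.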
Blocks & Files: How do you view customers depending on a single cloud?
Ricardo Mendes: Blind trust in a single cloud provider means abdicating control over critical disaster recovery strategies and activities, weakening an organization’s ability to respond effectively to incidents. As you stated in your article, this dependency isn’t just operational, since companies depend on their cloud providers’ responses and transparency when something goes wrong (and I agree with you that the smaller the company, the worse this tends to be). Relying solely on a single cloud provider compromises business continuity, independence, and organizational sovereignty (and, as a side note, precludes economic and operational efficiencies).
For these reasons, cloud independence should be a more frequently discussed topic. A world where this independence is the standard is, in my opinion, the way to go. And let me be clear: Vawlt does not fully solve the issue, but one thing I know is that it helps. I believe this event is a crucial lesson for the industry on the importance of this matter and on the market space for solutions (and services) that address it.
Blocks & Files: Could a CSP use logical air-gapping to separate out an internal-to-their-cloud disaster recovery facility such that a user subscription cancellation would not automatically cancel all the customer’s cloud infrastructure? The DR site would need multi-factor authentication for delete changes or could have an immutability characteristic with set-in-stone retention periods.
Ricardo Mendes: From the CSP’s point of view, they can (and should) have the maximum measures in place to segregate different internal systems, making each individual failure independent. A solution like the one you suggest could solve the specific problem of UniSuper, but I think the issue is much broader.
No matter how many measures CSPs take internally, given that they are a single organization, they could simply become inoperable as a whole – not just due to technical issues from unwanted incidents, but also due to the CSP’s own decisions (discontinuing a particular service, changing prices).
My stance is that the measures to ensure the independence, sovereignty, and business continuity of companies should be taken by those companies themselves, completely independently of the guarantees provided by CSPs.
The concept of the Supercloud now starting to be discussed, although there is still no consensus on its definition, is fundamentally about creating an “abstract cloud” decoupled from the CSPs it is built upon. Buzzword or not, I advocate that there is room for software layers that leverage the resources clouds provide while ensuring vendor independence (and, by the way, bringing other benefits such as cost reduction).
Using Vawlt as an example, what we do with data storage can and should be done for other areas of infrastructure and cloud services in general. Aviatrix, for example, operates in this space with cloud networking solutions that are distributed and fault-tolerant at the CSP level.
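The “immutability characteristic with set-in-stone retention periods” raised in the question can be modeled roughly as follows. This is a simplified, hypothetical stand-in for object-lock or compliance-mode retention features that real object stores offer; the `ImmutableStore` class and its method names are illustrative only.

```python
import time

class ImmutableStore:
    """Toy model of write-once storage with a fixed retention period:
    deletes are refused until the retention clock expires, regardless of
    who asks -- including an account owner cancelling a subscription."""

    def __init__(self):
        self._objects = {}   # key -> (value, retain_until_timestamp)

    def put(self, key, value, retention_seconds):
        if key in self._objects:
            raise PermissionError("object is immutable; overwrite refused")
        self._objects[key] = (value, time.time() + retention_seconds)

    def get(self, key):
        return self._objects[key][0]

    def delete(self, key, now=None):
        now = time.time() if now is None else now
        _, retain_until = self._objects[key]
        if now < retain_until:
            raise PermissionError("retention period not expired; delete refused")
        del self._objects[key]

store = ImmutableStore()
store.put("dr-snapshot", b"state", retention_seconds=3600)
try:
    store.delete("dr-snapshot")   # refused: retention still in force
except PermissionError as err:
    print(err)
```

Under these semantics, a subscription cancellation could mark data for eventual removal but could not erase the DR copy before its retention period runs out – which is exactly the gap the UniSuper incident exposed.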
Blocks & Files: Would a user having two CSP subscriptions, with subscription 2 being the DR facility for subscription 1, work such that it would prevent a UniSuper-type disaster?
Ricardo Mendes: Operationally, yes. However, I would add a concern about how the DR scheme is set up – the client should have control.
An approach where the client uses a software layer to replicate their DR capability between two CSPs entirely independently of each of them is, in my opinion, the way forward.
I mean operating this software layer both to respond to disasters (including the migration process), avoiding the operational unavailability that occurred in the UniSuper case, and to ensure that the incident response is entirely under the organization’s control rather than the CSP’s – and thus not dependent on the CSP’s support response time.
Here, I should add that there is market space precisely for solutions that offer these kinds of guarantees in a very simple way by design. Technologies exist that, when combined, enable what I am describing, but they tend to be extremely complex and expensive, which puts them within reach of only the largest players.
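The two-subscription DR arrangement discussed above can be sketched in a few lines. Again this is a hypothetical illustration, not any vendor’s product: the dictionaries stand in for two CSP subscriptions, and the `FailoverClient` class shows where the control sits – in the client’s own software layer, so recovery never waits on a provider’s support queue.

```python
class FailoverClient:
    """Toy client-controlled DR layer spanning two CSP subscriptions."""

    def __init__(self, primary, secondary):
        self.primary = primary      # subscription 1 (live)
        self.secondary = secondary  # subscription 2 (DR copy)

    def write(self, key, value):
        # Replicate synchronously so the DR site is always current.
        self.primary[key] = value
        self.secondary[key] = value

    def read(self, key):
        try:
            return self.primary[key]
        except KeyError:
            # Primary unavailable or its data deleted: the client itself
            # fails over to the DR subscription -- no CSP intervention.
            return self.secondary[key]

primary, secondary = {}, {}
client = FailoverClient(primary, secondary)
client.write("ledger", b"records")
primary.clear()                 # UniSuper-style wipe of subscription 1
print(client.read("ledger"))    # served transparently from the DR copy
```

The design choice worth noting is that failover is triggered by the client on a read failure, not by either provider, which is the independence Mendes argues for.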
Blocks & Files: Could two CSPs set up a mutual cross-cloud DR facility for customers such that AWS would have a GCP agreement for an AWS customer to have a DR facility in GCP and vice versa? My simple mind says this is theoretically possible but likely to be commercially impossible.
Ricardo Mendes: Technically speaking, it seems to me that there are already solutions on the market that would allow such a setup to be implemented. And yes, I think the immediate problem would be commercial – along with the message such a move by the CSPs would send to the market.
However, as I mentioned, I think a problem would remain: control being on the CSPs’ side in operation, configuration, and response to potential incidents.
In a way, taking it to the limit, this would be a new service offering a sort of consortium geo-replication. There would certainly be greater segregation between systems, but to what extent is this very different from a single CSP’s geo-replication, considering that rather deep integration between the two CSPs would be needed?
Would it not be a new abstraction of a single CSP that could then fail like a single CSP? I always return to the initial point when these questions arise: take advantage of the resources of the multi-cloud world, but independence and incident response should be ensured by the client, who should retain maximum control over these mechanisms and processes.