Sometimes Acts of God are bizarre, and that is what happened recently when a storm struck Saint-Ghislain in Belgium. Lightning struck the same place not once but four times near a data center owned by Google. Following the storm, Google experienced heavy read/write errors between 13 and 17 August 2015, and a small data loss, at its Google Compute Engine data center in the europe-west1-b zone. What was it they said? Lightning does not strike twice?
It is quite surprising, though, and the exact cause of the data loss is unclear, as data centers are usually protected from such events. The strikes affected the persistent disk storage equipment, and it is unclear whether a power failure alone could lead to such an event. Emergency power switched on automatically as planned, but the battery backup for the disk systems did not work as expected.
The summary of the incident, posted as Google Compute Engine Incident #15056 on the Google Cloud Status page, also states:
From the start of the incident, the number of affected disks progressively declined as Google engineers carried out data recovery operations. By Monday 17 August, only a very small number of disks remained affected, totaling less than 0.000001% of the space of allocated persistent disks in europe-west1-b. In these cases, full recovery is not possible.
A loss of 0.000001% is not a meaningful figure for customers who were performing frequent reads and writes at the time. A more useful figure would have been the amount of data lost in bytes and the number or percentage of writes lost.
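To see why the percentage alone says little, the arithmetic can be sketched as follows. This is only an illustration: Google did not publish the total allocated persistent disk capacity in europe-west1-b, so the capacities below are hypothetical assumptions, and `bytes_affected` is a helper name invented for this example.

```python
# Illustrative only: convert a percentage of allocated capacity into bytes.
# The capacities are hypothetical assumptions, not figures published by Google.

def bytes_affected(allocated_bytes: int, percent_affected: float) -> float:
    """Return the number of bytes represented by a percentage of allocated space."""
    return allocated_bytes * percent_affected / 100.0

# Upper bound quoted in the incident summary ("less than 0.000001%").
PERCENT_AFFECTED = 0.000001

# Hypothetical zone capacities in decimal units (1 PB = 10**15 bytes).
HYPOTHETICAL_CAPACITIES = {
    "1 PB": 10**15,
    "10 PB": 10**16,
    "100 PB": 10**17,
}

for label, capacity in HYPOTHETICAL_CAPACITIES.items():
    upper_bound = bytes_affected(capacity, PERCENT_AFFECTED)
    print(f"{label} allocated: under {upper_bound / 10**6:,.0f} MB affected")
```

Under these assumed capacities the same percentage corresponds to an upper bound of roughly 10 MB, 100 MB or 1 GB, which is why an absolute figure would tell customers more than the percentage does.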
Google explains the root cause of the incident in a few words:
At 09:19 PDT on Thursday 13 August 2015, four successive lightning strikes on the local utilities grid that powers our European datacenter caused a brief loss of power to storage systems which host disk capacity for GCE instances in the europe-west1-b zone. Although automatic auxiliary systems restored power quickly, and the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain. In almost all cases the data was successfully committed to stable storage, although manual intervention was required in order to restore the systems to their normal serving state. However, in a very few cases, recent writes were unrecoverable, leading to permanent data loss on the Persistent Disk.
This suggests that the lightning struck the utility power substations and telecommunication lines outside the data center, and that the resulting voltage surges on the electrical lines affected equipment inside the building. That can happen if voltage surges are not adequately controlled within the data center. Data centers are, however, normally protected against voltage surges, with surge protection built into the power network. It is possible that repeated surges or some unexpected electrical phenomenon caused an equipment failure, which in turn caused the temporary power failure.
While Google accepts full responsibility and is transitioning the storage hardware at fault to make it fail-safe, it still warns its customers that GCE instances and persistent disks reside in a single data center and are therefore vulnerable to data-center-scale disasters. Worried users would do well to use GCE snapshots and Google Cloud Storage as geographically replicated repositories of data.
The entire episode raises an old question once again: is our data safe, available, reliable and secure with cloud service providers? While cloud service providers do commit to high levels of all four, along with the flexibility for users to work remotely, there are practical concerns:
- Cloud outages: Incidents such as lightning causing an outage at a Google data center are not rare. In June 2012, a powerful storm knocked out an entire Amazon data center hosting Amazon Web Services, leaving many large AWS customers, including Netflix, Instagram and Pinterest, non-operational for more than 6 hours.
- Data Loss: Data loss due to a data breach can have huge implications, as shown by a recent report from the Health Information Trust Alliance (HITRUST). The report counts 495 breaches affecting 21.2 million records at a cost of $4.1 billion. These figures are staggering in themselves.
- Cloud and Security: As cloud computing becomes ubiquitous, malicious attacks on cloud service networks are likely to increase, and therefore no user’s data can be considered completely secure.
While there have been data losses and breaches in specific cases, the good news is that data has remained safe and secure in most cases where cloud service vendors are involved. We have not witnessed large-scale hacking or breaches at the cloud service providers, especially the Big Four. This gives us hope that cloud architecture is designed in such a way that data is segregated, networks are well designed and appropriate security technologies are in place.
However, we cannot be complacent. As more companies move to the cloud, more advanced threats could follow. The enterprise risks of cloud computing, both internal and external, therefore need to be properly recorded, monitored and mitigated, and appropriate strategies need to be developed to improve the levels of availability, reliability, safety and security that cloud service providers offer at their data centers.