Google Cloud Accidentally Deletes UniSuper's Online Account Due To 'Unprecedented Misconfiguration' (theguardian.com) 52
A "one-of-a-kind" Google Cloud "misconfiguration" resulted in the deletion of UniSuper's account last week, disrupting the financial services provider's more than half a million members. "Services began being restored for UniSuper customers on Thursday, more than a week after the system went offline," reports The Guardian. "Investment account balances would reflect last week's figures and UniSuper said those would be updated as quickly as possible." From the report: The UniSuper CEO, Peter Chun, wrote to the fund's 620,000 members on Wednesday night, explaining the outage was not the result of a cyber-attack, and no personal data had been exposed as a result of the outage. Chun pinpointed Google's cloud service as the issue. In an extraordinary joint statement from Chun and the global CEO for Google Cloud, Thomas Kurian, the pair apologized to members for the outage, and said it had been "extremely frustrating and disappointing." They said the outage was caused by a misconfiguration that resulted in UniSuper's cloud account being deleted, something that had never happened to Google Cloud before.
UniSuper normally has duplication in place across two geographies, so that if one service goes down or is lost it can be easily restored; but because the fund's entire cloud subscription was deleted, the deletion hit both geographies. UniSuper was eventually able to restore services because the fund also had backups in place with another provider. "Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper's Private Cloud services ultimately resulted in the deletion of UniSuper's Private Cloud subscription," the pair said. "This is an isolated, 'one-of-a-kind occurrence' that has never before occurred with any of Google Cloud's clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again."
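The restore path that saved UniSuper is worth spelling out: because the backups lived with a second, independent provider, recovery did not depend on Google at all. A minimal sketch of that pattern, assuming an S3-compatible object store and placeholder endpoint, bucket and prefix names (not UniSuper's actual setup):

```python
# Sketch: find and pull the newest backup held at a second provider.
# Endpoint, bucket and prefix are hypothetical placeholders.
import boto3

def latest_backup_key(s3, bucket: str, prefix: str) -> str:
    """Return the key of the most recently modified backup object."""
    newest = None
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest["LastModified"]:
                newest = obj
    if newest is None:
        raise RuntimeError(f"no backups found under {prefix}")
    return newest["Key"]

# Secondary provider: any S3-compatible endpoint works here.
s3 = boto3.client("s3", endpoint_url="https://backup.example-provider.com")
key = latest_backup_key(s3, bucket="offsite-backups", prefix="nightly/")
s3.download_file("offsite-backups", key, "/restore/latest-backup.tar.gz")
```

The point of the sketch is only that the recovery credentials, endpoint and data sit entirely outside the primary cloud account, so deleting that account cannot take the backups with it.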
Re: (Score:3)
Think it'll be a scapegoat?
It seems so unlikely that 'a misconfiguration' could be the cause, and somewhat likely that a rogue/malicious/fooled employee did this, that we should assume the latter is the more probable explanation until evidence is provided to the contrary.
At least they didn't get the whole Parler treatment.
Re: (Score:2)
or maybe their in-house API for environment provisioning can do more than they thought it could....
Re: (Score:2)
Maybe this is to do with the changes to introduce the data caps. Maybe deleting stuff is now a little bit too automatic. Misconfiguration might mean the system is running as designed, it just triggered -- maybe it suggests there are very few slices of Swiss cheese now.
Re: Uh oh, (Score:1)
It's a misconfiguration in the sense that these accounts are supposed to be set up so that they don't just shut down if their payments fall behind.
In the end this could happen to anyone for any reason. Whether it is your bank screwing up or someone pushing the wrong button. If you are a cloud customer, you should have 2 backups (1 cloud, 1 on-prem), 1 active-active with another provider and a third party that keeps your GitOps configurations and secrets.
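For what that advice looks like in practice, here is a minimal sketch of a nightly job pushing the same archive to two independent S3-compatible targets (one on-prem, one at a second cloud), with a checksum stored alongside so a restore drill can verify it. All endpoints, buckets and filenames are placeholders, not a real setup:

```python
# Sketch: push one nightly archive to two independent backup targets.
import hashlib
import boto3

def sha256(path: str) -> str:
    """Checksum the archive so restores can be verified end to end."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def push(endpoint: str, bucket: str, archive: str, digest: str) -> None:
    s3 = boto3.client("s3", endpoint_url=endpoint)
    # Store the checksum as object metadata for later verification.
    s3.upload_file(archive, bucket, archive,
                   ExtraArgs={"Metadata": {"sha256": digest}})

archive = "backup-2024-05-09.tar.gz"
digest = sha256(archive)
for endpoint, bucket in [
    ("https://minio.internal.example:9000", "onprem-backups"),   # on-prem copy
    ("https://s3.second-cloud.example.com", "offsite-backups"),  # second cloud
]:
    push(endpoint, bucket, archive, digest)
```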
Re: (Score:2)
It seems like there were actually two critical failures: (1) the account was deleted, and (2) it took more than a week for them to get around to fixing it.
Re:Uh oh, (Score:5, Informative)
Google didn't fix it. Customer restored from a backup done to another cloud provider.
Dan
Re: Uh oh, (Score:1)
Re: Uh oh, (Score:2)
In some countries financial institutions are expected to keep secondary third party backups.
Re: (Score:3)
"Customer restored from a backup done to another cloud provider."
HOLY SHIT someone actually did some form of best practices in a financial institution? I'm almost having a heart attack from the sheer shock, here.
Shame they weren't making nightly backups to more than one location, but at least they're only a week out.
Not good (Score:3, Interesting)
In Azure, you get a dozen reminders over several weeks when deleting a subscription before it's actually deleted.
Dan
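For comparison, the general pattern behind those reminders is simple enough to sketch: a deletion request only records a future purge date plus reminder dates, and nothing is destroyed until the grace period has elapsed. This is not Azure's actual mechanism, just an illustration of the idea:

```python
# Sketch: deletion requests schedule a purge, they don't perform one.
from datetime import date, timedelta

GRACE_DAYS = 30
REMINDER_OFFSETS = (1, 7, 14, 21, 28)  # days after the request

def schedule_deletion(requested_on: date) -> dict:
    """Record when to nag the customer and when purging becomes allowed."""
    return {
        "purge_on": requested_on + timedelta(days=GRACE_DAYS),
        "remind_on": [requested_on + timedelta(days=d) for d in REMINDER_OFFSETS],
    }

def may_purge(schedule: dict, today: date) -> bool:
    # Hard deletion is refused until the grace period is over.
    return today >= schedule["purge_on"]

plan = schedule_deletion(date(2024, 5, 1))
assert may_purge(plan, date(2024, 5, 15)) is False
assert may_purge(plan, date(2024, 6, 1)) is True
```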
Re:Not good (Score:5, Funny)
This should not have happened.
But it did.
Google Cloud has ... taken measures to ensure this does not happen again.
But it might.
Re: (Score:2)
When it happens again...
but it's extremely rare.
Re: (Score:2)
And I'm so sorry my friend
I didn't know the gun was loaded
And I'll never, never do it again
Re: Not good (Score:2)
Re: (Score:2)
Doing business with a company so big, they can survive your failure.
Re: (Score:2)
Wanna bet Google's statement on this begins "At Google, we take data safety very seriously".
Or at least it's mandatory to begin every security breach response with "we take security very seriously".
Re: (Score:2)
On the other hand, Azure security is so badly broken it is a complete joke (read the report on the 2023 breach...). Pick your poison. All clouds suck.
Cloud is for ass-covering (Score:4, Insightful)
Re:Cloud is for ass-covering (Score:4, Interesting)
" blame anyone but the idiot that pushed the services into The Cloud"
Remember THAT idiot works for your company.
Would you REALLY want them running your internal infrastructure? Do you think that's going to end any better?
Re: (Score:2)
Got an outage? Blame Google, blame Microsoft, blame anyone but the idiot that pushed the services into The Cloud. That's why management loves Cloud. Idiots.
You think your in-house sys-admin can't screw up? Cloud providers have more servers, more data centres, more admins, etc, etc. By going to the cloud you reduce your probability of downtime. And if you're smart you backup locally and to another cloud (just like they did).
Now, with all that complexity, if one of those big Cloud providers ever has a system-wide outage? That's a big systemic risk to the Internet, possibly the cause of another financial meltdown.
But for an individual organization? Cloud all the w
Re: (Score:2)
You think your in-house sys-admin can't screw up?
Sure, but when he does, he can typically fix it. He doesn't have to spend hours and hours on the phone or trying to slog through some shitty "chat with an Indian" support tool to try to get answers or get things fixed. It's just your in-house sys-admin and full control over the hardware/software.
If you have the training and resources.
And if you're a big customer I'm pretty sure you get very good support very quickly.
I'm one of those "in-house sysadmins". In 20 years we've had two major outages. One was caused by a low-level piece of equipment that could have very easily been two Linux boxes (for 1/3rd of the cost of the service the business purchased). It was completely out of my control and was mandated by someone higher up. When it failed, it failed spectacularly. While the professional company was busy trying to assign blame, get the hardware back, and ship out a new config "over night" (Friday night means Monday delivery), I worked half way through the night replacing it with Linux. Back online before the professional (and very expensive) service was even in-transit.
And if a fire took out your main data centre?
Stupid point-and-click admins *love* turfing their responsibilities to 3rd-parties. It allows them to command Windows Admin Wages while having dick for responsibility. Oh, sorry...we're still working on it. We've got Microsoft on the phone, we've tried rebooting because that's what every Microsoft forum post tells you to do, and sfc /scanno
Re: (Score:2)
If you don't have the knowledge, you shouldn't be doing the job.
And yeah, if you're a big customer they'll pay attention to you. But how many businesses out there aren't "big customers"?
We currently have *one* client that spends about $50k/mo with us. The next largest client spends about $10k/mo, and the rest of the clients are all under $1,500/mo.
Guess who gets the priority service at my company? It's the $50k/mo client. No surprise.
If you're big enough to afford a strong IT department, I suspect you should be able to get good cloud support; the big cloud providers have enough scale that any decent-sized customer should get good support.
As for the knowledge, even the best IT person is going to have limits and fires they don't know how to put out at a moment's notice.
Their main office hosts everything. If fire takes that out...well...the data is synced every 15-30 minutes to several off-site locations in another state. We test fail-over several times per year. Their main office burning to the ground means they are back up and running in about an hour--which is a completely acceptable amount of time for the business.
Sounds like a good setup. But that means you need multiple locations and qualified personnel at each of those locations, how many outfits don't have those resources?
That's a
Re: (Score:2)
The buck stops with whoever is getting paid to do something. If a paid Google service fails then it's absolutely appropriate to put most of the blame on Google.
However, there's some person in the 'CIO' or similar decision-making position who should bear some responsibility for having separate backup facilities and a disaster plan.
Foghorn Leghorn on backups (Score:5, Funny)
"UniSuper was able to eventually restore services because the fund had backups in place with another provider."
"Fortunately, I keep my feathers numbered for just such an emergency."
https://www.youtube.com/watch?... [youtube.com]
One of a kind? Or a systematic problem? (Score:5, Insightful)
Many recent Boeing incidents were one-of-a-kind too; has any other plane had its windows ripped out in mid-air before or after?
The fact that a business account could be so simply deleted (and require a restore from the customer's own backup *elsewhere*!) already points to how little thought went into it when Google designed the system.
Did no one in Google ask the simple question, "Gee, this delete-account function seems quite powerful; what happens if it's triggered accidentally?" Apparently not, otherwise it would have taken Google one click to restore the account, because the data would only have been mark-deleted and kept for 90 days before being actually wiped out.
Or we could guess how many levels of human approval this action required (one? none?), or how this function lacked a check against the billing system to safeguard paying accounts from being deleted, etc, etc. That this could happen at all indicates a systematic problem with how Google views its customers when designing systems.
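To make those questions concrete, here is a rough sketch of the kind of guardrails being asked about, with invented names throughout (this is not how Google Cloud actually works): refuse to delete anything with active billing, require more than one human approval, and even then only mark-delete with a 90-day retention window so an accidental trigger is a one-click restore rather than a data loss:

```python
# Sketch of hypothetical account-deletion guardrails; all names invented.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

RETENTION = timedelta(days=90)
REQUIRED_APPROVALS = 2

@dataclass
class Account:
    name: str
    has_active_billing: bool
    approvals: set = field(default_factory=set)
    deleted_at: Optional[datetime] = None

def request_delete(acct: Account, operator: str, now: datetime) -> None:
    if acct.has_active_billing:
        raise PermissionError("refusing to delete an account with active billing")
    acct.approvals.add(operator)
    if len(acct.approvals) < REQUIRED_APPROVALS:
        return  # approval recorded; nothing is deleted yet
    acct.deleted_at = now  # soft delete only; data kept for RETENTION

def restore(acct: Account, now: datetime) -> None:
    if acct.deleted_at and now - acct.deleted_at < RETENTION:
        acct.deleted_at = None
        acct.approvals.clear()
    else:
        raise RuntimeError("retention window elapsed; restore from backups")
```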
Re: (Score:3)
has any other plane had its windows ripped out in mid-air before or after?
Yes. Maintenance issues affect all sorts of aircraft. Window blowouts are very rare, but they happen.
British Airways pilot was sucked out of an airplane mid-flight — and lived [businessinsider.com]
Oregon man nearly got sucked out of a busted passenger-plane window. Now he’s a pilot. [oregonlive.com]
Sichuan Airlines co-pilot was pulled back inside by crew after right windshield blew out at 32,000 feet [theguardian.com]
Re: (Score:2)
In this case it was a manufacturing issue and a door plug, not a maintenance issue and a window. And it was indeed a first.
Re:One of a kind? Or a systematic problem? (Score:4)
Your points are excellent, and they bring to mind a situation a bit different but analogous. EMRs, electronic medical records in hospitals, were mandated during the Obama years, accelerating adoption that was already going on. The few big companies who make these crappy systems also make a lot of money selling the service to the hospitals, many or most of which by now have been co-opted by corporations and private equity. The medical record, once a bastion of proper and quality care, has now become just a billing and money-capture tool for hospital administration. The EMR has its benefits, but as implemented and used, it has had a profound effect in degrading quality of care in the hospitals.
The companies making EMR's such as Cerner-pos and Epic-pos are now big companies with lots of money, lots of employees, and nominally good in-house expertise on computers and networking systems - that is after all what the EMR is, just a big network capturing data. So, you might think that EMR's would work well from a basic technical point of view, regardless how the hospital admins abuse it.
But, no.
We are subject once every month or two to an announcement that the EMR will have scheduled "downtime" for maintenance and upgrades. Paper charts were never taken offline, never a problem, but the 8-hour EMR lapses are disruptive to care. I am pretty sure that Google, Amazon, eBay, the airlines, and a zillion other big companies know how to do maintenance and upgrades without taking their service offline. One would think that the EMR vendors would know how to manage that, but they don't, or maybe they do but they don't care. Their customer base is "small" by Amazon or Google standards, just a few hundred or thousand clients, the services have no social media presence for public complaints, and the nurses, doctors, and patients don't count because the EMR is mainly a money-capture tool for the hospitals, which don't care if clinical operations are bumpy as long as billing operations are smooth. So, no one complains except the clinical staff, and that no longer counts. Maintaining uptime during a system upgrade should be basic and easy. The problem is simply disregard for the end user, for the social responsibility of delivering a quality product, and for the moral-ethical issues at the center of medical care products and services.
Furthermore, there have been reports lately of hospitals hit by ransomware, with entire patient databases going down. How can the hospital admins and the crappy EMR companies operate without proper data backup, akin to the story reported here? The hospitals, at least the ones I work at, do not seem to have any "customer's own backup" or third-party data preservation as a contingency.
One thing I have gleaned from following Slashdot is a feeling among IT pros that corporate admins often have little or zero sense about computer and network technology, so they are apt to make foolish or short-sighted choices. Hiring a vendor for a service like an EMR and not expecting robust 100% uptime or secure backups seems to be a typical hospital-admin bone-headed thing to do, but this would be moot if the companies delivered a properly designed and secure service, which they do not.
The crapification of such services and business is a cancer which has rotted our society in the past 30-40 years. The fact that these problems happen, in the hospitals-emrs and with the parties in the posted story, are just two examples among presumably countless others. Your comments hit the nail on the head of questions that seem so obvious, and the issues are presumably not unknown to the gurus-pundits-"experts"-assholes in the corporate echelons. Honest mistakes can happen, but for companies with the self-aggrandizing arrogance of Google, arrogance born from the fact that they actually do have top notch in house expertise, the lapses that you enumerate should never have happened. The situation reported got resolved with a happy ending, but it doesn't inspire confidence that it won't happen again.
Re: (Score:2)
Thanks for your post, that's a really great read, and I think that alone is the crux of the problem, and we should all be spending our time thinking about it.
It reminds me of Systemantics, which was written by a doctor.
"the fundamental problem does not lie in any particular System but rather in Systems As Such (Das System an und fuer sich)"
-- SYSTEMANTICS. THE SYSTEMS BIBLE by John Gall
Personally I think that we as humans have built systems which are too complex for us to understand at the moment.
The way t
Re: (Score:2)
The fundamental problem is people love obeying orders from someone else who's then responsible for the bad decisions.
Software Engineers, Programmers, Computing Scientists, etc. form the most powerful middle-class profession ever in the History of mankind, bar none. They have access to the most powerful devices and machinery in existence, access to the most complex tools ever devised by human ingenuity, access to the vastest collection of knowledge ever assembled, the knowledge and the means to put it all to
Re: (Score:2)
So where does OpenEMR [open-emr.org] or OpenMRS [openmrs.org] fit into this if EMR in general is a failure?
Re: (Score:2)
Good point - but -
(I don't have experience with OpenMRS).
OpenEMR is practice management software rather than hospital style EMR.
It came on the scene as I recall about 15 years ago, when doctors, clinics, and hospitals still mostly used paper charts but were transitioning to electronic records. Computerized practice management software has been around much longer, and OpenEMR is on that cusp in time when computerized offices and electronic records were starting to blend. It was only around 2014 when emr's
Re: One of a kind? Or a systematic problem? (Score:1)
EPIC was a big company before Obama. I applied to a job there in the early 2000s, it was all IIS and SQL Server and they were writing their platform targeting .NET1 and writing their own custom XML parsers. In short, it was a hellscape for developers and I was glad to turn down their offer.
They bragged about co-writing the EU regulation at that point and I am fairly sure they co-wrote the Obama regulation to the point every provider that was on smaller, homegrown or different software simply had to conform
Re: (Score:2)
the outage was caused by a misconfiguration that resulted in UniSuper's cloud account being deleted, something that had never happened to Google Cloud before.
The "something" that never happened before is deleting of this specific account (therefore is one-of-the-kind). They never said no other accounts have ever been deleted the same way, only that this specific UniSuper's account has never been deleted before.
Re: One of a kind? Or a systematic problem? (Score:2)
terraform apply (Score:2)
oh shit
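One hedged way to keep that moment from becoming this story: render the plan as JSON and refuse to apply if anything would be destroyed (Terraform's prevent_destroy lifecycle flag is the in-tree alternative). The wrapper below is illustrative, not a drop-in pipeline step:

```python
# Sketch: refuse to apply a Terraform plan that destroys anything.
import json
import subprocess
import sys

subprocess.run(["terraform", "plan", "-out=tfplan"], check=True)
show = subprocess.run(["terraform", "show", "-json", "tfplan"],
                      check=True, capture_output=True, text=True)
plan = json.loads(show.stdout)

# Collect every resource whose planned actions include a delete.
doomed = [rc["address"]
          for rc in plan.get("resource_changes", [])
          if "delete" in rc["change"]["actions"]]

if doomed:
    print("Refusing to apply; these resources would be destroyed:")
    print("\n".join(doomed))
    sys.exit(1)

subprocess.run(["terraform", "apply", "tfplan"], check=True)
```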
That does not sound good... (Score:3)
For something like that to happen, you usually need three or more mistakes by different people. Either Google has inadequate safeguards in place (very, very bad), or they have people incompetent enough that they all made a mistake here (very, very bad), or both (worse).
This looks like tech-rot to me, where "managers" try to make things cheaper until they are done cheaper than possible and then crap like this happens.
actually this ain't news, it's a pattern (Score:1)
Re: (Score:2)
Out of interest, what do you recommend people move to?
Data on the cloud is not yours (Score:2)
Good on their IT staff for not completely trusting Google and having alternate backups. Bad on their IT staff for not hosting their own data. The cloud is not your computer and the data stored on it IS NOT YOURS. Give your IT staff the resources to host the data locally and avoid this kind of crap in the future. Companies are not going to learn this simple lesson until these cloud providers really screw something up. Google really screwed up in this case only to be saved by the company's IT staff foreshadow
Google is not reliable (Score:1)
I really just laugh at this point (Score:2)
We all know having your business on the public cloud is silly, right? By the time you go do all the constant cost-saving auditing, multi-AZ, multi-cloud DR and such, does it really make sense at any real scale? I mean once you've hired even one dedicated AWS/GCP/Azure person, you've gotta be in too deep.
Within five years, "cloud repatriation" will be as hot a resume item as "cloud migration" was five years ago, and everyone who wants to keep their jobs will pretend nothing hilarious happened.