Data Storage

The Sidekick Failure and Cloud Culpability

Posted by CmdrTaco
from the lowered-expectations dept.
miller60 writes "There's a vigorous debate among cloud pundits about whether the apparent loss of all Sidekick users' data is a reflection on the trustworthiness of cloud computing or simply another cautionary tale about poor backup practices. InformationWeek calls the incident 'a code red cloud disaster.' But some cloud technologists insist data center failures are not cloud failures. Is this distinction meaningful? Or does the cloud movement bear the burden of fuzzy definitions in assessing its shortcomings as well as its promise?"

  • Management (Score:4, Interesting)

    by FredFredrickson (1177871) * on Monday October 12, 2009 @09:31AM (#29718647) Homepage Journal
    It's usually a decision on management's side to not use best practices, despite warnings from the tech dept.

    tldr; There's nothing wrong with the technology, just the greedy bastards using it.
  • by Anonymous Coward on Monday October 12, 2009 @09:34AM (#29718681)

    With Cloud Computing, those who modify FOSS software do not have to redistribute the code, because they are only providing a service and not a functional program.

    This is an unforeseen hole in the bulletproof Gandhi mechanism, so I foresee a quick "GPL V3.1" to close this. And then all is well.

  • Re:Management (Score:5, Interesting)

    by Splab (574204) on Monday October 12, 2009 @09:42AM (#29718799)

    Well, there is one difference. Cloud computing and virtual servers are to computers what keychains are to keys: they enable you to lose everything at once.

    Yes, it is highly convenient and more effective to have everything in one place, but so much more fun when you drop your "chain" in the sewer.

  • Need more info (Score:0, Interesting)

    by LS1 Brains (1054672) on Monday October 12, 2009 @09:47AM (#29718847)
    Can't form a complete picture without an inside look at what really happened. Danger/Microsoft obviously isn't going to just come out and tell us the who or how, they have enough egg on their face as it is.

    We can throw hunches around all day long, but it all boils down to human error somewhere - or more likely, a series of errors. Perhaps backups weren't properly taken. Perhaps they were performing a platform shift to .NET and something went awry. Perhaps a dev was tapping out a query and forgot part of his where clause, irreversibly damaging an entire table. Perhaps the cleaning crew poured milk in the disk cluster. These are all quite valid possibilities, which singly probably wouldn't be an issue.

    I don't think there's any argument for instability or reliability issues with a "cloud" platform, any more than one could form an argument for a traditional arrangement. If the system as a whole isn't managed and maintained, you are at a very high risk for disaster. The only universal truth is things WILL fail, and you have to plan for them.
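To make the "forgot part of his where clause" scenario above concrete, here is a minimal sketch of one common guard: run the statement inside a transaction and roll back if it touches more rows than expected. It uses Python's standard `sqlite3` module; the `contacts` table, the `guarded_delete` helper, and the `max_rows` threshold are hypothetical names for illustration, not anything known about Danger's systems.

```python
import sqlite3


def make_demo_db():
    """Build a small in-memory table (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts (id INTEGER PRIMARY KEY, owner TEXT, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO contacts (owner, name) VALUES (?, ?)",
        [("alice", "Mom"), ("alice", "Dad"), ("alice", "Work"), ("bob", "Home")],
    )
    conn.commit()
    return conn


def guarded_delete(conn, sql, params=(), max_rows=1):
    """Run a DELETE and commit only if it affected at most max_rows rows."""
    cur = conn.cursor()
    cur.execute(sql, params)  # sqlite3 opens a transaction implicitly here
    affected = cur.rowcount
    if affected > max_rows:
        conn.rollback()  # undo the statement instead of committing the damage
        raise RuntimeError(
            f"refusing to commit: {affected} rows affected, "
            f"expected at most {max_rows}"
        )
    conn.commit()
    return affected


conn = make_demo_db()
try:
    # The "AND id = ?" part of the WHERE clause was forgotten.
    guarded_delete(conn, "DELETE FROM contacts WHERE owner = ?", ("alice",))
except RuntimeError as err:
    print("blocked:", err)
print(conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0], "rows remain")
```

A guard like this only covers one of the failure modes listed above; it does nothing for a missing backup or a failed storage controller.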
  • by garcia (6573) on Monday October 12, 2009 @09:47AM (#29718853) Homepage

    Didn't that throw up any red flags for ANYONE?

    I was a Sidekick user from 4/2004 until 10/2008. There had been only one 'catastrophic' failure in that time that left Sidekick users without data service for an extended period. Danger produced one of the best mobile devices, which in many ways is still better than anything out there even though the OS and the devices that use it (the various Sidekick models that exist these days) are quite a bit outdated compared to devices like the iPhone.

    I miss my Sidekick immensely. I loved true multitasking, a fully capable QWERTY keyboard, and incredible battery life. Unfortunately it didn't sync well with calendaring software, didn't keep up with music playing, and is now partially controlled by Microsoft. There have been immense trade offs with moving to the iPhone but based on my main reason for owning an iPhone (I ride the bus and enjoy the music/video player and screen size) it was the right choice for me.

    That said, "cloud computing" is something which usually works (and did, in the case of the Sidekick since 2002). I don't think that this is a proven warning sign that "cloud computing" isn't as reliable as everyone believes, I just think it's proof that companies need to do a much better job of ensuring data integrity than they could have ever imagined before.

    Will I stop using Flickr, Google products, and other future "cloud" devices/software because of this? No. I am smart enough, as a computer savvy end-user, to keep my own backups of my data but I do believe people need to become better educated in what can and will happen as we move to the model we have slowly been adopting over the last 10 years.

  • by TheLoneGundam (615596) on Monday October 12, 2009 @09:53AM (#29718939) Journal

    Leaving aside the fact that a "data center" could consist of two servers under Mabel's desk, this is not a "data center" disaster, nor is it a cloud catastrophe.

    This is a contract and contract management failure: the contract with the outsourcer was probably written without specifying that they must do the backups, AND no one established any sort of audit (formal or informal) test to ensure that there _were_ backups being taken and that the outsourcer was performing according to the contract.

    Too often, the MBA doing the contract thinks "there, that's handled" once they've gotten all the signatures on the dotted line. "There, backups are handled now" he thinks, because many business folk (not ALL, I don't think it's fair to generalize that far) see these kinds of things as milestones, rather than ongoing processes to be managed.
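As a rough illustration of the "audit (formal or informal) test" this comment asks for, the sketch below checks three things about a backup file: that it exists, that it is recent, and that its contents match the source. Everything here is an assumption for illustration: the `audit_backup` name, the one-day freshness limit, and the idea that a backup is a single file comparable byte-for-byte with its source (a live database would need a snapshot or a test restore instead).

```python
import hashlib
import os
import time


def sha256_of(path):
    """Hash a file in chunks so large backups do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_backup(source, backup, max_age_seconds=86400):
    """Return a list of problems found; an empty list means the audit passed."""
    if not os.path.exists(backup):
        return ["backup file is missing"]
    problems = []
    age = time.time() - os.path.getmtime(backup)
    if age > max_age_seconds:  # hypothetical freshness limit: one day
        problems.append(f"backup is stale ({age / 3600:.1f} hours old)")
    if sha256_of(source) != sha256_of(backup):
        problems.append("backup content does not match source")
    return problems
```

Run on a schedule by someone other than the outsourcer, a check like this turns "backups are handled" from a signed milestone into the ongoing process the comment describes.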

  • by FlyingBishop (1293238) on Monday October 12, 2009 @09:59AM (#29719009)

    If someone tells you that they can cheaply prevent catastrophic failure, expect a catastrophic failure. Nothing can correct something like this, which involved an error propagating to the backups.

  • Re:Management (Score:2, Interesting)

    by Anonymous Coward on Monday October 12, 2009 @10:13AM (#29719159)

    Except the problem here is that when a large service goes down in the Cloud, millions of people can be affected.

    For example, what if Google has their way with universities integrating with their system (Docs, Gmail, etc.) and Google has the sort of problems this Sidekick failure does? Now not just one university (if they own their data center) has lost all of its hosted data, but any university relying on Google is out all the data hosted on the Cloud.

  • Definition of terms (Score:1, Interesting)

    by Anonymous Coward on Monday October 12, 2009 @10:20AM (#29719233)
    AC'ing due to rumor and innuendo, and completely unconfirmed insider info:

    There are 3 terms being thrown around interchangeably as "Cloud" here, so let's back up and make sure we're all talking about the same thing.

    1. Managed hosting - Traditional "the hosting provider owns the box and runs your code on it" outsourcing.
    2. Cloud hosting - A large cluster of virtual machines running on a platform that 100% abstracts hardware, such as vmotion, combined with per-minute billing and web-based provisioning. A marketing term coined by Amazon for their hosting service. This is insanely lucrative, by the way.
    3. Offsite storage managed for you by a service provider, typically built on resold managed hosting or cloud hosting.

    This is clearly a failure of cloud definition 3. So the question here is: Should you ever trust your data to a single outside provider? Of course not. Putting all your eggs in one basket has been a bad idea re: storage for as long as we've had computers. The first rule is always MAKE BACKUPS. You don't trust your disk, you don't trust your backup disk, you don't trust your live data, you don't trust someone else to back up your live data. The pitch for cloud has never been "We'll keep your data safe." It's been "We'll make your data available."

    I'm going to come down on the side of two bad practices: First, T-Mobile made it very, very difficult to get your personal data off of a SK. It was a conscious business decision, designed to keep the barrier to migration onto other platforms / carriers high enough that the average celebrity SK owner wouldn't bother. Second, scuttlebutt is that T-Mob/Danger/MS lost all of this data because they brought in an outside consultant to upgrade the microcode on a SAN controller, which went wrong, leading to a cascade failure.

    If true, this means that a national carrier with hundreds of thousands of users' worth of data, if not millions, did not have a DR site available. If all the information was on a single storage array, then they didn't even have segregated databases on physically independent storage hardware.

    That's a failure of architecture, a failure of engineering, and a failure of management. There are known best practices here when dealing with customer data, and a failure of this scale indicates that T-Mobile/Danger followed none of them. I simply can't think of a single reason as to why they're unable to restore from an offsite backup, unless those backups don't exist.

  • by postbigbang (761081) on Monday October 12, 2009 @10:23AM (#29719273)

    Trust? All that data's gone without much chance of it being recovered, as in bye-bye.

    Do you think that perhaps T-Mobile, or their "trusted partners" might have had a full backup (an IT 101 sort of plan), a mirror or highly available machine (an IT 201 sort of plan), a disaster plan (IT 301), or maybe just an encrypted torrent out there somewhere?

    No.

    Heads oughta roll. Cloud computing is only as good as you make it; it only represents a server outside of your office's NOC or physical boundaries. Nothing else is guaranteed. In this case, it was a service running on somebody else's host (and not properly done) and so it's not a matter of doubting the cloud, it's a matter of firing an incompetent vendor, then getting ready for the barrage of litigation and shame. Stupid stupid stupid. Put a bell on these guys' necks. I don't want them around me.

  • by rwa2 (4391) * on Monday October 12, 2009 @10:38AM (#29719477) Homepage Journal

    Well, I was hoping to just google cloud vs. grid vs. distributed vs. cluster vs. etc. computing, but there doesn't seem to be much official-sounding distinction out there. Which means if we start our own thread here it might become definitive!

    "cloud" computing: fluffy term used by people who really don't know anything other than that they run their applications from a web page and their data appears to be stored on the web because they can access it from more than one web browser.

    "hosted" / "server farm" computing: buying server resources from someone who has a real datacenter who tries to take care of your hardware. You access all of your data over the network "cloud". Redundancy & support varies based on pricing & services.

    "grid" / "utility" computing: computing infrastructure where you should be able to simply scale up CPU, data, etc. resources for your operation simply by throwing money at turning on more boxes. You don't necessarily need to share it with others, though.

    "cluster" computing: a computing system made up of more or less independent, generally homogeneous nodes, where problems can be partitioned out. Generally has some form of redundancy so you don't lose work when a single node dies, but probably won't survive a data center failure.

    "distributed" computing: special applications that can be farmed out to the net to break parts of computing or storage across a heterogeneous network of computers distributed over many locations. Ideally it's written to be highly redundant and tolerate faults such as nodes joining / leaving the cluster.

    As far as reliability goes, the TIA data center tiers seem to be the only common way of talking about maintaining "business continuity". I've read through it briefly, and can somewhat paraphrase the intent (mildly inaccurately, mostly because the standard itself is kinda loose and not defined in too much detail with regards to servers) as:

    Tier 1 "basic" : You have a room for servers with a door to keep random people from tripping over the plugs. Maybe you have a UPS on your server so it can do a graceful shutdown without data loss when the power or AC goes out.

    Tier 2 : You have your stuff in racks with a raised floor for air conditioning and some wire racks hanging from the ceiling for cable management.

    Tier 3 : You have redundant UPSes and RAIDs, CRACs, network links, and stuff, so you can make repairs when common things break without turning off the system (typically anything with moving parts or high currents, like power supplies, fans, disks, batteries needs to be hot-swappable). Which means you should also have some sort of monitoring and alert system so you know when that stuff actually fails so you can replace it before the redundant components also fail. This is intended to reach 24x7 availability with high uptimes, maybe 3-5 nines.

    Tier 4 : Like Tier 3, but certified for mission-critical / life-critical use, like in hospitals and maybe for airplanes and stuff. It should survive prolonged power outages (so you have a diesel generator with a day or two worth of fuel.)

    Unfortunately, it just covers build specs for individual data centers, so it doesn't really cover other business continuity things like maintaining offsite backups so you can somewhat easily rebuild from scratch if a natural disaster takes out one of your data centers or something. But it's kind of different worlds of IT between designing facilities and architecting "cloud" services, which unfortunately don't seem to communicate or collaborate as much as they should to reach the kinds of "distributed grid of redundant load-sharing data centers" configurations we'd expect.
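For a sense of what the "3-5 nines" mentioned under Tier 3 means in practice, the arithmetic can be sketched as follows. The function name is hypothetical, and the calculation assumes a 365-day year with availability measured over the whole year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of N nines (e.g. 3 -> 99.9%)."""
    unavailability = 10 ** (-nines)  # 3 nines -> 0.001
    return unavailability * MINUTES_PER_YEAR


for n in (3, 4, 5):
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Under these assumptions, three nines allows roughly 8.8 hours of downtime a year, four nines about 53 minutes, and five nines a little over 5 minutes. None of this says anything about whether the data survives the outage, which is the separate backup question raised elsewhere in the thread.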

  • by swschrad (312009) on Monday October 12, 2009 @10:48AM (#29719635) Homepage Journal

    not just stuffy history book stuff or national security, IMPHO it fully applies to "the cloud."

    if Microsoft can't even build a robust cloud environment, that experiment is done.

    "danger," indeed.

  • by Jezza (39441) on Monday October 12, 2009 @10:51AM (#29719685)

    Mod the parent up!

    There are two sides to this (at least). If you're moving your data "to the cloud" you'd expect that "the cloud" is one hell of a lot more reliable than you are. Let's face it, they should be - the economies of scale mean it's a lot cheaper for them to host your data and lots of others' data, than it is for you alone.

    But that isn't what's happened in this case, here Microsoft (!) haven't even covered the basics. This is stunning.

    So does this call into question "cloud computing" or just Microsoft's "cloud computing"? This is a difficult question to answer, without being able to see for yourself your cloud partner's infrastructure and procedures you can't really be sure... But would anyone make such a foolish mistake? Microsoft have proven that the answer is "yes, if it's Microsoft", the real question is should that be just: "yes"?

    I think most of us now want a more hybrid approach, "in the cloud" is nice, but I also want a "local copy".

    Then you have to think about the other kind of "loss" where others gain access to data they shouldn't see...

  • This is an unforeseen hole in the bulletproof Gandhi mechanism, so I foresee a quick "GPL V3.1" to close this. And then all is well.

    How is it a hole when people who don't redistribute code aren't required to redistribute the source that created it? If you maintain a local branch of my code and use it to process your data, more power to you. It'd be nice if you did give back your changes, but that wasn't the offer I made to you and I don't have any right to expect it of you. End-user licenses like the AGPL are dangerous hacks that'll get more bad press than they'll make up for with the minor community good they do.
