Open Source Facebook Hardware IT

Open Compute Project Comes Under Fire 86

judgecorp writes: The Open Compute Project, the Facebook-backed effort to create low-cost open source hardware for data centers, has come under fire for a slack testing regime. The criticism was first aired at The Register, where an anonymous test engineer described the project's testing as a "complete and total joke." The founding director of the project, Cole Crawford, has penned an open letter in reply. The issue seems to be that the testing for standard highly-reliable hardware used by telcos and the like is very thorough and expensive. Some want the OCP to use more rigorous testing to replicate that level of reliability. Crawford argues that web-scale data centers are designed to cope with hardware failures, and "Tier 1" reliability would be a waste of effort.
This discussion has been archived. No new comments can be posted.

  • by An Ominous Coward ( 13324 ) on Wednesday July 08, 2015 @04:39PM (#50072217)

    Probably Cisco trolling against a movement that's going to put them out of business.

    Sooner the better, I say.

  • by Anonymous Coward on Wednesday July 08, 2015 @04:41PM (#50072225)

    Some people just have to get a burr up their ass [arse] about everything.

    Wait, Register is still up? Do they still say 'boffin' every paragraph? I couldn't bear to click through.

  • Web-scale? Way to be tone-deaf there Mr. Crawford.

    Or maybe the ridicule heaped on users of that particular term is something indulged only by the neckbeard wannabes that haunt Slashdot. In which case, carry on!

  • by biojayc ( 856286 ) on Wednesday July 08, 2015 @04:48PM (#50072263)
    You don't need expensive hardware to run datacenters. You need cheap commodity hardware with smart software on top. Just ask Google or Facebook.
    • Yep. This thread is full of people pooh-poohing this idea and meanwhile it's the strategy used by the most successful corporations on the internet. Welcome to Slashdot!

      • by Anonymous Brave Guy ( 457657 ) on Wednesday July 08, 2015 @05:48PM (#50072581)

        I think the point is that so far it is only used by "the most successful corporations on the internet". In fact, you can probably count the number of organisations in the entire world that qualify on the fingers of one hand, though it will take a few more fingers to count how much money they have invested to reach this point.

        Unfortunately, as lovely and friendly as all the Software Defined X advances seem with their mantra of openness, almost no-one is actually building a "web-scale data centre" with a 24/7 staff dedicated to just swapping out broken hardware and effectively unlimited resources to devote to designing hardware architectures and building control software that can cope with frequent failures without losing significant amounts of real money. For normal organisations, even those with heavy IT requirements and 12 figure market caps, running your critical infrastructure on hardware that does have a serious level of testing and consequent robustness may still be advantageous.

        (Full disclosure: I sometimes work for clients in the networking industry, though whether an industry shift towards things like OCP would benefit or harm them would be open to debate so I think I'm still reasonably neutral here.)

        • Unfortunately, as lovely and friendly as all the Software Defined X advances seem with their mantra of openness, almost no-one is actually building a "web-scale data centre" with a 24/7 staff dedicated to just swapping out broken hardware and effectively unlimited resources to devote to designing hardware architectures and building control software that can cope with frequent failures without losing significant amounts of real money.

          I think that's because most customers don't want that, partly because they don't understand how they would use it yet — but also because there is the fundamental problem of paying a middleman. If you are depending on someone to build the cloud for you, you're going to have to accept that they're going to want to get paid for their trouble. And nobody likes to write checks, they like to cash 'em.

        • Isn't this the point of the cloud: don't buy/build/maintain your own, rent from us and save because we do it cheaper and better than you ever could on your own?

          I think by the time you reach a scale where you have 24/7/365.24 staffing adequate to handle the failures as they happen, you can take advantage of the higher failure rate / lower cost equipment. You don't need to be Google scale to do this.

          • by Anonymous Brave Guy ( 457657 ) on Wednesday July 08, 2015 @08:55PM (#50073335)

            Well, I have a few issues with the cloud hype, starting with the scarcity of evidence to support claims about cloud services being cheaper and/or more secure and/or more reliable than doing things yourself. Every major cloud provider has had serious downtime, and there is only so much you can attribute to being more visible at greater scale or to users not configuring HA tools properly. Far too many on-line services also run into significant security/privacy problems. And cost-wise going with the cloud rather than your own systems tends to be favourable at certain levels (other things being equal) but it can be outrageously expensive in other cases.

            These myths aren't really the point here anyway. The point in this case is that no matter how fast your recovery time may be, whatever was happening on your hardware at the time it failed is lost, and in some cases you simply can't make that transparent to your users. Not everything in the world of programming is a distributed map-reduce where losing a hardware node means you just redistribute the 0.0001% of the job it was doing to another and no-one notices. Not everything in the world of networking can tolerate a multi-second failover process without an observable blip in connectivity. As for redundant/HA storage, the CAP theorem called and asked to speak with you about your database, but I think you were on with physics at the time so I just took a message.

            It's not just about whether the wastage due to more frequent failures works out cheaper economically than paying a premium for better hardware. It's also about how much downtime you (or your customers) are willing to tolerate and what proportion of overall system time is spent just recovering from failures. If you've ever had the joy of watching the (N+1)-th drive fail in your RAID with N-way redundancy while it's still rebuilding from replacing the earlier failures, you'll know what I mean.
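The rebuild-window risk described above can be put into rough numbers. This is only a sketch under simplifying assumptions: drives fail independently at a constant rate, and the function name, parameters, and any figures passed to it are hypothetical, not taken from the thread.

```python
def p_failure_during_rebuild(surviving_drives: int,
                             annual_failure_rate: float,
                             rebuild_hours: float) -> float:
    """Chance that at least one surviving drive fails before the rebuild ends.

    Assumes independent drives with a constant failure rate. Real arrays
    often violate this: drives from one batch tend to wear out together.
    """
    hours_per_year = 24 * 365
    # Probability that a single drive survives the whole rebuild window.
    p_survive_one = (1.0 - annual_failure_rate) ** (rebuild_hours / hours_per_year)
    # Every surviving drive has to make it through for the rebuild to finish.
    return 1.0 - p_survive_one ** surviving_drives


# Illustrative inputs only: 8 surviving drives, 5% annual failure rate, 24 h rebuild.
print(p_failure_during_rebuild(8, 0.05, 24.0))
```

Longer rebuilds and wider arrays both push the figure up, which is the effect the comment is pointing at; correlated failures would make it worse than this model suggests.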

            • I've never had an N+1 drive fail in a RAID setup. What I have had happen is the power supply to the whole array fail... then we can talk about redundant power supplies, but, really, the data needs to be mirrored offsite at a place where a serious (fire / flood / riot / meteor strike / whatever) event doesn't take down all copies of the data / service. This was sort of the founding principle of ARPANET, anyway.

              Economics varies, people negotiate bad contracts all the time that lead to higher costs of whatev

            • by Chirs ( 87576 ) on Wednesday July 08, 2015 @10:25PM (#50073603)

              I don't think I'd ever go to the cloud because it's cheaper or more secure or more reliable. The main benefit that I see is flexibility.

              If your loads are stable and known in advance, it's likely cheaper to buy hardware and staff people to take care of it. On the other hand if loads spike wildly from one day to the next the cloud makes perfect sense. Need a thousand cores of compute power right this second? Amazon/Google/Rackspace/HP would be happy to rent it to you.

    • For *some* datacentre tasks you can use cheap, commodity hardware. For others, you need expensive, certified, bullet-proof hardware.
      • by HiThere ( 15173 )

        There is no such thing as "bullet-proof hardware" except in the sense that some of it would stop a .45 bullet.

        Cheaply built hardware fails more often, but *ALL* hardware fails, and you need to plan for it. Ever hear of "RAID"? That's the way all (almost all?) hard disks are built these days. But they still fail. They used to fail more frequently. ("RAID" == "Redundant Array of Inexpensive Disks").

        • by mysidia ( 191772 )

          There is no such thing as "bullet-proof hardware"

          Uh no... there definitely is. There's no X86 based system that really falls into this category though. Many mainframe systems are bulletproof, in the sense the mainframe won't fail or crash, or lose work, or corrupt data, upon any component failures. Tandem computers' systems and some other past solutions on the market were pretty darned bullet proof.

          That didn't mean no components failed -- only that when components died - CPUs and system bus inc

        • I worked on a telecom switch that ran processing on cards that had two CPUs in lockstep. If the output of the two ever differed the card was taken out of service and its last transaction was rolled back. Memory contents were stored in at least three places at any given time. The dataplane was inductively coupled to avoid the possibility of DC current damaging things.

          We replaced it with commodity hardware and smarter software. It wasn't *quite* as reliable, but it was a whole lot cheaper and the speeds r

    • Re: (Score:3, Informative)

      by Anonymous Coward
      While I was working at Amazon we were told to expect hardware failures and to build our software around them. I have a couple of friends doing hardware testing for AWS and all of their hardware is of extremely low quality and has major visible issues such as bowing, flimsy connectors, and little to no hardware redundancy in the hardware itself (no dual power supplies or hot-swappable anything). This really isn't a surprise at all; it's just where the industry is going.
    • by mbkennel ( 97636 )
      The problem is when managers want to replicate this with cheap commodity developers and cheap commodity IT support on top of unreliable hardware infrastructure instead of the expensive, and rare, high-end personnel and internal resources that Google and Facebook have.

      Since most companies won't be able to hire the top 1% of those people, might it be more worthwhile to buy more reliable and expensive hardware?
      • by mysidia ( 191772 )

        instead of the expensive, and rare, high-end personnel and internal resources that Google and Facebook have.

        Then they are destined to fail, if they are unwilling to invest in suitably skilled personnel AND high enough quality development for the chosen architecture to implement their intended plan.

        might it be more worthwhile to buy more reliable and expensive hardware?

        Paying up to keep the more qualified personnel on staff can have other benefits. I think the competition for good people is much less

    • by mysidia ( 191772 )

      You need cheap commodity hardware with smart software on top. Just ask Google or Facebook.

      The software used by the rest of us (e.g. MySQL) isn't that smart, and it's very expensive to get software that is that smart --- it potentially requires hundreds of thousands of ops engineer and developer man-hours to build that software system.

      There are open source products that can be that smart, with enough deployment work. Developing smart custom applications is a bear.

      It may very well be cheaper in many cases f

  • by digsbo ( 1292334 ) on Wednesday July 08, 2015 @04:48PM (#50072267)
    But testing well is really, really hard. And expensive, especially for data center scenarios. If you haven't put it in an oven and observed the effects, it's not tested for telco data centers.
    • by GerryGilmore ( 663905 ) on Wednesday July 08, 2015 @05:02PM (#50072351)
      And there is the rub. NEBS testing (telco-level) is horrifically expensive and - for DC applications - totally unnecessary. NEBS servers have to withstand that because they are often the *only* server performing a certain function in the CO. Not anywhere near the same use-case.
      • by digsbo ( 1292334 )

        Agreed, but still, even in a non-NEBS scenario, there's still a lot to be tested because you're putting something potentially flammable in someone's data center. It's really easy to think of designing so a server failure doesn't bring a cluster down, but a server failure that results in a fire has the potential to do more.

        The one time I had a fire in a test lab, it really scared me, and made me realize as rare as that kind of thing is, it's potentially disastrous. And that's why they test for it.

        • toxic material is an important consideration.
          but NEBS-testing servers for a data center is ridiculous!

          Major manufacturers (HP, IBM, Sun, etc.) only test one or two hardware chassis for NEBS:
          one basic 2U server & the next size up multi-processor.

          NEBS servers are designed to be utility server in a telco switch site.
          The power is DC and the site has a big bank of batteries to power the site during outages.
          A telco is aiming for NO outages and is very hardware focused.

          Anyone else's datacenter is

      • Telco switches are ghost towns... big empty buildings out in the boonies that used to hold massive racks of relays with a little box in the middle that replaces all that, or tiny shacks built after the tech came up to speed that just holds the little box. They aren't manned, they are critical, and they need to have reliability due to their geographic dispersal.

        Datacenters are, eponymously, centralized. Keep a staff of 4-5 guys on-hand at all-times, give them a PC gaming center to play epic COD on when thi

    • by Anonymous Coward

      Financially, hardware is tax-depreciated over three years anyway. Lately, hardware is a little slow on Moore's Law but power efficiency/computing performance has been about the same pace... If you're at the top end you're losing money not replacing fairly often. What happens after isn't their problem. There's no purpose in testing something to last in the desert for ten years because the vast majority of hardware is "disposable". If you want to complain about the waste, push for more recyclable materials, and of

      • by KGIII ( 973947 )

        and of course boards that use fewer parts they don't need...

        I now have a picture in my head of a guy, his name is Ralph, sitting there, drilling holes, and soldering on random extra bits like capacitors, diodes, a spare bios chip bracket, and a USB port. I know what you meant but, really, that is how my brain works.

  • "web-scale data centers are designed to cope with hardware failures". So.... it's OK if you use my motherboard design and they randomly fail, because you should just make up for that in software or hardware redundancy? Um, no.
    • "web-scale data centers are designed to cope with hardware failures". So.... it's OK if you use my motherboard design and they randomly fail, because you should just make up for that in software or hardware redundancy? Um, no.

      That's exactly what it means, and how it works. When you have tens of thousands of nodes, some of them WILL eventually fail during operation, no matter how good the hardware is. Thus, the software must be designed to accommodate hardware failures and seamlessly continue operation without interruption or data loss. If you already have to design the software to handle that anyway, then there is not much incentive to go to great lengths to improve hardware reliability. Whether the failure rate is 1:100000

  • by Anonymous Coward

    Crawford thinks that web-scale data centers are designed to cope with hardware failures but hasn't tested it

    FTF Crawford.

  • by fuzzyfuzzyfungus ( 1223518 ) on Wednesday July 08, 2015 @04:57PM (#50072323) Journal
    I don't know if it's a good idea or not (probably depends on who you are, and I'm sure that there will be some people who choose incorrectly); but is it really a surprise that OCP would be doing their testing on the cheap 'n cheerful side of things?

    It was my understanding that their premise, from the beginning, was that existing hardware vendors were excessively focused on adding costly, thermally demanding, and often proprietary, features at the hardware level that were unnecessary if you were willing to compensate for their absence in your software design.

    There is obviously some level of reliability below which no compensation at the software level is possible (if you can't run the algorithm for detecting errors because it keeps glitching out, it's probably not going to work); but the impression they always conveyed was that many of the more sophisticated reliability mechanisms are really features aimed at people who are substantially less able to cope with failure; and are therefore willing to pay substantially more for hardware that can invisibly paper over a variety of moderately serious failures and allow the software on top to run without incident; rather than buying lots of cheap hardware that has a risk of going down in a screaming heap.

    So long as nobody gets any stupid optimistic ideas, I don't really see the issue. Sure, if Facebook were about sending men to Mars, they should seriously consider having three CPUs running in lockstep and voting on all operations and so on; but this project is about delivering as many ad impressions per dollar as possible; no reason to get worked up over the occasional glitch.
    • by mysidia ( 191772 )

      if you can't run the algorithm for detecting errors because it keeps glitching out, it's probably not going to work

      Chances are you can't make good assurances about tolerating any kind of byzantine fault.

      I realize there are finally some options for tolerating certain kinds of Byzantine faults in specific kinds of scenarios. In general, it is too hard or expensive, so the fact is, less reliable hardware does mean the application will be less reliable. Buying cheaper hardware is still a cost tradeof

  • 5 9's (Score:5, Insightful)

    by The Raven ( 30575 ) on Wednesday July 08, 2015 @04:58PM (#50072337) Homepage

    I'm gonna side with OCP on this one. It is far more economical to deal with reliability via redundancy than it is via expensive parts. This is why we use RAID rather than speccing our drives to last 10 years minimum. All the big players in the datacenter market have put thousands of hours each into designing systems tolerant of missing parts.

    The downside is that writing custom stacks tolerant of missing pieces is fucking hard and a huge up-front investment for a company. Most off-the-shelf software does not have that level of redundancy and fault tolerance baked in already. This means that for many small to medium sized deployments it's cheaper to buy a really expensive fault tolerant rack of servers and run your off-the-shelf software on it than it is to buy into OCP with inexpensive hardware that's more open to failure, because your software is NOT open to failure.

    Different strokes for different folks and all. Use the right tool for the job. And OCP was made by companies with massive data farms to fit their needs... and their needs are probably not your needs.
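The redundancy-versus-expensive-parts argument above comes down to simple probability. The sketch below assumes replicas fail independently and that the service is up whenever at least one replica is up; the function names and the 99% figure are illustrative assumptions, not numbers from OCP or any vendor.

```python
def combined_availability(node_availability: float, replicas: int) -> float:
    """Availability of a service that stays up while at least one replica is up."""
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - node_availability) ** replicas


def replicas_needed(node_availability: float, target: float) -> int:
    """Smallest replica count whose combined availability reaches the target."""
    if not (0.0 < node_availability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("availabilities must be strictly between 0 and 1")
    replicas = 1
    while combined_availability(node_availability, replicas) < target:
        replicas += 1
    return replicas


# Hypothetical cheap nodes that are each up 99% of the time, aiming for five nines.
print(replicas_needed(0.99, 0.99999))
```

Under these assumptions three 99%-available nodes would meet a five-nines target. The independence assumption carries the whole result: if every node shares the same design defect, the replicas fail together and the arithmetic no longer holds.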

    • by romanr ( 113283 )

      Exactly this. Pick the right tool for the right job. If you are just serving up simple web pages to the masses, go cheap, they can always hit refresh if things fail.

      If you have serious money flowing through the platform, plan and purchase accordingly. What is an outage going to cost you? A $50,000 server may end up being very, very cheap if an outage costs you $100,000 per hour.
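The trade-off in that last paragraph can be written down as a break-even calculation. A minimal sketch, assuming the whole $50,000 counts as the extra spend and that the pricier server actually prevents the outage; the function name is hypothetical.

```python
def break_even_outage_hours(hardware_premium: float, outage_cost_per_hour: float) -> float:
    """Hours of avoided downtime needed before the extra hardware spend pays off."""
    if outage_cost_per_hour <= 0:
        raise ValueError("outage cost per hour must be positive")
    return hardware_premium / outage_cost_per_hour


# Figures from the comment above: a $50,000 server and a $100,000-per-hour outage.
hours = break_even_outage_hours(50_000, 100_000)
print(f"{hours} hours ({hours * 60:.0f} minutes)")
```

With those figures the server pays for itself once it prevents half an hour of downtime.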

      • by hawguy ( 1600213 )

        Exactly this. Pick the right tool for the right job. If you are just serving up simple web pages to the masses, go cheap, they can always hit refresh if things fail.

        If you have serious money flowing through the platform, plan and purchase accordingly. What is an outage going to cost you? A $50,000 server may end up being very, very cheap if an outage costs you $100,000 per hour.

        If an outage costs you $100K/hour, you better not be running it on a single server.

  • Sounds like Hooli XYZ! Where's Nelson Big Head Bighetti?
  • Pick two...

    It all boils down to what you want, but of the three things we all say we want, you get only two...

  • ...Crawford argues that web-scale data centers are designed to cope with hardware failures...

    By that logic, the telco data centers are not designed to cope with hardware failures?

    .
    Of course, I really don't care if facebook has downtime due to hardware reliability issues. facebook is more a waste of time than anything else.

    • Of course, I really don't care if facebook has downtime due to hardware reliability issues. facebook is more a waste of time than anything else.

      Facebook's customers would tend to disagree. They are paying a lot of money to Facebook and they want their money's worth.

      Facebook's users are not the customers, they are the product.

      • I'd imagine Facebook puts more resources into keeping the tracking and Ad-serving hardware 100% operational. The rest of the infrastructure is just the chicken feed sprinkle.

        • The rest of the infrastructure is just the chicken feed sprinkle.

          That "chicken feed sprinkle" is precisely what the customers are paying for. Facebook is not just selling ads, they are selling everything you type.

  • by FranTaylor ( 164577 ) on Wednesday July 08, 2015 @05:29PM (#50072465)

    it doesn't matter how many redundant servers you have, if they are all going to fail in the same way

  • by poopie ( 35416 ) on Wednesday July 08, 2015 @05:31PM (#50072473) Journal

    I suspect the Open Compute Project welcomes additional testing resources for the benefit of everyone... as long as it doesn't involve an oppressive amount of process that simply serves to slow down progress.

    But... Web scale IS different, so I can't blame the main sponsors for not prioritizing what isn't as important to them. Once you accept that ALL hardware fails, and that you can either pay more for more reliable hardware, or you can develop better software architecture to handle failures, you look at things differently. Spend your money once on good software engineering, instead of over and over on every server.

    • Once you accept that ALL hardware fails, and that you can either pay more for more reliable hardware

      If you have all the same hardware and it's not adequately tested, then all of your hardware is vulnerable to the same issues, and your application will possibly fail on all of them! Throwing more hardware at the problem just means more failures.

      or you can develop better software architecture to handle failures

      How can you develop software to work around systemic hardware problems? How can you write software that automatically detects if your floating point hardware is always correct? You say "do it on multiple systems and compare the results" but what if they all hav

  • by viperidaenz ( 2515578 ) on Wednesday July 08, 2015 @06:49PM (#50072823)

    MongoDB is Web-scale.

  • by tlambert ( 566799 ) on Wednesday July 08, 2015 @07:37PM (#50072989)

    Test engineer says... big companies need to hire more test engineers.

    Are we surprised?

    • the reality of massive system outages affecting NYSE and airlines says that more test engineers are needed

      • If you'd been watching the attack maps, you'd know that:

        (1) It's China
        (2) It's likely at the government level

        If you'd been watching current events, you know that:

        (3) China's economy has been crashing, going on three weeks now
        (4) They're really unhappy about people taking money out of, and shorting, Chinese stocks, adding to the crash
        (5) They've lost $3.25T in market cap since June 12th
        (6) That's just over 20% of their Gross National Product

        So it's likely they are attacking our financial markets over that.

        Se

    • Software engineers say 'give us much more money to make software that is ten times as complex so you can throw it on cheap hardware to run.'

      Are we surprised?

      The trick is, robust hardware is robust hardware. It's done, you test it, then you build quality metrics into the process of building it and you're done. Complicated software to accommodate less robust hardware is bigger, more complex, and thus more prone to software bugs. You fix it by making it even more complex.

      But the software guys will be there
