Data Storage Hardware

Build Your Own $2.8M Petabyte Disk Array For $117k

Posted by Soulskill
from the we-know-exactly-what-you'd-do-with-that-much-storage dept.
Chris Pirazzi writes "Online backup startup BackBlaze, disgusted with the outrageously overpriced offerings from EMC, NetApp and the like, has released an open-source hardware design showing you how to build a 4U, RAID-capable, rack-mounted, Linux-based server using commodity parts that contains 67 terabytes of storage at a material cost of $7,867. This works out to roughly $117,000 per petabyte, which would cost you around $2.8 million from Amazon or EMC. They have a full parts list and diagrams showing how they put everything together. Their blog states: 'Our hope is that by sharing, others can benefit and, ultimately, refine this concept and send improvements back to us.'"
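
The per-petabyte figure in the summary can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 PB = 1,000 TB) and allows fractional pods, and the function name is made up for illustration.

```python
# Figures quoted in the story summary.
POD_COST_USD = 7_867       # material cost of one 4U pod
POD_CAPACITY_TB = 67       # raw capacity of one pod
VENDOR_COST_PER_PB = 2_800_000  # "around $2.8 million" from the summary
TB_PER_PB = 1_000          # assumption: decimal units


def cost_per_petabyte(pod_cost, pod_capacity_tb):
    """Material cost of one petabyte of pods (fractional pods allowed)."""
    return pod_cost / pod_capacity_tb * TB_PER_PB


cost = cost_per_petabyte(POD_COST_USD, POD_CAPACITY_TB)
ratio = VENDOR_COST_PER_PB / cost

print(f"${cost:,.0f} per PB")        # roughly $117k, as the summary says
print(f"vendor price is about {ratio:.0f}x the material cost")
```

Note that this covers material cost only; power, hosting, staff and redundancy are not part of the quoted number.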

  • Cool. (Score:2, Interesting)

    by SatanicPuppy (611928) * <Satanicpuppy&gmail,com> on Wednesday September 02, 2009 @10:16AM (#29285049) Journal

    Nominally a Slashvertisement, but the detailed specs for their "pods" (watch out guys, Apple's gonna SUE YOU) are pretty damn cool. 45 drives on two consumer grade power supplies gives me the heebie jeebies though (powering up in stages sounds like it would take a lot of manual cycling, if you were rebooting a whole rack, for instance), and I'd be interested to know why they chose JFS (perfectly valid choice) over some other alternative... There are plenty of petabyte capable filesystems out there.

    Very interesting though. I tried to push a much less ambitious version of this for work, and got slapped down because it wasn't made by (insert proprietary vendor here). Of course, we're still having storage issues because we can't afford the proprietary solution, but at least there is no non-branded hardware in our server room.

  • by TheGratefulNet (143330) on Wednesday September 02, 2009 @10:32AM (#29285253)

    where's the extensive stuff that sun (I work at sun, btw; related to storage) and others have for management? voltages, fan-flow, temperature points at various places inside the chassis, an 'ok to remove' led and button for the drives, redundant power supplies that hot-swap and drives that truly hot-swap (including presence sensors in drive bays). none of that is here. and these days, sas is the preferred drive tech for mission critical apps. very few customers use sata for anything 'real' (it seems, even though I personally like sata).

    this is not enterprise quality no matter what this guy says.

    there's a reason you pay a lot more for enterprise vendor solutions.

    personally, I have a linux box at home running jfs and raid5 with hotswap drive trays. but I don't fool myself into thinking it's BETTER than sun, hp, ibm and so on.

  • by parc (25467) on Wednesday September 02, 2009 @10:37AM (#29285343)

    At 67T per chassis and 45 drives documented per chassis, they're using 1.5T drives. 1 petabyte would then be 667 drives.

    The worst part of this design that I see (and there's a LOT of bad to see) is the lack of an easy way to get to a failed drive. When a drive fails you're going to have to pull the entire chassis offline. Google did a study in 2007 of drive failure rates (http://labs.google.com/papers/disk_failures.pdf) and found the following failure rates over drive age (ignoring manufacturer):
    3mo: 3% = 20 drives
    6mo: 2% = 13 drives
    1yr: 2% = 13 drives
    2yr: 8% = 53 drives

    Their logic is probably along the lines of "we're already paying someone to answer the pager in the middle of the night," but jeez, you're going to have to take a node offline every 2-3 days for the first year and then almost 2 a day after that!
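
The drive counts in the comment above can be reproduced as follows. This sketch only redoes the arithmetic on the percentages as quoted in the comment; it does not re-derive them from the Google paper, and the variable names are illustrative.

```python
import math

POD_CAPACITY_TB = 67
DRIVES_PER_POD = 45

# 67 TB over 45 drives is about 1.49 TB each, i.e. 1.5 TB drives.
tb_per_drive = POD_CAPACITY_TB / DRIVES_PER_POD

# Drives needed for 1 PB (taken as 1,000 TB) using 1.5 TB drives.
total_drives = math.ceil(1_000 / 1.5)

# Failure rates by drive age, as quoted in the comment.
failure_rates = {"3mo": 0.03, "6mo": 0.02, "1yr": 0.02, "2yr": 0.08}

failed_drives = {
    age: round(total_drives * rate) for age, rate in failure_rates.items()
}

print(f"{tb_per_drive:.2f} TB per drive, {total_drives} drives per PB")
for age, count in failed_drives.items():
    print(f"{age}: {count} drives")
```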

  • Re:Ripoff (Score:1, Interesting)

    by Anonymous Coward on Wednesday September 02, 2009 @10:38AM (#29285349)

    No, it's the google model: when a drive dies it's dead and doesn't matter anymore; when a server dies it's dead and doesn't matter anymore. The infrastructure built on top of the pods takes care of replicating data so a failure only removes one of several copies of the data.

  • Re:Not ZFS? (Score:5, Interesting)

    by anilg (961244) on Wednesday September 02, 2009 @10:41AM (#29285403)

    Get both Debian and ZFS.. Nexenta. Links in my sig.

  • Re:Ripoff (Score:5, Interesting)

    by timeOday (582209) on Wednesday September 02, 2009 @10:42AM (#29285427)
    Depends on how it works. Hopefully (or ideally) it's more like the google approach - build it to maintain data redundancy, initially with X% overcapacity. As disks fail, what do you do then? Nothing. When it gets down to 80% or so of original capacity (or however much redundancy you designed in), you chuck it and buy a new one. By then the tech is outdated anyways.
  • by maxume (22995) on Wednesday September 02, 2009 @10:49AM (#29285529)

    William Shatner has continued to be awesome well into his 70s. He even went on Conan and mocked Sarah Palin (while gently ribbing himself).

    Of the personalities in Hollywood, he is one I like quite a bit.

  • Re:Ripoff (Score:1, Interesting)

    by Anonymous Coward on Wednesday September 02, 2009 @10:55AM (#29285619)

    Looks like a cheap downscale undersized version of a Sun X4500/X4540.

    Or, if you also want software in appliance form, along with flash accelerator drives and support, the Sun Storage 7210 [sun.com] which holds 46 TB in its 4U chassis and is expandable to 142 TB.

    Sun has been undercutting NetApp prices with these ZFS-based "Unified Storage" systems, especially since they don't charge for software features (NFS, CIFS, HTTP, replication, etc.) separately like NetApp does.

    By the way, if you want to try the software, there's a VMware/VirtualBox VM image [sun.com] of the storage appliance. You can replace the simulated drives with real ones if you like.

  • Re:cheap drives too (Score:1, Interesting)

    by Anonymous Coward on Wednesday September 02, 2009 @10:58AM (#29285681)

    The point is, you could buy NetApp and install it yourself with cheap off-the-shelf consumer drives and end up spending about the same magnitude amount of money.

    You haven't bought a NetApp (or an EMC, Compellent, or XXX brand SAN) before - it doesn't work that way.

    You get to buy NetApp Shelves of NetApp drives which sit behind your NetApp Controller. The drives, while mechanically identical to those you buy from NewEgg, run a special FW version. If you did manage to get it working, you sure as hell aren't going to get any support from your storage vendor.

    Some of the newer NetApp controllers can sit in front of another SAN, but a bunch of commodity drives does not a SAN make.

    Consumer drives don't work behind a pair of SAN controllers from ANY dominant storage vendor. Period. It sucks - maybe this should be what we're aiming to change.

  • by MoonBuggy (611105) on Wednesday September 02, 2009 @10:59AM (#29285699) Journal

    The lowest cost of an (apparently) comparable solution on their site is from Dell, at $826,000 per PB. That includes hardware and support but still requires hosting, cooling and so on at extra cost. To quote backup and redundancy as part of the cost seems misleading, since none of the solutions appear to include that.

    Basically, comparing favourably to the Dell units simply requires that one can get support for less than $709,000. If you want to throw in backup and redundancy, then buy twice as many units - you've still got change from half a million compared to the single Dell unit in order to cover the extra power, support and cooling costs, not to mention that support costs don't necessarily scale linearly.
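
The comparison above works out as follows. A rough sketch using only the two per-petabyte prices quoted in the thread; the variable names are illustrative and nothing here accounts for hosting, power or cooling.

```python
DELL_COST_PER_PB = 826_000   # hardware plus support, per the comment
POD_COST_PER_PB = 117_000    # material cost only, per the story summary

# Budget left for support before the pods cost as much as Dell.
support_budget = DELL_COST_PER_PB - POD_COST_PER_PB

# Buying twice as many pods for backup and redundancy.
doubled_pod_cost = 2 * POD_COST_PER_PB
change_after_doubling = DELL_COST_PER_PB - doubled_pod_cost

print(f"support break-even: ${support_budget:,}")
print(f"left over after doubling up: ${change_after_doubling:,}")
```

With both figures taken at face value, the doubled-up build still leaves more than half a million dollars relative to the single Dell unit, which is the point the comment makes.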

  • by sockonafish (228678) on Wednesday September 02, 2009 @11:05AM (#29285781)

    Running on the cheapest hardware possible and engineering the software to gracefully deal with hardware failure is exactly how Google runs their datacenters, as well. As long as you've got the talent to pull it off, it's much more cost effective than buying a prefab solution.

  • Re:Not ZFS? (Score:3, Interesting)

    by chudnall (514856) on Wednesday September 02, 2009 @11:13AM (#29285883) Homepage Journal

    What do you mean by more expensive? OpenSolaris [opensolaris.org] with ZFS costs the same as Linux. And yes, you'll have to get up a lot less often in the middle of the night, since a few bad sectors aren't going to force a fail of the entire disk.

  • Re:Not ZFS? (Score:5, Interesting)

    by ajs (35943) <ajs@@@ajs...com> on Wednesday September 02, 2009 @11:15AM (#29285921) Homepage Journal

    Are you saying that with the more expensive system, disks never fail and nobody ever has to get up in the night?

    Well... yes and no. When you've worked with high-end arrays, you learn that storage is only the beginning. NetApp and EMC provide far, far more. I was damned impressed when I first heard a presentation from NetApp about their technology, but the day that they called me up and told me that the replacement disk was in the mail and I answered, "I had a failure?" ... that was the day that I understood what data reliability was all about.

    Since that time (over 10 years ago), the state of the art has improved over and over again. If you're buying a petabyte of storage, it's because you have a need that breaks most basic storage models, and the average sysadmin who thinks that storage is cheap is going to go through a lot of pain learning that he's wrong.

    Someday, you'll have a petabyte disk in a 3.5" form-factor. At that point, you can treat it as a commodity. Until then, there are demands placed on you when you administrate that much storage which demand a very different class of device than a Linux box with a bunch of raid cards.

    As evidence of that, I submit that dozens of companies like the one in this article have existed over the years, and only a handful of them still exist. Those that still do have either exited the storage array business, or have evolved their offerings into something that costs a lot more to build and support than a pile of disks.

  • Re:Disk replacement? (Score:3, Interesting)

    by TooMuchToDo (882796) on Wednesday September 02, 2009 @11:28AM (#29286141)
    What kind of drives are you using? We've got 4800+ spinning drives, and we only have 1-2 failures a month.
  • by rijrunner (263757) on Wednesday September 02, 2009 @11:50AM (#29286549)

        Having a couple decades of working both sides of the Support Divide, I am now of the opinion that the sole purpose of a Support Contract is to have someone at the other end of the phone to yell at. It makes people feel better and have a warm fuzzy. But, having had to schedule CE's to come onto site to replace failed hardware, I have generally found that that adds hours to any repair job. I would guess that you could power off this array, remove every single drive, move them to a new chassis, reformat them in NTFS, then back to JFS and still finish before a CE shows up on site. I recall that in the winter of 1994, *every* Seagate 4GB drive in our Sun boxes died.

        What happens now when a drive goes bad is that a drive goes bad. You spot it through some monitoring software. You pick up the phone and call a 1-800 number. Someone asks a few questions like "What is your name? What is your quest? What is your favorite color?", then you hear typing in the background. After a bit, if you're lucky, they have you in the system correctly and can find your support contract for that box. Then, they give you a ticket number and put you on hold. Then, after a bit, an "engineering" rep will appear and say "What is the nature of the emergency" and you then tell them the same stuff, except you get to add words like "var adm messages" or something. They'll tell you to send them some email so they can do some troubleshooting. You send them what they ask for. About an hour or so later, you get an email or call back saying that the drive has gone bad and needs to be replaced, which is pretty much the same thing you told them when you called in. They then tell you that you are on a Gold Contract with 24/7 support and that the CE has a 4 hour callback requirement from the time the call is dispatched to the CE. By this point, you are about 3-4 hours after the disk drive failed in the first place. Finally, the CE will call back after some amount of time to schedule a replacement. And here comes the real kicker.... In almost every instance for the last 10 years, we have had to do all maintenance during a scheduled window. At 1AM.

        What happens now when something breaks is that someone fixes it.

        Any business is faced with a Buy-It-Or-Build-It dilemma for any service or equipment. Since this was their core business, it certainly makes sense. And, it makes sense for any business of a certain size or set of skills. The reality is that the math is favoring consumer electronics for most applications because they are good enough for 85% of the business needs out there. The whole Cost-Benefit analysis must be periodically re-addressed. If you do not have $1 million a year in billed repair from a Support contract, is it worth $1 million a year for the contract? Seriously.. Even if you have a support contract, you're probably going to get billed time and materials on top of everything else.

        With the math on this unit, you can build in massive layers of redundancy to greatly reduce even the possibility of the data being inaccessible and still come in far, far cheaper than any support contract, and you can schedule downtime because you have redundancy across multiple chassis.

  • Please.... (Score:3, Interesting)

    by mpapet (761907) on Wednesday September 02, 2009 @12:04PM (#29286753) Homepage

    where I work we pay a premium for what happens when the power goes out, what happens when a drive goes bad,

    Whoever spec'd your systems should have accommodated obvious failures like this. As in, paying for colo, using servers with dual power supplies that fail over, sensible RAID strategy. Giving money to EMC in this situation is not sensible.

    but they also left out the people who manage these massive beasts. I mean, how many hundreds (or thousands) of drives are we talking here?
    I have a couple of hundred drives going at any one time and I get an SNMP alert when a drive goes bad. I take one out of the closet and destroy the broken one. The RAID does the rest.

    someone to get up in the middle of the night when their pager goes off because something just went wrong and you want 24/7 storage time.
    Our storage strategy is N+1 all the way and required to be online 24/7 so failures are part of the plan. They are probably part of the plan at this startup.

    We pay premiums so we can relax and concentrate on what we need to concentrate on.
    I don't understand this. If your job is 89% software dev, then EMC may be the way to go. Expensive! But, it makes a little business sense. If you aren't spending most of your time writing software that adds value to your service/product, then EMC is doing your job and you are some kind of TPS generator. Do you pay a premium to blame someone else? I've had the opportunity to work in places like this and I've always passed because of the veiled contempt for IT.

    Please, explain this to me.

  • by Anonymous Coward on Wednesday September 02, 2009 @12:23PM (#29287051)

    I've never spent more than 10 minutes on the phone with the Dell Guy(tm) for a failed drive.

    Always had them in under 4 hours, too.

    Just my MMV.

  • Re:Not ZFS? (Score:1, Interesting)

    by Anonymous Coward on Wednesday September 02, 2009 @01:02PM (#29287599)

    That is scary as hell. You didn't know the drive failed??? Why?? How the heck did they know? Do you really provide them access to your data 24/7?? That's crazy!

    The biggest argument against the large storage companies, is that large, dynamic companies don't use them. Amazon doesn't. Google doesn't. Facebook doesn't. Think smarter, not more expensive.

  • Re:Not ZFS? (Score:3, Interesting)

    by iphayd (170761) on Wednesday September 02, 2009 @01:28PM (#29288035) Homepage Journal

    On a similar note, they claim that they will back up any one computer for $5/month. Well, my one computer happens to be the backup node for my SAN, so they're going to need about 15 TB (it's a small SAN) to have 30 day backups for me. Please note that all of the files on my SAN are under 4GB and I have a SAN, not a NAS, so my servers see it as a native hard drive.

  • Re:Not ZFS? (Score:3, Interesting)

    by FoolishBluntman (880780) on Wednesday September 02, 2009 @01:40PM (#29288259)
    I have news for you. The high end boxes from EMC, NetApp and the like have silent data corruption too!
  • by PRMan (959735) on Wednesday September 02, 2009 @01:45PM (#29288341)

    I used to work at a company that paid a 20% premium on hardware for support from HP that was COMPLETELY WORTHLESS. I told them they would be better off just ordering a 6th computer for every 5 that they bought.

    The guy would show up with no tools, not even a screwdriver, and then he would need to come back the next day (with a screwdriver). Then he didn't have the part (say RAM) that we told them in the first call and the day before. Then he showed up the next day with RAMBUS instead of DDR RAM. After 3 weeks, we got the machine back online.

    Which means, in the meantime, since the person whose machine it was had to have something to work on, we had to cobble together a PC from no spare parts and then try to transfer their stuff off of their drive (because nobody ever heeded the "store everything on the U: and S: drive" mantra) and we worked like crazy to do it, eating up our whole day.

    If we had had spare machines instead, we could have just replaced her RAM in 1 minute. Or, if it was the motherboard, put her drive in an identical replacement machine in 1 minute.

  • by Score Whore (32328) on Wednesday September 02, 2009 @02:32PM (#29289053)

    Redundancy can be had for another $117,000.
    Hosting in a DC will not even be a blip in the difference between that and $2.7m.

    EMC, Amazon etc are a ripoff and I have no idea why there are so many apologists here.

    First these aren't even storage arrays in the same sense that EMC, Hitachi, NetApp, Sun, etc. provide. The only protocol you can use to access your data is https? WTF! Second the Hitachi array in my data center doesn't put 67 TB storage behind half a dozen single points of failure the way this thing does. Third the Hitachi array in my data center doesn't put 67 TB behind a dinky gigabit ethernet link. My Hitachi will provide me with 200,000 IOPS with 5 ms latency. I can hook a whole slew of hosts up to my SAN. I can take off-host, change-only copies of my data so backups don't bog down my production work. I can establish replication between the Hitachi here in this building and the second array four hundred miles away with write order fidelity and guaranteed RPOs.

    Comparing this thing to enterprise class storage is like some sixteen year old adding a cold air intake and a coat of red paint to his Honda civic then running around bragging that his car is somehow comparable to a Ferrari ("look they're both red!") Every time I see something like this the only thing I learn is that yet another person doesn't actually "Get It" when it comes to storage.

    HelloWorld.c is to the Linux kernel as this thing is to the Hitachi USP-V or EMC Symmetrix.

  • Re:wtf? (Score:3, Interesting)

    by Rich0 (548339) on Wednesday September 02, 2009 @03:08PM (#29289641) Homepage

    Yup.

    You can do even better than the price quoted in this article. On Newegg I found a 1TB drive for $95 - that is only $95k/PB. What a bargain!

    Except that I don't have a PB of space with my solution. I have 0.001PB of space. If I want 1PB of space then I need hundreds of drives, and some kind of system capable of talking to hundreds of drives and binding them into some kind of a useful array.

    This sounds like criticizing the space shuttle as being wasteful as you can cover the same distance in a truck for 1/10000000 x the cost. Except of course for the minor detail that the truck can't fly in space, and can't do all that distance on a single load of fuel in a few hours.

    Or, I can generate completely green energy at a very low price per gigawatt using a small generator and a hamster wheel. Except that I'm not generating a gigawatt - I'm generating maybe a few mW and scaling it up. Unless I bury China in rats I'm not going to be competing with the Three Gorges Dam.

  • by MartinSchou (1360093) on Wednesday September 02, 2009 @03:09PM (#29289651)

    You raise an "interesting" train of thought in my mind.

    Encoding in 720p x264 you get something like 45 minutes in 1.1 GB. This gives you 60,900 episodes per 4U unit or 609,000 episodes per 40U rack.

    In 1080p x264 you get something like 45 minutes in about 2.5 GB. This is 27,000 episodes per 4U unit or 270,000 episodes per 40U rack.

    Assuming 22 episodes per season and a five year average run time, you end up with 220 episodes per show (typical science fiction shows).
    Assuming 5 shows per week, 40 weeks a year, 10 year run time, you end up with 2,000 episodes per show (typical soaps).

    So you could easily store 100 full sci-fi shows and 100 full soaps in one rack (that'd be 222,000 episodes), all stored in glorious 1080p.
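
The per-pod and per-rack episode counts above can be sketched like this. It assumes 67 TB is 67,000 GB and that a 40U rack holds ten 4U pods with nothing else in it. The 220 and 2,000 episodes-per-show figures are taken from the comment as written (22 episodes over five seasons would give 110, which only makes the library smaller).

```python
POD_CAPACITY_GB = 67_000          # 67 TB per 4U pod, decimal units assumed
PODS_PER_RACK = 40 // 4           # ten 4U pods in a 40U rack
GB_PER_EPISODE = {"720p": 1.1, "1080p": 2.5}  # 45-minute episode, x264


def episodes_per_pod(quality):
    """Whole 45-minute episodes that fit in one pod at the given quality."""
    return int(POD_CAPACITY_GB / GB_PER_EPISODE[quality])


def episodes_per_rack(quality):
    """Whole episodes that fit in a rack of ten pods."""
    return episodes_per_pod(quality) * PODS_PER_RACK


# Library described in the comment: 100 sci-fi shows and 100 soaps.
library_episodes = 100 * 220 + 100 * 2_000

for quality in GB_PER_EPISODE:
    print(quality, episodes_per_pod(quality), episodes_per_rack(quality))

print("library:", library_episodes,
      "fits in one 1080p rack:", library_episodes <= episodes_per_rack("1080p"))
```

The comment rounds these to 60,900 and 27,000 episodes per pod; the unrounded values are 60,909 and 26,800.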

    IMDb lists the following statistics [imdb.com]:

    452,982 movies released theatrically.
    792,565 TV episodes.
    75,316 made for TV movies.
    61,440 TV series.
    77,624 direct to video movies.

    Leaving out "TV series" (they average 12.9 episodes/series, which seems reasonable with the amount of cancelled series) I'll make the following assumptions about average run time:
    Theatrical releases: 120 minutes
    TV episodes: 35 minutes
    TV movies: 90 minutes
    Direct to video: 100 minutes

    That's a total of 96,638,455 minutes. Encoding that in 720p would require 2,362,274 GB or 5,315,117 GB for 1080p.
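
The totals above follow from the IMDb counts and the assumed run times. A sketch of the arithmetic, using the per-45-minute sizes given earlier in the comment (1.1 GB for 720p, 2.5 GB for 1080p):

```python
# (title count, assumed average run time in minutes), from the comment.
catalogue = {
    "theatrical movies": (452_982, 120),
    "TV episodes": (792_565, 35),
    "TV movies": (75_316, 90),
    "direct to video": (77_624, 100),
}

total_minutes = sum(count * minutes for count, minutes in catalogue.values())

# Storage at the quoted sizes per 45 minutes of video.
gb_720p = total_minutes / 45 * 1.1
gb_1080p = total_minutes / 45 * 2.5

# Continuous viewing time, using 365-day years.
years_to_watch = total_minutes / (60 * 24 * 365)

print(f"{total_minutes:,} minutes")
print(f"720p: {gb_720p:,.0f} GB, 1080p: {gb_1080p:,.0f} GB")
print(f"{years_to_watch:.1f} years of viewing")
```

At exactly 2.5 GB per 45 minutes the 1080p total comes out about 1% above the 5,315,117 GB quoted in the comment, which presumably used a slightly smaller per-episode size. Either way the whole catalogue is on the order of 2.4 PB in 720p and 5.3 PB in 1080p.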

    What's my point? Well, for one thing you couldn't ever watch it, as it's 183 years, so no, that wasn't my point ;)

    That it is entirely within the realm of feasibility to offer downloads of every single movie and tv-show on IMDb from a hardware point of view. One of the complaints I've heard from the production companies is that it would be impossible to set up the hardware needed for it. Even at Sun's prices, you'd "only" need to pay 10 million dollars to store everything in both 720p and 1080p quality. Set up redundant servers in 10 different locations, 5 in the US, 5 in Europe, and you're still only out 100 million dollars.

    From a cultural point of view, think of all the things that are lost when the copyright holders let these things rot away on shelves, throw it out or it's lost in some kind of calamity. And this is just movies and tv-shows. Add in music and news and I suspect you could easily get hugely redundant back-ups of it all for 1 billion dollars. Even if you had to replace the storage arrays every 3 years, it's still really really cheap. Figure twice that for maintenance, and we have an annual cost of about a billion dollars - cheap when we're saving all knowledge for our successors. That's roughly the cost of building 125 miles of rural freeway in Michigan [michigan.gov]. It'd be cheap at 10x the price. And in ten years - we will probably still be using high bit rate encoding (1080p+), but will the cost of storage still be as high? I suspect it'll slowly fall, slightly faster than inflation.

    Having to reencode everything from time to time, would obviously take a huge amount of time, but that is the price we pay for progress. On the other hand, even with 1:1 encoding time, it'd only take 183 computer-years to do it.

    Imagine what it would be like if 25 years from now your kids could, at the touch of a button, gain access to every bit of entertainment and news from the last 25 years. I don't mean going to Wikipedia and looking up The Terminator [wikipedia.org] but actually watch the film, read all the news about it, as it looked at the time, five years on, seven years on after Terminator 2: Judgement Day had its effect on the new franchise etc.

    Imagine them not having to settle for what history books said happened in the year 2010 or about specific events in that year, but be able to pull up every single news article and tv news report on the subject and make up their own mind, de

  • Re:Not ZFS? (Score:4, Interesting)

    by plover (150551) * on Wednesday September 02, 2009 @06:04PM (#29292209) Homepage Journal

    They're betting on the MTTF of the drives, on RAID, and on redundant system backups.

    Yes, it's cheap hardware. Yes, cheap hardware fails more often than expensive hardware. Yes, cheap hardware is slower than expensive hardware. But you have to look at the offsets: they are building a backup service, where they don't need "instant" data access speeds. As for drive failures, I have some experience there. I have 57,000 cheap-ass consumer drives in service, and over 10,000 of them are 11 years old. They're dying at the rate of about ten failures per day. The key is to build your processes to tolerate and handle failures.

    As long as your redundant systems are keeping copies of the data, and you understand exactly what the impact is of a failed component as well as have a recovery plan in place, why not use cheap hardware? Let's do a bit of math. The guy had a photo of himself standing behind about 18 of these boxes. That's 810 drives. If we lowball cheap drives at 300,000 hours MTBF, he'll see an average of two failures per month. It might take him $200 and an hour to recover each failed drive. We could keep doing the math on each component, but I suspect this is still a complete and total bargain that will meet his business needs very well.
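
The failure estimate above can be sketched as a constant-failure-rate calculation. It assumes 45 drives per box (the pod figure from the story), an average month of 730 hours, and that MTBF translates directly into a steady failure rate, which is a simplification.

```python
BOXES = 18
DRIVES_PER_BOX = 45            # per the pod design in the story
MTBF_HOURS = 300_000           # the comment's lowball figure
HOURS_PER_MONTH = 8_760 / 12   # 730 hours in an average month
COST_PER_FAILURE_USD = 200     # the comment's estimate per failed drive

drives = BOXES * DRIVES_PER_BOX

# Expected failures per month under a constant failure rate.
failures_per_month = drives * HOURS_PER_MONTH / MTBF_HOURS
monthly_replacement_cost = failures_per_month * COST_PER_FAILURE_USD

print(f"{drives} drives, {failures_per_month:.2f} failures per month")
print(f"about ${monthly_replacement_cost:,.0f} per month in replacements")
```

This gives just under two failures a month, matching the comment's estimate.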

    It may not be as shiny as EMC or NetApp, and you have to do the legwork yourself, but why spend the extra money on a system that would provide him with "too much service"? From an ROI perspective, this guy is probably going to do very well, even though he may drive a few sysadmins crazy in the process.

  • Re:Not ZFS? (Score:3, Interesting)

    by therufus (677843) on Wednesday September 02, 2009 @08:10PM (#29293537)

    You need to look at the grand scheme of things. Sure, you may get 5-10% of customers using massive amounts of data (over 500GB) but when 90-95% of your customers are home users and small businesses who don't have their own data centers, and they may only have a 50MB backup, their lack of use offsets the heavy users.

    Imagine if in a 1PB server, 750TB of data was used by 10,000 individuals paying $5/mth and the other 250TB was used by 50 individuals paying $5/mth. I failed at mathematics at school, but I'm sure the 10k will pay the data center costs that would be incurred by the 50.

  • Re:Ripoff (Score:3, Interesting)

    by PAjamian (679137) on Wednesday September 02, 2009 @08:46PM (#29293923)

    Fine then, replace just the broken drives, but as far as I'm aware Linux software RAID 6 does not require the drives be the same model, or even the same size. You can get newer drives for the same or less cost as the old drives and just plug them in. Who cares if they have more capacity? Just let it go to waste if you must but it'll work just fine and certainly you won't have to be scrounging drives off of eBay.

    Also consider that five years down the road we may have 10 TB drives or better, but 1.5 TB drives should still be available on the consumer market (and keep in mind these are cheap consumer drives) for dirt cheap and these guys will probably be quite happy to use their same design with newer high capacity drives available at the time.
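
The "let the extra capacity go to waste" point can be illustrated with a small calculation. This is a sketch assuming the array behaves like Linux md RAID 6, where every member contributes only as much as the smallest device and two members' worth of space goes to parity. The 15-drive array and the function names are hypothetical, not the pods' actual layout.

```python
def raid6_usable_tb(drive_sizes_tb):
    """Usable capacity of a RAID 6 array limited by its smallest member."""
    if len(drive_sizes_tb) < 4:
        raise ValueError("RAID 6 needs at least four drives")
    return (len(drive_sizes_tb) - 2) * min(drive_sizes_tb)


def wasted_tb(drive_sizes_tb):
    """Capacity on larger drives that the array cannot use."""
    smallest = min(drive_sizes_tb)
    return sum(size - smallest for size in drive_sizes_tb)


# Hypothetical array of fifteen 1.5 TB drives.
original = [1.5] * 15

# Three failed drives replaced with newer 2.0 TB models.
mixed = [1.5] * 12 + [2.0] * 3

print(raid6_usable_tb(original), wasted_tb(original))
print(raid6_usable_tb(mixed), wasted_tb(mixed))
```

Usable capacity stays the same after the swap; the only cost is the unused half terabyte on each replacement drive.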
