Data Storage Hardware

Build Your Own $2.8M Petabyte Disk Array For $117k (487 comments)

Posted by Soulskill
from the we-know-exactly-what-you'd-do-with-that-much-storage dept.
Chris Pirazzi writes "Online backup startup Backblaze, disgusted with the outrageously overpriced offerings from EMC, NetApp and the like, has released an open-source hardware design showing you how to build a 4U, RAID-capable, rack-mounted, Linux-based server using commodity parts that contains 67 terabytes of storage at a material cost of $7,867. This works out to roughly $117,000 per petabyte, which would cost you around $2.8 million from Amazon or EMC. They have a full parts list and diagrams showing how they put everything together. Their blog states: 'Our hope is that by sharing, others can benefit and, ultimately, refine this concept and send improvements back to us.'"
This discussion has been archived. No new comments can be posted.

  • by elrous0 (869638) * on Wednesday September 02, 2009 @10:20AM (#29285081)

    Soon I shall have a single media server with every episode of "General Hospital" ever made stored at a high bitrate. WHO'S LAUGHING NOW, ALL YOU WHO DOUBTED ME!!!!

    And how big is a petabyte you ask? There have been about 12,000 episodes of General Hospital aired since 1963. If you encoded 45 minute episodes at DVD quality mpeg2 bitrate, you could fit over 550,000 episodes of America's finest television show on a 1 petabyte server, enough to archive every episode of this remarkable show from its auspicious debut in 1963 until the year 4078.
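
As a rough check on the arithmetic above, this sketch works backwards from the comment's own figures. The decimal petabyte (10^15 bytes) and the 2009 cutoff for the 12,000 aired episodes are assumptions, not numbers given in the comment.

```python
# Back-of-the-envelope check of the episode math (assumed inputs are marked).
PETABYTE_BYTES = 10**15        # assumption: decimal petabyte
EPISODES_PER_PB = 550_000      # from the comment
EPISODE_SECONDS = 45 * 60      # 45-minute episodes, from the comment

# Storage budget per episode and the video bitrate that budget implies.
bytes_per_episode = PETABYTE_BYTES / EPISODES_PER_PB
mbit_per_s = bytes_per_episode * 8 / EPISODE_SECONDS / 1e6

# Assumption: the 12,000 episodes aired between 1963 and 2009.
episodes_per_year = 12_000 / (2009 - 1963)
end_year = 1963 + EPISODES_PER_PB / episodes_per_year

print(f"{bytes_per_episode / 1e9:.2f} GB per episode, {mbit_per_s:.2f} Mbit/s")
print(f"archive full around the year {int(end_year)}")
```

Under these assumptions the implied bitrate is a little over 5 Mbit/s, which is within what DVD video allows, and the archive fills up within a few years of the 4078 quoted above (the comment says "over 550,000" episodes, so a small gap is expected).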

  • by Desler (1608317) on Wednesday September 02, 2009 @10:21AM (#29285095)
    It's not your math that's rusty; it's your reading skills.

    Linux-based server using commodity parts that contains 67 terabytes of storage at a material cost of $7,867.

  • Re:Disk replacement? (Score:2, Informative)

    by markringen (1501853) on Wednesday September 02, 2009 @10:23AM (#29285121)
    Slide it out on a rail and drop in a new one. And there is no such thing as consumer grade anymore; consumer drives are often of much higher quality, stability-wise, than server-specific drives these days.
  • by SatanicPuppy (611928) * <Satanicpuppy@g m a i l .com> on Wednesday September 02, 2009 @10:23AM (#29285127) Journal

    The focus of the article was only on the hardware, which was extremely low cost to the point of allowing massive redundancy...This is not an inherently flawed methodology.

    If you can deploy cheap 67 terabyte nodes, then you can treat each node like an individual drive, and swap them out accordingly.

    I'd need some actual uptime data to make a real judgment on their service vs their competitors, but I don't see any inherent flaws in building their own servers.

  • by ShadowRangerRIT (1301549) on Wednesday September 02, 2009 @10:24AM (#29285139)
    You misread. It's $7,867 per 67 terabytes. So at the hard disk standard for a petabyte (base 10, not base 2), 1000 TB == 1 PB:
    (1000 TB / 67 TB) * $7,867 = $117417.91
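
The same division can be written as a small sketch. The function name and the `tb_per_pb` parameter are made up for illustration; the 1024 variant only shows what the base-2 convention mentioned above would give.

```python
def cost_per_petabyte(pod_cost_usd, pod_capacity_tb, tb_per_pb=1000):
    """Scale the cost of one pod up to a full petabyte of pods."""
    return pod_cost_usd * tb_per_pb / pod_capacity_tb

# Figures from the summary: $7,867 for a 67 TB pod.
decimal_pb_cost = cost_per_petabyte(7867, 67)        # 1 PB = 1000 TB
binary_pb_cost = cost_per_petabyte(7867, 67, 1024)   # 1 PB = 1024 TB

print(f"base 10: ${decimal_pb_cost:,.2f}")
print(f"base 2:  ${binary_pb_cost:,.2f}")
```

Either convention lands close to the $117k in the headline; the base-2 figure is only about $3,000 higher.
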
  • by CoolCash (528004) on Wednesday September 02, 2009 @10:24AM (#29285141) Homepage
    If you check out what the company does, they are an online backup company. They don't host servers on this array, just backup data from your desktop. They just need massive amounts of space which they make redundant.
  • by staeiou (839695) * <staeiou@@@gmail...com> on Wednesday September 02, 2009 @10:27AM (#29285187) Homepage

    We don't pay premiums because we're stupid. We pay premiums so we can relax and concentrate on what we need to concentrate on.

    They actually do talk about that in the article. The difference in cost between a petabyte of the homegrown pods and a petabyte from the cheapest supplier (Dell) is about $700,000. The difference between their pods and cloud services is over $2.7 million per petabyte. And they have many, many petabytes. Even if you do add "a few hundred thousand a year for the people who need to maintain this hardware" - and Dell isn't going to come down in the middle of the night when your power goes out - they are still way, way on top.

    I know you don't pay premiums because you're stupid. But think about how much those premiums are actually costing you, what you are getting in return, and if it is worth it.

  • by Tx (96709) on Wednesday September 02, 2009 @10:28AM (#29285197) Journal

    We don't pay premiums because we're stupid. We pay premiums because we're lazy.

    There, fixed that for you ;).

    Ok, that was glib, but you do seem to have been too lazy to read the article, so perhaps you deserve it. To quote TFA, "Even including the surrounding costs - such as electricity, bandwidth, space rental, and IT administrators' salaries - Backblaze spends one-tenth of the price in comparison to using Amazon S3, Dell Servers, NetApp Filers, or an EMC SAN." So they aren't ignoring the costs of IT staff administering this stuff as you imply; they're telling you the costs including the admin costs at their datacentre.

  • cheap drives too (Score:3, Informative)

    by pikine (771084) on Wednesday September 02, 2009 @10:38AM (#29285351) Journal

    Reliant Technology sells you a NetApp FAS 6040 for $78,500 with a maximum capacity of 840 drives, without the hard drives (source: Google Shopping). If you buy a FAS 6040 with the drives, most vendors will use more expensive, lower-capacity 15k rpm drives instead of the 7200 rpm drives the Backblaze Pod uses, and this makes up a lot of the price difference. The point is, you could buy a NetApp, install cheap off-the-shelf consumer drives yourself, and end up spending the same order of magnitude of money. I estimate that NetApp would cost just 1.5x the amount.

    NetApp FAS 6040 at $78,500 + 840 x 1.5TB drives at $120 each = $179,300, which gives you 1.26PB. Cost per petabyte is $142,500, only slightly more expensive than Backblaze's $117,000 from the article. The real story is that Backblaze is able to show a competitive edge of $30,000, or being 20% cheaper.
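
For anyone who wants to rerun this comparison, here is a minimal sketch using only the prices quoted in the comment and the pod figures from the summary. The variable names are illustrative, and a decimal petabyte (1000 TB) is assumed.

```python
# Prices as quoted in the comment above; none of them are verified here.
NETAPP_HEAD_USD = 78_500
DRIVE_SLOTS = 840
DRIVE_TB = 1.5
DRIVE_USD = 120

netapp_total_usd = NETAPP_HEAD_USD + DRIVE_SLOTS * DRIVE_USD
netapp_capacity_pb = DRIVE_SLOTS * DRIVE_TB / 1000   # assumption: 1 PB = 1000 TB
netapp_usd_per_pb = netapp_total_usd / netapp_capacity_pb

# Pod figures from the story summary: $7,867 per 67 TB.
pod_usd_per_pb = 7867 * 1000 / 67

saving_usd = netapp_usd_per_pb - pod_usd_per_pb
saving_fraction = saving_usd / netapp_usd_per_pb

print(f"NetApp:  ${netapp_usd_per_pb:,.0f} per PB")
print(f"Pod:     ${pod_usd_per_pb:,.0f} per PB")
print(f"Saving:  ${saving_usd:,.0f} ({saving_fraction:.1%})")
```

Recomputed without rounding, the gap comes out somewhat smaller than the rounded $30,000 and 20% in the comment, but the conclusion is the same: with consumer drives the two options are in the same ballpark.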

  • by fulldecent (598482) on Wednesday September 02, 2009 @10:49AM (#29285541) Homepage

    >> You might as well add a few hundred thousand a year for the people who need to maintain this hardware and also someone to get up in the middle of the night when their pager goes off because something just went wrong and you want 24/7 storage time.

    >> We don't pay premiums because we're stupid. We pay premiums so we can relax and concentrate on what we need to concentrate on.

    Or... you could just buy ten of them and use the left over $1m for electricity costs and an admin that doesn't sleep

  • Re:Disk replacement? (Score:2, Informative)

    by maxume (22995) on Wednesday September 02, 2009 @10:54AM (#29285595)

    It sounds like they just soft-swap a whole chassis once enough of the drives in it have failed.

    If their requirements are a mix of cheap, redundant and huge (with not so much focus on performance), cheap disposable systems may fit the bill.

  • by SatanicPuppy (611928) * <Satanicpuppy@g m a i l .com> on Wednesday September 02, 2009 @11:04AM (#29285769) Journal

    This sort of attitude is how Sun got its lunch eaten in the market in the first place.

    Yes, your hardware rocks. It's so fucking sexy I need new pants when I come into contact with it.

    It also costs more than a fucking Italian sports car.

    Turns out that if your awesome hardware is 10 times better than commodity hardware, but also 25 times as expensive, people are just going to buy more commodity hardware.

    I've got some Sun data appliances and I've got some Dell data appliances, and the only difference I've seen between them is purely one of cost. The only thing that ever breaks is drives.

  • by Anonymous Coward on Wednesday September 02, 2009 @11:09AM (#29285835)

    RTFA - they are not saying one of these is a mission critical enterprise storage system. In fact they said:

    No One Sells Cheap Storage, so We Designed It

    When you are talking about multiple-petabyte scale, paying 5x as much for 5 temperature sensors, SAS drives, LEDs etc. becomes pretty stupid.

    • Treat the 67TB system as an $8,000 hard drive.
    • Deploy a few tens or hundreds of them with redundancy between them.
    • In 2-3 years when they start to fail, replace them with larger-capacity drives.
    • ???
    • Take your hundreds of thousands of dollars not paid to Sun, IBM, EMC, NetApp etc. and PROFIT!!!
  • Re:Ripoff (Score:3, Informative)

    by ciroknight (601098) on Wednesday September 02, 2009 @11:17AM (#29285961)
    Since most modern commercial-grade HDs come with a 3-5 year or better warranty these days [1] [wdc.com], it's easier just to cash those in when the drives go bad and build a new box around the newer-model drives they ship you in return.

    This is truly RAID, as Google, etc. have realized and developed. When the drives die, you don't cry over having the exact same drive stocked. You don't cry at all. At $8k a machine, you could actually afford to flat-out replace the entire box every 4 years and not affect your bottom line (since, you know, you're saving better than three times that by not going with one of the 'cloud vendors').
  • by Anarke_Incarnate (733529) on Wednesday September 02, 2009 @11:18AM (#29285981)
    The hardest part will be identifying the bad drives. That is ANOTHER feature that you pay for on expensive disk systems. The controllers will alert you to where the failed drive is, and often alert the manufacturer of the failure as well. There have been times I have been called by a vendor to let me know a part and an on-site engineer were being dispatched for a failure my users were not even aware of yet because it happened during off hours (and ops were asleep at the wheel).
  • by ianpatt (52996) on Wednesday September 02, 2009 @11:36AM (#29286283)

    From the credits list: "Protocase for putting up with hundreds of small 3-D case design tweaks", which I assume is http://www.protocase.com/ [protocase.com].

  • Re:Not ZFS? (Score:2, Informative)

    by ImprovOmega (744717) on Wednesday September 02, 2009 @11:43AM (#29286439)

    I was damned impressed when I first heard a presentation from NetApp about their technology, but the day that they called me up and told me that the replacement disk was in the mail and I answered, "I had a failure?" ... that was the day that I understood what data reliability was all about.

    Agreed. We've had similar experiences with HP EVA systems here at work with things like that, it's wonderful =)

    Someday, you'll have a petabyte disk in a 3.5" form-factor. At that point, you can treat it as a commodity.

    As much as I want to believe this, I know that just as in the past the business will find a way to fill an array of such drives. They'll decide to do something silly like 24/7 recording of 1000 different cameras, or hourly snapshots of critical systems going back 3 months "just in case", or something. If you have seemingly unlimited amounts of cheap storage, the business *will* find a way to fill it.

  • by cowbutt (21077) on Wednesday September 02, 2009 @12:00PM (#29286703) Journal

    they used incredibly cheap-ass HBAs for no good reason.

    In their defence:

    A note about SATA chipsets: Each of the port multiplier backplanes has a Silicon Image SiI3726 chip so that five drives can be attached to one SATA port. Each of the SYBA two-port PCIe SATA cards has a Silicon Image SiI3132, and the four-port PCI Addonics card has a Silicon Image SiI3124 chip. We use only three of the four available ports on the Addonics card because we have only nine backplanes. We don't use the SATA ports on the motherboard because, despite Intel's claims of port multiplier support in their ICH10 south bridge, we noticed strange results in our performance tests. Silicon Image pioneered port multiplier technology, and their chips work best together.
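
The fan-out in that note can be tallied in a few lines. This is only a sketch: the count of three SYBA cards is inferred from nine backplanes minus the three Addonics ports, and the 1.5 TB drive size is an assumption rather than something the note states.

```python
# Sketch of the SATA fan-out described in the quoted note.
DRIVES_PER_BACKPLANE = 5   # one SiI3726 port multiplier per backplane
DRIVE_TB = 1.5             # assumption: 1.5 TB drives

# (controller, ports in use). Three SYBA cards is inferred, not quoted.
controllers = [
    ("SYBA PCIe card 1 (SiI3132)", 2),
    ("SYBA PCIe card 2 (SiI3132)", 2),
    ("SYBA PCIe card 3 (SiI3132)", 2),
    ("Addonics PCI card (SiI3124)", 3),  # only 3 of its 4 ports are used
]

ports_in_use = sum(ports for _, ports in controllers)
total_drives = ports_in_use * DRIVES_PER_BACKPLANE
raw_capacity_tb = total_drives * DRIVE_TB

for name, ports in controllers:
    print(f"{name}: {ports} ports -> {ports * DRIVES_PER_BACKPLANE} drives")
print(f"{ports_in_use} ports, {total_drives} drives, {raw_capacity_tb} TB raw")
```

Nine ports with five drives each gives 45 drives, and at the assumed drive size that is consistent with the 67 terabytes in the summary.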

  • by pyite (140350) on Wednesday September 02, 2009 @12:13PM (#29286895)

    are you a project manager by any chance?

    Of course not. A project manager would look at this and go, "wow, we saved a lot of money!" It's pretty simple. ZFS does what most other filesystems do not; it guarantees data integrity at the block level by the use of checksums. When you're dealing with this many spindles and dense, non-enterprise drives, you are virtually guaranteed to get silent corruption. The article does not mention any of the words corrupt.*, checksum, or integrity even once. The server doesn't use ECC RAM. The project, while well intentioned, should scare the crap out of anyone thinking about storing data with this company.
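
To make the block-level checksum idea concrete, here is a toy sketch. It is not ZFS's on-disk format or algorithm; the 4 KiB block size, the use of SHA-256, and the function names are all assumptions chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumption: toy block size, not a ZFS default


def write_blocks(data):
    """Split data into blocks and record a checksum for each one."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    checksums = [hashlib.sha256(block).digest() for block in blocks]
    return blocks, checksums


def find_corrupt_blocks(blocks, checksums):
    """Return the indexes of blocks whose contents no longer match."""
    return [
        index
        for index, (block, expected) in enumerate(zip(blocks, checksums))
        if hashlib.sha256(block).digest() != expected
    ]


# Store 16 KiB of data, then flip a single bit in block 2 to simulate
# silent corruption that the drive itself never reports.
blocks, checksums = write_blocks(bytes(range(256)) * 64)
damaged = bytearray(blocks[2])
damaged[10] ^= 0x01
blocks[2] = bytes(damaged)

corrupt = find_corrupt_blocks(blocks, checksums)
print("corrupt blocks:", corrupt)
```

Detection is only half of the story: a filesystem that checksums every block still needs a redundant copy or parity to repair the block it has flagged.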

  • by Anonymous Coward on Wednesday September 02, 2009 @12:44PM (#29287375)

    You realize they are USING this NOT SELLING it, right? They tell YOU how YOU can build one, nowhere are they offering to sell some schmuck a storage array.

    If you don't know how to maintain it, do not try to do it yourself! However, if you do, and you can save the kind of money they are saving, then go for it.

  • Re:Not ZFS? (Score:3, Informative)

    by FoolishBluntman (880780) on Wednesday September 02, 2009 @01:50PM (#29288419)
    >That is scary as hell. You didn't know the drive failed??? Why?? How the heck did they know? Do you really provide them access to your data 24/7?? That's crazy!

    No, moron, high-end disk arrays "phone home" either by dedicated phone line or email when a disk failure occurs. The disk array immediately starts rebuilding a RAID set using a hot spare. The disk you receive in the mail or from an on-site call is to replace the failed drive. They don't need access to your data, just the status of the array subsystem.

    >The biggest argument against the large storage companies, is that large, dynamic companies don't use them. Amazon doesn't. Google doesn't. Facebook doesn't.

    The only company in your list that doesn't use a large storage company is Google. Most companies don't have the in-house expertise to keep track of their data. They outsource a lot of the work so they can concentrate on their core business.
  • Re:Not ZFS? (Score:2, Informative)

    by anegg (1390659) on Wednesday September 02, 2009 @02:04PM (#29288623)
    NetApp provides a function in the storage servers that they sell whereby significant events such as drive failures as well as general health check information can be sent to NetApp if you choose. The information is sent via e-mail or an HTTP POST (if I recall correctly). If you have support services, they monitor your installation via these messages, and will automatically send out a new drive if you have drive replacement services, for example. They do not have remote command access to your storage server (unless you chose to give them that by making the interface available outside of your firewall).
  • Re:Not ZFS? (Score:3, Informative)

    by iphayd (170761) on Wednesday September 02, 2009 @02:12PM (#29288753) Homepage Journal

    So you are saying that they're happy to get their return on investment on their hardware alone in 44 years? I doubt it.

  • by Sandbags (964742) on Wednesday September 02, 2009 @02:55PM (#29289409) Journal

    "Redundancy can be had for another $117,000."

    ...plus the inter-SAN connectivity
    ...plus the SAN fabric-aware write-splitting hardware and licensing
    ...plus the redundancy-aware server connected to that SAN fabric
    ...plus the multipath HBA licensing for the servers
    ...plus multiple redundant HBAs per server and twice as many SAN fabric switches
    ...plus journaling and rollback storage, and block-level deduplication within it (having a real-time copy is useless if you get infected with a virus)
    ...plus another real-time asynchronously replicated SAN at an offsite location at least 100 miles away
    ...plus the ISP connection to the offsite location
    ...plus the staff to support an additional site and all the complex software and clusters
    ...plus cluster-aware operating systems

    This is why Tier 0 arrays cost in the millions...

  • by ToasterMonkey (467067) on Wednesday September 02, 2009 @04:32PM (#29290891) Homepage

    My Hitachi will provide me with 200,000 IOPS with 5 ms latency.

    While that is just a TAD overkill for disk backup, these guys' $.11/GB is not something I'd trust my backups on.

    HelloWorld.c is to the Linux kernel as this thing is to the Hitachi USP-V or EMC Symmetrix.

    You nailed it.

    Service Time/IOPS is less important here than trustworthy and proven controller hardware & software, and built in goodies like replication. That's why I would trust disk backups to Sun, NetApp, Hitachi, EMC, and not these people. Possibly home systems I guess, but bragging about homemade storage is a real turnoff.

  • Re:Not ZFS? (Score:1, Informative)

    by Anonymous Coward on Wednesday September 02, 2009 @05:47PM (#29291985)

    As long as there is enough intelligence in their software that runs higher up in the stack that it can accommodate for the expected device failures, they should be good to go.

    Having read their blog post on the hardware, I'm going to just go ahead and assume that their software stack is every bit as half-assed, cheap, and hackish as their hardware.

    I mean, these guys are pinching pennies to an unbelievable extent. Did you see how rickety their mechanical support system for the drives is? Real hardware vendors build things which can actually be shipped. You could never ship these guys' box anywhere -- it is so fragile looking that I'd be afraid to do so much as tip it on its side once it's assembled. They probably only get away with it by hand building each box on site.

    They used a consumer grade Core 2 Duo motherboard, the Intel DG43NB. By chance, this is the board I used to build my girlfriend's gaming PC earlier this year. It is a reasonable (but not great - I was rushed and it was what Fry's had in stock) board to use for a medium performance gaming/email/web PC. It is NOT, however, a server board. Not even close. Not even if you're trying to build storage on the cheap.

    As a consequence of this motherboard choice, they had to use shitty SATA cards to talk to the drives. Three consumer grade 1x PCIe 2-port cards, and - get this - one 4-port PCI card. Yes, that's right, they bottleneck 40% of the SATA ports through a single 32-bit 33 MHz PCI slot. Noooo, it's just too spendy to buy a $500 board actually designed for server purposes with lots of PCIe slots... they just HAD to have the $85 consumer grade motherboard!

    They used two gamer ATX power supplies instead of selecting a server grade 4U power supply.

    On and on. Everything about it screams "we have no actual experience building hardware". In fact, everything also screams "We don't even have any experience operating decent server grade hardware", because they'd run away screaming from their own hardware design if they had said experience.
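
The PCI bottleneck claimed above can be put into rough numbers. This sketch uses the theoretical peak of a 32-bit, 33 MHz PCI bus; the five drives per port and the three Addonics ports in use come from the Backblaze note quoted elsewhere in this thread, and real-world throughput would be lower than the peak.

```python
# Theoretical peak of a 32-bit, 33 MHz PCI bus (shared by every device on it).
PCI_BUS_WIDTH_BYTES = 4
PCI_CLOCK_HZ = 33_000_000
pci_peak_mb_s = PCI_BUS_WIDTH_BYTES * PCI_CLOCK_HZ / 1e6

# Port counts from the comment: three 2-port PCIe cards and one 4-port PCI card.
total_ports = 3 * 2 + 4
pci_port_share = 4 / total_ports

# Assumptions taken from the quoted Backblaze note: 5 drives per port
# multiplier, and only 3 of the 4 PCI card ports actually in use.
DRIVES_PER_PORT = 5
PCI_PORTS_IN_USE = 3
drives_behind_pci = PCI_PORTS_IN_USE * DRIVES_PER_PORT
per_drive_mb_s = pci_peak_mb_s / drives_behind_pci

print(f"PCI peak: {pci_peak_mb_s:.0f} MB/s")
print(f"share of ports on PCI: {pci_port_share:.0%}")
print(f"{drives_behind_pci} drives share it: {per_drive_mb_s:.1f} MB/s each")
```

So the 40% figure refers to the ports available, while the drives actually hanging off the PCI card are a third of the pod. Whether under 10 MB/s per drive with all of them busy is a problem depends on the workload, which the comment does not quantify.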

  • by sholto (1149025) on Thursday September 03, 2009 @01:55AM (#29296037)

    I'd need some actual uptime data to make a real judgment on their service vs their competitors,

    I did an extensive interview with the Backblaze CEO. No hard data on uptime, but he says they lose one drive a week from the whole 1.5-petabyte system and have never had a pod fail. They've been running for a year. Here's the link to the story, which also has comments about the design and testing process. http://www.crn.com.au/News/154760,want-a-petabyte-for-under-us120000.aspx [crn.com.au]
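
The one-drive-a-week figure can be turned into a rough annualized failure rate. This sketch assumes the 1.5 petabytes is raw capacity made up of 1.5 TB drives; the drive size is an assumption, not something stated in the interview summary above.

```python
# Figures from the comment: 1.5 PB system, about one drive lost per week.
SYSTEM_TB = 1500
FAILURES_PER_WEEK = 1

# Assumption: raw capacity built from 1.5 TB drives.
DRIVE_TB = 1.5

drive_count = SYSTEM_TB / DRIVE_TB
failures_per_year = FAILURES_PER_WEEK * 52
annualized_failure_rate = failures_per_year / drive_count

print(f"{drive_count:.0f} drives, {failures_per_year} failures per year")
print(f"annualized failure rate: {annualized_failure_rate:.1%}")
```

That works out to roughly one drive in twenty per year under these assumptions. With a single year of operation behind it, the figure says nothing yet about how the drives behave as they age.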
