Build Your Own $2.8M Petabyte Disk Array For $117k
Chris Pirazzi writes "Online backup startup BackBlaze, disgusted with the outrageously overpriced offerings from EMC, NetApp and the like, has released an open-source hardware design showing you how to build a 4U, RAID-capable, rack-mounted, Linux-based server using commodity parts that contains 67 terabytes of storage at a material cost of $7,867. This works out to roughly $117,000 per petabyte, which would cost you around $2.8 million from Amazon or EMC. They have a full parts list and diagrams showing how they put everything together. Their blog states: 'Our hope is that by sharing, others can benefit and, ultimately, refine this concept and send improvements back to us.'"
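As a sanity check on the submission's numbers, the per-petabyte figure follows directly from the pod price and capacity (a quick sketch; decimal terabytes and petabytes are assumed):

```python
POD_COST_USD = 7867       # material cost of one 4U pod (from the summary)
POD_CAPACITY_TB = 67      # raw storage per pod
TB_PER_PB = 1000          # decimal petabyte

pods_per_pb = TB_PER_PB / POD_CAPACITY_TB     # ~14.9 pods to reach 1 PB
cost_per_pb = POD_COST_USD * pods_per_pb
print(f"{pods_per_pb:.1f} pods, ${cost_per_pb:,.0f}/PB")   # ~14.9 pods, ~$117,418/PB
```

So the quoted $117,000/PB checks out, to within rounding.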
Not ZFS? (Score:2, Insightful)
Good luck with all the silent data corruption. Shoulda used ZFS.
You know why Amazon charges that much? (Score:5, Insightful)
Support.
A Very Shortsighted Article (Score:3, Insightful)
Before realizing that we had to solve this storage problem ourselves, we considered Amazon S3, Dell or Sun Servers, NetApp Filers, EMC SAN, etc. As we investigated these traditional off-the-shelf solutions, we became increasingly disillusioned by the expense. When you strip away the marketing terms and fancy logos from any storage solution, data ends up on a hard drive.
That's odd; where I work we pay a premium for what happens when the power goes out, what happens when a drive goes bad, what happens when maintenance needs to be performed, what happens when the infrastructure needs upgrades, etc. This article left out a lot of buzzwords, but it also left out the people who manage these massive beasts. I mean, how many hundreds (or thousands) of drives are we talking here?
You might as well add a few hundred thousand a year for the people who need to maintain this hardware and also someone to get up in the middle of the night when their pager goes off because something just went wrong and you want 24/7 storage time.
We don't pay premiums because we're stupid. We pay premiums so we can relax and concentrate on what we need to concentrate on.
Ripoff (Score:5, Insightful)
Looks like a cheap, downscaled, undersized version of a Sun X4500/X4540.
And as others have pointed out, you pay a vendor because in 4 years they will still be stocking the drives you bought today, whereas with this setup you will be praying they are still on eBay.
That's great but what about all the hidden costs? (Score:1, Insightful)
Disk replacement? (Score:4, Insightful)
How do you replace disks in the chassis? We've got 1,000 spinning disks and we get a few failures a month. With 45 disks in each unit you are going to have to replace a few consumer-grade drives.
wtf? (Score:5, Insightful)
But when we priced various off-the-shelf solutions, the cost was 10 times as much (or more) than the raw hard drives.
Um... and what do you plan on running these disks with? HDs don't magically store and retrieve data on their own. The HDs are cheap compared to the other parts that create a storage system. That's like saying a Ferrari is a ripoff because you can buy an engine for $3,000.
Yeah, but with Amazon you get FREE SHIPPING !! (Score:2, Insightful)
I love free shipping, even if it costs me more !! I like FREE STUFF !!
Re:That's great but what about all the hidden cost (Score:2, Insightful)
They designed and built it so they should know how to support it. If someone else builds one, just learning how to get that beast up and running is excellent hands on training.
Re:Ripoff (Score:3, Insightful)
why wouldn't you just build an entirely new pod with current disks and migrate the data? You could certainly afford it.
Not that shortsighted for their purposes (Score:5, Insightful)
Yeah, this only works if you're the geeks building the hardware to begin with. The real cost is in setup and maintenance. Plus, if the shit hits the fan, the CxO is going to want to find some big butts to kick. 67TB of data is a lot to lose (though it's only about 35 disks at max capacity these days).
These guys, however, happen to be the geeks, the maintainers, and the people-whose-butts-get-kicked-anyway. This is not a project for a one- or two-man IT group that has to build a storage array for their 100-200 person firm. These guys are storage professionals with the hardware and software know-how to pull it off. Kudos to them for making it and sharing their project. It's a nice, compact system. It's a little bit of a shame that there isn't OTS software, but at this level you're going to be doing grunt work on it with experts anyway.
FWIW, Lime Technology (lime-technology.com) will sell you a case, drive trays, and software for a quasi-RAID system that will hold 28TB for under $1,500 (not including the 15 2TB drives, another $3k on the open market). This is only single-fault tolerant, though failure is more graceful than in a traditional RAID. I don't know if they've implemented hot spares or automatic failover yet (which would put them up to two-fault tolerance on the drives, like RAID6).
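The 28TB figure is just single-parity arithmetic (a generic sketch of parity capacity math, not Lime Technology's actual implementation):

```python
def usable_tb(n_drives, drive_tb, parity_drives=1):
    """Usable capacity of a parity-protected array: only the data drives count."""
    return (n_drives - parity_drives) * drive_tb

print(usable_tb(15, 2))                    # 15 x 2TB, single parity -> 28
print(usable_tb(15, 2, parity_drives=2))   # dual parity (RAID6-style) -> 26
```

Going to dual parity costs one more drive of capacity but survives a second failure.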
Re:You know why Amazon charges that much? (Score:5, Insightful)
Let's try to be a bit more supportive here! (Score:5, Insightful)
If an article went up describing how a major vendor released a petabyte array for $2M, the comments would be full of people saying "I could make an array with that much storage far cheaper!"
Now someone has gone and done exactly that (they even used Linux to do it), and suddenly everyone complains that it lacks support from a major vendor.
This may not be perfect for everyone's needs, but it's nice to see this sort of innovation taking place instead of blindly following the same path everyone else takes for storage.
Re:You know why Amazon charges that much? (Score:5, Insightful)
That 2.683M also pays for salaries, pretty building(s), advertising, research, conventions, and more advertising.
I could hire a couple of dedicated staff to have 24x7 support for far less than 2.683M, plus a duplicate system worth of spare parts.
This stuff isn't rocket science. Most companies don't need high-speed, fiber-optic disk array subsystems for a significant amount of their data, only for a small subset that needs blindingly fast access. The rest can sit on cheap arrays: for example, all of my network-accessible files that I open very rarely but keep on the network because they get backed up, or all of my 5 copies of database backups and logs that I keep because it's faster to pull them off of disk than to request a tape from offsite. And it's faster to back up to disk first, then to tape.
Backblaze is a good example of someone that needs a ton of storage but not lightning-fast access. Having a reliable system is more important to them than one that has all the tricks and trappings of an EMC array, features that probably 10% of all EMC users actually use but that they all pay for.
What's all the hate? (Score:5, Insightful)
It's like looking at KDE and saying "But we pay Apple and Microsoft so we get support" (even though, no, you don't). The company is just releasing specs; if it fits in your environment, great, if not, bummer. If you can make improvements and send them back upstream, everyone wins. Just like software.
I seem to recall similar threads whenever anyone mentions open routers from the Cisco folks.
Re:You know why Amazon charges that much? (Score:5, Insightful)
Backup: depends on the backup strategy. I could make this happen for less than an additional 10%. But ok, point taken.
Redundancy: You mean as in plain redundancy? These are RAID arrays are they not? You want redundancy at the server level? Now you're increasing the scope of the project which the article doesn't address. (Scope error)
Hosting: Again, the point of the article was the hardware. That's a little like accounting for the cost of a trip to your grandmother's, and factoring in the cost of your grandmother's house. A little out of scope.
Cooling: I could probably get the whole project chilled for less than 6% of the total cost, depending on how cool you want the rig to run.
I think you're looking for a wrench in the works where none exist.
Re:You know why Amazon charges that much? (Score:5, Insightful)
Redundancy can be had for another $117,000.
Hosting in a DC will not even be a blip in the difference between that and $2.7m.
EMC, Amazon etc are a ripoff and I have no idea why there are so many apologists here.
Re:A Very Shortsighted Article (Score:5, Insightful)
You will more than likely NOT have to take a node offline. The design looks like they place the drives into drop-in hot-plug enclosures. Most rack-mounted hardware is on rails, not screwed to the rack. You slide the unit out, log in, fail the drive that is bad, remove it, hot-plug another drive, and add it to the array. You are now done.
They went RAID 6, even though it is slow as shit, for the added failsafe mechanisms.
Re:Not ZFS? (Score:3, Insightful)
Are you saying that with the more expensive system, disks never fail and nobody ever has to get up in the night?
Re:A Very Shortsighted Article (Score:4, Insightful)
Why would you bother? Just start off by writing the data to three nodes, and then you can swap new ones in and out silently. If your space really is cheap, then that's not a problem.
Re:Liability insurance (Score:2, Insightful)
If you build a petabyte stack using 1.5TB disks you need about 800 drives including RAID overhead. With an MTBF for consumer drives of 500,000 hours, a drive will fail roughly every 10-15 days, if your design is good and you create no hotspots/vibration issues.
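For what it's worth, the datasheet arithmetic is slightly gentler than that (a sketch; the 13-data-plus-2-parity RAID6 layout is an assumption). The 10-15 day figure corresponds to field failure rates roughly double the spec MTBF, which published drive-survey data suggests is realistic for consumer drives:

```python
import math

PB_TB = 1000.0            # decimal petabyte, in TB
DRIVE_TB = 1.5
RAID6_OVERHEAD = 15 / 13  # assumption: 13 data + 2 parity drives per 15-drive set
MTBF_HOURS = 500_000      # datasheet MTBF for a consumer drive

drives = math.ceil(PB_TB / DRIVE_TB * RAID6_OVERHEAD)   # ~770 drives
days_between_failures = MTBF_HOURS / drives / 24
print(f"{drives} drives -> one failure every {days_between_failures:.0f} days")
```

So the spec predicts a failure roughly every 27 days across the fleet; halve the MTBF to model real-world rates and you land in the 10-15 day range above.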
Rebuild times on large RAID sets are such that it is only a matter of time before they run a double drive failure and lose their customers data. The money they saved by going cheap will be spent on lawyers when they get the liability claims in.
If you RTFA, you will see that they are using RAID6 with 2 parity drives per raid, so a double drive failure can be handled, and it is only the less likely triple drive failure that will ruin them. It seems weak that they don't have hot-swappable drives in this configuration, but they have software that is managing the data across disk sets, and presumably they have redundant copies of data that keep the data accessible when one of their servers is taken down to replace a drive (if they don't, the downtimes due to replacing drives will make the service useless). This redundancy may also save them in the case that they actually lose a RAID set.
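The rebuild-window risk can be put in rough numbers (a sketch assuming independent failures and a hypothetical 24-hour rebuild; drives in one chassis share heat, vibration, and often a manufacturing batch, so treat this as a lower bound):

```python
import math

def p_loss_during_rebuild(n_remaining, rebuild_hours, mtbf_hours, spares_left):
    """P(more than `spares_left` additional drives die before the rebuild ends)."""
    p = 1 - math.exp(-rebuild_hours / mtbf_hours)   # per-drive failure probability
    p_ok = sum(math.comb(n_remaining, k) * p**k * (1 - p)**(n_remaining - k)
               for k in range(spares_left + 1))
    return 1 - p_ok

# RAID6 set of 15 with one drive already dead: 14 survivors, 1 more failure tolerated
risk = p_loss_during_rebuild(14, rebuild_hours=24, mtbf_hours=500_000, spares_left=1)
```

The per-rebuild number is tiny, but it compounds across hundreds of arrays and years of rebuilds, and it grows roughly quadratically with rebuild time, which is why long rebuilds on big sets worry people.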
Re:they are missing hardware mgmt (Score:5, Insightful)
Personally, I have a Linux box at home running JFS and RAID5 with hot-swap drive trays, but I don't fool myself into thinking it's BETTER than Sun, HP, IBM and so on.
I don't think these folks believe their solution is better, just cheaper. MUCH cheaper. So much cheaper that you can employ a team of people to maintain the "homebrew" solution and still save money.
Re:Not ZFS? (Score:5, Insightful)
And I think I would use dual micro-ATX motherboards, perhaps in their own cases to make them replaceable in case of failure.
I realize that the layout of the drives was done with an eye toward airflow, but I personally don't like to see drives set on their edges. It's probably a personal bias, but I like to see drives set flat. The bearings seem to last longer that way. Just my personal experience.
And, one final point, storage density is reaching the point where we can jam a lot of storage into a small space. Perhaps we have reached the point where we can start to spread things out and do things like put the drives in a separate enclosure or multiple enclosures. It makes designing, installing, and servicing easier. Use eSATA ports on the SATA cards to make external storage easier.
Re:Disk replacement? (Score:1, Insightful)
It's the google model: you don't replace failed components. (This isn't meant for a case where you have 1 'server'; this is meant for when you have hundreds of these pods.) The labor is better served deploying a new pod with 45 new disks than replacing one disk in 45 pods.
Re:Not ZFS? (Score:4, Insightful)
As evidence of that, I submit that dozens of companies like the one in this article have existed over the years, and only a handful of them still exist. Those that still do have either exited the storage array business, or have evolved their offerings into something that costs a lot more to build and support than a pile of disks.
Or they have been bought by one of the bigger storage companies.
are you a project manager by any chance? (Score:5, Insightful)
I like how you dismiss a detailed real world design example based simply on a claimed feature without any further substantiation. Very classy. I'm not saying you are wrong, but would it kill you to go into a little more detail about why these folks need "luck" when they are clearly very successful with their existing design?
Don't forget where the real value is (Score:3, Insightful)
The real value in a data storage system isn't in the hardware, it's in the data. And the real cost incurred in a data storage system is measured in the inability of the customer to access that data quickly, efficiently and (in the case of a disaster) at all.
If you need to crunch the data quickly, a higher-performing system is going to save you money in the end. Look at all the benchmarks: no home-grown systems are anywhere on the lists. If you want to stream through your data at several gigabytes per second, you need to pay for a fast interconnect. Putting 45 drives behind a single 1GbE just doesn't cut it.
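To put that last point in numbers (the per-drive throughput here is an assumed 2009-era figure, not from the article):

```python
GBE_MB_S = 125        # 1 GbE line rate is ~125 MB/s before protocol overhead
DRIVE_MB_S = 100      # assumed sequential throughput of one consumer SATA drive
N_DRIVES = 45

aggregate = N_DRIVES * DRIVE_MB_S
print(f"drives: ~{aggregate} MB/s aggregate; NIC: ~{GBE_MB_S} MB/s "
      f"(~{aggregate // GBE_MB_S}x oversubscribed)")
```

For a backup workload that trickles in over customers' upstream links, being network-bound is a deliberate trade, not an oversight; for streaming analytics it would be a dealbreaker.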
Similarly, if you want to ensure that the data is protected (integrity, immutable storage for folks who need to preserve data and be certain it hasn't been tampered with, etc.) and stored efficiently (single-instance store, or dedupe, so you don't fill your petabytes of disks with a bajillion copies of the same photos of Anna Kournikova), then you need to pay for the extra goodness in that software and hardware as well.
Finally, if you want extremely high availability, then the cost of the hardware is miniscule compared to the cost of downtime. We had customers that would lose millions of dollars per service interruption. They're willing to pay a million dollars to eliminate or even reduce downtime.
These folks are essentially just building a box that makes a bunch of disks behave like a honking big tape drive. It's a viable business--that's all some folks need. But EMC et al are not going to lose any sleep over this.
Re:are you a project manager by any chance? (Score:4, Insightful)
What failure rate are you using to "virtually guarantee" that you'll get data corruption with 45 drives?
What failure rate in your RAM, CPU, and motherboard are you using to guarantee that the ZFS checksums are not themselves corrupted? Not to mention the high possibility of bugs in a younger file system, and the different performance characteristics among FSes.
I'm not saying ZFS is a bad plan, at least if you're running enough spindles, but if you're going to "virtually guarantee" silent corruption with less than 100 drives I'd like to see some documentation for the non-detectable failure rates you're expecting.
It's also worth noting that in a lot of data, a small amount of bit-flips might not be worth protecting against at all. Or they might be better protected at the application level instead of the block level: for example, if the data will be transmitted to another system before it is consumed, as would be typical for a disk host like this, a single checksum of the entire file (think md5sum) could be computed at the end-use system, rather than computing a per-block checksum at the disk host and then just assuming the file makes it across the network and through the other system's I/O stack without error.
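The end-to-end check described above is a few lines in practice (a minimal sketch; SHA-256 is chosen arbitrarily):

```python
import hashlib

def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Stream a whole file through a hash so huge files never need to fit in RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

The sender records the digest and the receiver recomputes it after the file has crossed the network and its own I/O stack, so corruption anywhere along the path is caught, not just on the storage host's disks.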
*sigh* (Score:5, Insightful)
How about reading the section "A Backblaze Storage Pod is a Building Block".
<snip> the intelligence of where to store data and how to encrypt it, deduplicate it, and index it is all at a higher level (outside the scope of this blog post). When you run a datacenter with thousands of hard drives, CPUs, motherboards, and power supplies, you are going to have hardware failures — it's irrefutable. Backblaze Storage Pods are building blocks upon which a larger system can be organized that doesn't allow for a single point of failure. Each pod in itself is just a big chunk of raw storage for an inexpensive price; it is not a "solution" in itself.
Emphasis mine. I believe there are quite a few successful and reliable storage vendors not using ZFS. We get the point, you like it. Doesn't mean you can't succeed without it. Be more open minded.
RAID6? Bzzzzt! Wrong answer! (Score:1, Insightful)
Raw storage will always be cheaper than the effort of designing fault-tolerant, high-availability systems, but it's worth the effort to at least implement "good enough" systems that attempt to achieve these qualities rather than sticking with the dumb "stack-em-high" approach. Scalability matters, or else your "super cluster" will quickly be overtaken by the next dumb implementation when the next 18-month increment rolls around.