Facebook's Corona: When Hadoop MapReduce Wasn't Enough
Nerval's Lobster writes "Facebook's engineers face a considerable challenge when it comes to managing the tidal wave of data flowing through the company's infrastructure. Its data warehouse, which handles over half a petabyte of information each day, has expanded some 2500x in the past four years, and that growth isn't going to end anytime soon. Until early 2011, those engineers relied on a MapReduce implementation from Apache Hadoop as the foundation of Facebook's data infrastructure. Still, despite Hadoop MapReduce's ability to handle large datasets, Facebook's scheduling framework (in which a large number of task trackers handle duties assigned by a job tracker) began to reach its limits. So Facebook's engineers went to the whiteboard and designed a new scheduling framework named Corona."
Facebook is continuing development on Corona, but they've also open-sourced the version they currently use.
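For readers who haven't run Hadoop, the job tracker / task tracker arrangement the summary mentions can be sketched roughly as below. This is a toy model, not Hadoop's or Corona's actual code: the class names `JobTracker` and `TaskTracker`, the `slots` field, and the `demo` helper are all made up for illustration. The point is only that a single process holds every pending task and every worker's state, which is where the scaling limit comes from.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    """Hypothetical worker node with a fixed number of task slots."""
    name: str
    slots: int
    running: list = field(default_factory=list)

    def free_slots(self) -> int:
        return self.slots - len(self.running)


class JobTracker:
    """Hypothetical central scheduler: one process tracks all tasks and workers."""

    def __init__(self, trackers):
        self.trackers = list(trackers)
        self.pending = deque()

    def submit(self, job_name, num_tasks):
        # Split a job into named tasks and queue them centrally.
        for i in range(num_tasks):
            self.pending.append(f"{job_name}-task{i}")

    def schedule(self):
        """Hand pending tasks to trackers with free slots; return assignments."""
        assignments = []
        for tracker in self.trackers:
            while self.pending and tracker.free_slots() > 0:
                task = self.pending.popleft()
                tracker.running.append(task)
                assignments.append((task, tracker.name))
        return assignments


def demo():
    # Two workers with three slots total, one job with four tasks.
    trackers = [TaskTracker("tt1", slots=2), TaskTracker("tt2", slots=1)]
    jt = JobTracker(trackers)
    jt.submit("wordcount", 4)
    assigned = jt.schedule()
    return assigned, list(jt.pending)
```

Calling `demo()` fills all three slots and leaves one task waiting in the central queue. Every scheduling decision passes through the one `JobTracker` object, so its workload grows with both the number of jobs and the number of workers.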
Re:Misleading headline (Score:4, Funny)
And why the fuck should I care about Windows 8 tablets? You are not making any sense!
Re: (Score:1)
No, I'M Spartacus!
Re: (Score:1)
Much as those who were holding their iPhone wrong were at fault?
Seriously, the Job Tracker just didn't scale well and applications had to worry about it - that's a broken architecture, not a broken application or deployment. Blaming the application or deployment for serious fundamental architectural flaws of the platform is much like blaming an application programmer in 1980 for using a=a+1 which a compiler happened to implement less efficiently than a++ or even a+=1 (or, for you old timers, a=+1 not to be
Re: (Score:3)
But between you and 1000 other people who care about slightly different sets, much of it is stuff that someone cares about.
Re:Junk. (Score:5, Insightful)
Too bad that's 99.9% junk I don't care about.
But between you and 1000 other people who care about slightly different sets, much of it is stuff that someone cares about.
This. 99.9% (at least) of the entire internet is junk that any one person doesn't care about. But every bit has someone who cares about it (or did at one time) or it wouldn't be there.
Well. I opened the story expected some reflexive Facebook-bashing, and I wasn't disappointed. When are people going to realize that FB's just another internet company with a reasonably successful business model, and worthy of neither adulation nor hatred?
Re: (Score:2)
s/expected/expecting/
[sigh] I do so wish Slashdot would allow editing posts, at least for a limited time (say, until they've been moderated or replied to). C'mon, even Facebook can manage that. ;)
Re: (Score:2)
This. 99.9% (at least) of the entire internet is junk that any one person doesn't care about.
I've done a crawl of a few billion pages.
No person at all cares about 99% of the content available on the internet. In fact, nearly that much is completely unreadable, machine-generated gibberish (real words, not sentences) created in an attempt to fool Google and other search engines.
There are a few servers which host millions of subdomains with millions of manufactured pages under each subdomain.
In short, it's far
Re: (Score:3)
Wrong. FB is worthy of hatred because what they do is inherently evil. They spy on people, and sell off that information.
The "it's just a job/business" excuse doesn't work when the job/business is evil. For example, when the local Mafia goons come to collect protection money, it's "just a job" for them right? Nothing personal. They're just
Re: (Score:2)
What you say is sort of true, but I disagree that it is inherently evil. Evil implies a malicious intent. At worst, it's simply sociopathic. Facebook is doing what it's doing so that it can make money, and its methods aren't even remotely secret. They would have no power at all if it weren't handed to them gleefully by people.
Further it's disingenuous to compare them to the mafia and similar, for one simple reason. The mafia does what it does against people who are unwilling participants. Facebook on th
Re: (Score:2)
Re:Junk. (Score:5, Funny)
This post has been removed because it is of no interest to Anonymous Coward. Please try posting things more in line with the following categories:
1. Linux
2. Open-source software
3. Richard M Stallman
4. OMG!!! PONIES!!!
Re: (Score:3)
So, you care about (1 - 0.999) * 500 TB = 500 GB of Facebook information every day??? Dude, where do you get the time?
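The back-of-the-envelope figure above checks out if you assume decimal units (1 TB = 1000 GB) and take "over half a petabyte" as a round 500 TB per day. A quick sketch, with hypothetical names and exact fractions to avoid floating-point noise:

```python
from fractions import Fraction

DAILY_INGEST_TB = 500                 # "over half a petabyte" per day, rounded (assumption)
JUNK_FRACTION = Fraction(999, 1000)   # the "99.9% junk" claim
GB_PER_TB = 1000                      # decimal units (assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def interesting_gb_per_day(daily_tb=DAILY_INGEST_TB, junk=JUNK_FRACTION):
    """GB per day left over once the junk fraction is discarded."""
    return (1 - junk) * daily_tb * GB_PER_TB


def required_mb_per_second(gb_per_day):
    """Sustained read rate, in MB/s, needed to get through that much in a day."""
    return float(gb_per_day * 1000 / SECONDS_PER_DAY)
```

`interesting_gb_per_day()` comes out to exactly 500, and keeping up with that around the clock works out to a bit under 6 MB every second.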
Re: (Score:2)
You may not care, but the people doing datamining to find new ways to push ads at us or find the next serial killer care greatly. You know, the ones that actually pay the bills.
Re: (Score:3, Informative)
Hadoop: massive data storage system framework... "Apache Hadoop is an open-source software framework that supports data-intensive distributed applications"
MapReduce: a way of managing distributed clusters of data sets... "MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers"
Scheduling framework: a framework for providing optimal scheduling of something such t
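To make the "programming model" part of the MapReduce definition above concrete, here is a minimal single-process sketch of the classic word-count example. It is plain Python, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are invented for illustration. A real cluster would run the map and reduce steps on many machines, with the framework handling the shuffle between them.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse one key's values into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, `word_count(["a b a", "b c"])` gives `{"a": 2, "b": 2, "c": 1}`. Since each map call looks at one document and each reduce call looks at one key, both steps can be spread across machines without the pieces needing to talk to each other.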
Re: (Score:1)
No snark intended... no sarcasm given. The terms describe things that are technical. If you want something more generic, I could go as far as "Database management architecture" and "database communication architecture" but that dumbs things down to the point where it adds nothing to the discussion. If you don't understand what a database is and how it works (and that we're talking about database management here), you're going to find this entire article over your head, not just the industry buzzwords.
Kind
java (Score:1)
after paging through the code a bit, i found it interesting that they use java in their implementation (not just corona, but hadoop as well). i was wondering why, and after some googling found this link [nabble.com] which helped explain the situation a bit more clearly.
pretty interesting stuff. but i'd be willing to bet google's map reduce is written in c/c++
Re: (Score:2)
Hadoop is not real time; it's a batch processing system. No one gives a damn whether a node spends 50ms garbage collecting every so often.
Re: (Score:2)
Facebook (Score:4, Interesting)
I have to admit, while I hate using Facebook, and hate most of their business practices, I like how they're not just writing new infrastructure software, but are open-sourcing it all. I don't think it quite makes up for everything else, but it helps.
How many IT projects... (Score:2)
Have been code-named corona these last few years?? Seems like every org's got a project named corona nowadays.
Re: (Score:2)
Have been code-named corona these last few years?
The only one I can think of involves me remotely managing a server from the beach with only a lime wedge and cold beer.
I have little sympathy for them (Score:1)
Re: (Score:2)
They could start by actually deleting deleted content
They could, but why should they put themselves at a disadvantage over Google, every other corporation that buys such data and the NSA, who all most certainly do not delete stuff in the way you'd like them to?