Intel Launches Its Own Apache Hadoop Distribution

Posted by Soulskill on Tuesday February 26, 2013 @05:51PM from the if-you-want-something-done-right-do-it-yourself dept.

Nerval's Lobster writes "The Apache Hadoop open-source framework specializes in running data applications on large hardware clusters, making it a particular favorite among firms such as Facebook and IBM with a lot of backend infrastructure (and a whole ton of data) to manage. So it'd be hard to blame Intel for jumping into this particular arena. The chipmaker has produced its own distribution for Apache Hadoop, apparently built 'from the silicon up' to efficiently access and crunch massive datasets. The distribution takes advantage of Intel's work in hardware, backed by the Intel Advanced Encryption Standard (AES) Instructions (Intel AES-NI) in the Intel Xeon processor. Intel also claims that a specialized Hadoop distribution riding on its hardware can analyze data at superior speeds—namely, one terabyte of data can be processed in seven minutes, versus hours for some other systems. The company faces a lot of competition in an arena crowded with other Hadoop players, but that won't stop it from trying to throw its muscle around."

- Big Data != big (Score:2)
  
  by oneiros27 ( 46144 ) writes:
  
  Not always. It's been used as such a buzzword that it's come to be used any time when the amount or complexity becomes a limit to what you're trying to do.
  So in the case of NRT (near real time), it might be a relatively small amount of data. Or it might be that there's enough different formats of data or other complexity that it's a problem.
  And it's also discipline specific ... I've heard of groups complaining about 50GB being a lot of data ... because they're dealing with tens of thousands of Excel sprea
Speed (Score:2)

by stewsters ( 1406737 ) writes:

How does that compare to something like spark [spark-project.org]?
- Re:Speed (Score:5, Informative)
  
  by Anonymous Coward writes: on Tuesday February 26, 2013 @06:36PM (#43019227)
  
  The performance claim in the summary seems to come from page 15 of this presentation [intel.com], where the speedup for a 1TB sort (presumably distributed) is 4 hours -> 7 minutes. I can't find the details for that test, but most of the speedup comes from using better hardware - faster CPU and network adapter, and SSDs instead of HDDs - while they get a 40% speedup from using their Hadoop distribution over some other Hadoop distribution, which is a fairly modest gain.
  The biggest performance benefit of Spark comes from avoiding disk and network access, so improving those bottlenecks will presumably reduce Spark's lead over Hadoop somewhat. But it's hard to say how well Spark would do with this particular hardware and test setup. I would guess it's still much faster than their Hadoop distribution. (Note: I'm a Spark power user but not an expert in its performance.)
  
  - Re: (Score:1)
    
    by Anonymous Coward writes:
    
    Yeah, the details in that presentation describe something far less impressive than the top-line "4 hours -> 7 minutes" claim. You are absolutely correct that only a very modest amount of the ~35x speedup claimed is attributable to the Intel Hadoop distribution itself, with the bulk of the speedup coming from significant hardware upgrades across the cluster. Spark wouldn't benefit from the hardware changes in exactly the same way, but it would still see significant gains from upgrading the cluster hardw
  - Re: (Score:1)
    
    by Anonymous Coward writes:
    
    Approximated results from the presentation:
    - Hadoop 1.0.3, old Xeon, HDD, 1G Ethernet -> 240 minutes
    - Hadoop 1.0.3, new Xeon, HDD, 1G Ethernet -> 120 minutes
    - Hadoop 1.0.3, new Xeon, SSD, 1G Ethernet -> 24 minutes
    - Hadoop 1.0.3, new Xeon, SSD, 10G Ethernet -> 12 minutes
    - Hadoop 2.1.1, new Xeon, SSD, 10G Ethernet -> 7 minutes
    The only useful conclusion is that changing the Hadoop version from 1.0.3 to 2.1.1 can give you roughly a 40% reduction in duration (12 -> 7 minutes). I wonder how it would work for other hardware config
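
The step-by-step figures above can be turned into per-change speedups with a few lines of Python. This is only a sketch over the commenter's approximated numbers; the names `RESULTS`, `step_gains`, and `total_speedup` are made up for illustration, and it assumes each row changes exactly one factor relative to the row before it.

```python
# Approximate durations (minutes) for the 1TB sort, as quoted in the comment above.
RESULTS = [
    ("Hadoop 1.0.3, old Xeon, HDD, 1G Ethernet", 240),
    ("Hadoop 1.0.3, new Xeon, HDD, 1G Ethernet", 120),
    ("Hadoop 1.0.3, new Xeon, SSD, 1G Ethernet", 24),
    ("Hadoop 1.0.3, new Xeon, SSD, 10G Ethernet", 12),
    ("Hadoop 2.1.1, new Xeon, SSD, 10G Ethernet", 7),
]


def step_gains(results):
    """Compare each configuration with the one before it.

    Returns a list of (config, speedup_factor, percent_reduction) tuples.
    """
    gains = []
    for (_, before), (config, after) in zip(results, results[1:]):
        speedup = before / after                # e.g. 240 / 120 = 2.0x
        reduction = 100 * (1 - after / before)  # e.g. 50% less time
        gains.append((config, speedup, reduction))
    return gains


def total_speedup(results):
    """Speedup of the last configuration over the first."""
    return results[0][1] / results[-1][1]


if __name__ == "__main__":
    for config, speedup, reduction in step_gains(RESULTS):
        print(f"{config}: {speedup:.2f}x faster, {reduction:.1f}% less time")
    print(f"Total: {total_speedup(RESULTS):.1f}x")
```

On these numbers the three hardware changes account for a 20x gain (2 x 5 x 2) and the Hadoop version change for about 1.7x, which together give the roughly 34x headline figure.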
- Re: (Score:3, Informative)
  
  by Anonymous Coward writes:
  
  It's impossible to say without the details of apples-to-apples comparisons, but superficially, none of the announcements of "improved Hadoop" from Intel, Greenplum, Hortonworks, etc. is all that impressive in comparison to Spark even if you assume that none of their improvements can or will be integrated into Spark. Take, for example, a couple of the claims that Intel is making for their new Hadoop distribution. First, the "four hour job reduced to seven minutes" claim is the same ballpark 30-40x claim ma
neat, but (Score:3)

by masternerdguy ( 2468142 ) writes: on Tuesday February 26, 2013 @06:12PM (#43018961)

So they've migrated an open solution to a vendor-locked-in solution? Sweet.

- Re: (Score:1)
  
  by wlj ( 204164 ) writes:
  
  The (stated) speed-up could be nice, but:
  (1) how locked-in is it (just some tuning, serious modification, what?)
  (2) have they actually released it?
  - Re: (Score:3)
    
    by networkBoy ( 774728 ) writes:
    
    Even if it's completely locked in, your data isn't.
    Simple, really: if you have Intel hardware, use this distro to take advantage of it; otherwise use the Apache one. No reason AMD or nVidia can't do the same...
    -nb
Because AES is the true bottleneck in hadoop (Score:4, Insightful)

by citizenr ( 871508 ) writes: on Tuesday February 26, 2013 @06:20PM (#43019065) Homepage

...

- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  AES-NI is not just AES processing, it includes significant improvements to general vector processing instructions.
  - Re: (Score:2)
    
    by dkf ( 304284 ) writes:
    
    One could argue that engineers should pay attention in physics class, or conversely, theorists should get their hands dirty once in a while.
    Could we have both? A bit of realism on both sides would be nice...
gentoo (Score:1)

by Anonymous Coward writes:

must run gentoo
