


Intel Launches Its Own Apache Hadoop Distribution

Nerval's Lobster writes "The Apache Hadoop open-source framework specializes in running data applications on large hardware clusters, making it a particular favorite among firms such as Facebook and IBM with a lot of backend infrastructure (and a whole ton of data) to manage. So it'd be hard to blame Intel for jumping into this particular arena. The chipmaker has produced its own distribution for Apache Hadoop, apparently built 'from the silicon up' to efficiently access and crunch massive datasets. The distribution takes advantage of Intel's work in hardware, backed by the Intel Advanced Encryption Standard New Instructions (Intel AES-NI) in the Intel Xeon processor. Intel also claims that a specialized Hadoop distribution riding on its hardware can analyze data at superior speeds: one terabyte of data can be processed in seven minutes, versus hours for some other systems. The company faces a lot of competition in an arena crowded with other Hadoop players, but that won't stop it from trying to throw its muscle around."


  • Re:Speed (Score:5, Informative)

    by Anonymous Coward on Tuesday February 26, 2013 @06:36PM (#43019227)

    The performance claim in the summary seems to come from page 15 of this presentation [intel.com], where the speedup for a 1TB sort (presumably distributed) is 4 hours -> 7 minutes. I can't find the details for that test, but most of the speedup comes from using better hardware - faster CPU and network adapter, and SSDs instead of HDDs - while they get a 40% speedup from using their Hadoop distribution over some other Hadoop distribution, which is a fairly modest gain.

    The biggest performance benefit of Spark comes from avoiding disk and network access, so improving those bottlenecks will presumably reduce Spark's lead over Hadoop somewhat. But it's hard to say how well Spark would do with this particular hardware and test setup. I would guess it's still much faster than their Hadoop distribution. (Note: I'm a Spark power user but not an expert in its performance.)
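
As a rough check on the numbers in the comment above, the sketch below splits the 4 hours -> 7 minutes result into a software part and a remaining (hardware) part. It assumes the two factors simply multiply and that "40% speedup" means 1.4x throughput; neither assumption is confirmed by the presentation, and the function name is purely illustrative.

```python
def split_speedup(before_min, after_min, software_factor):
    """Split an overall speedup into a software factor and the remainder.

    Assumption: the factors combine multiplicatively, which is a
    simplification; real bottlenecks rarely separate this cleanly.
    """
    total = before_min / after_min        # overall speedup, e.g. 240 / 7
    hardware = total / software_factor    # what is left after the software gain
    return total, hardware


# 4 hours -> 7 minutes, with an assumed 1.4x from the Hadoop distribution itself.
total, hardware = split_speedup(before_min=4 * 60, after_min=7, software_factor=1.4)
print(f"total: {total:.1f}x, hardware share: {hardware:.1f}x, software share: 1.4x")
```

Under those assumptions, roughly 24x of the overall ~34x would be attributable to the faster CPU, network adapter, and SSDs rather than to the distribution.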

  • Re:Speed (Score:3, Informative)

    by Anonymous Coward on Tuesday February 26, 2013 @06:38PM (#43019265)

    It's impossible to say without the details of apples-to-apples comparisons, but superficially, none of the announcements of "improved Hadoop" from Intel, Greenplum, Hortonworks, etc. is all that impressive in comparison to Spark even if you assume that none of their improvements can or will be integrated into Spark. Take, for example, a couple of the claims that Intel is making for their new Hadoop distribution. First, the "four hour job reduced to seven minutes" claim is the same ballpark 30-40x claim made for some of the other "improved Hadoop" offerings. For each of these, I'd be surprised if 30-40x speedup could be expected in the general case, and not just for some quite specific use cases. In contrast, Spark achieves 30-40x speedups across a wide range of jobs, and often does significantly better. Second, Intel claims an 8.5x speedup for Hive queries. That is much less than speedups that are routinely achieved with Shark (Hive on Spark), and the best-case scenarios for Shark speedups are more than a full order of magnitude greater than Intel's claim.

    In short, the "improved Hadoop" distributions do make significant advances over current Hadoop, but they don't really do anything to change my mind that in-memory data clusters are the way forward and away from many of the limitations of Hadoop/MapReduce, or that Spark is the leading implementation of such an in-memory cluster computing framework. At this point, the main advantages the various Hadoops have over Spark are in the areas of the maturity of the technology and the coverage and usefulness of management and integration layers on top of the basic cluster computing framework. As long as it remains disk-oriented and doesn't retain the working dataset in memory, I wouldn't expect Hadoop to close the raw performance gap with Spark.
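
The ratios quoted in the comment above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, and "more than a full order of magnitude greater" is read here as "more than 10x the Intel figure", which is an interpretation rather than a measured Shark result.

```python
# Figures taken from the summary and the comment; nothing here is measured.
intel_sort_speedup = (4 * 60) / 7      # "four hour job reduced to seven minutes"
intel_hive_speedup = 8.5               # Intel's claimed speedup for Hive queries

# Does the sort claim fall in the "30-40x" ballpark mentioned in the comment?
in_claimed_range = 30 <= intel_sort_speedup <= 40

# Lower bound implied by "more than a full order of magnitude greater"
# (assumption: read as a factor of 10 over Intel's Hive claim).
shark_best_case_floor = intel_hive_speedup * 10

print(f"sort speedup: {intel_sort_speedup:.1f}x (in 30-40x range: {in_claimed_range})")
print(f"implied Shark best case: more than {shark_best_case_floor:.0f}x")
```

So the headline sort figure works out to about 34x, and the best-case Shark comparison implies speedups above 85x on that reading.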
