
Is Big Data Leaving Hadoop Behind?

knightsirius writes: Big Data was seen as one of the next big drivers of the computing economy, and Hadoop was seen as a key component of those plans. However, Hadoop has had a less than stellar six months, beginning with the lackluster Hortonworks IPO last December and the security concerns raised by some analysts. Another survey records only a quarter of big data decision makers actively considering Hadoop. With rival Apache Spark on the rise, is Hadoop being bypassed in big data solutions?
  • Nope. Not happening. (Score:5, Informative)

    by Art Popp ( 29075 ) * on Wednesday May 13, 2015 @06:33PM (#49685707)

    FTA: ...biggest problem is that people allegedly still can’t use Hadoop... Hadoop is still too expensive for firms...

    Hadoop is an ecosystem with lots of moving parts. Those are real problems above, but Spark (Particle) is not a stand-alone replacement for an ecosystem the size of Hadoop. Moreover, it has no problem integrating with Yarn on Hadoop where you can run Hbase, Cassandra, MongoDB, Rainstor, Flume, Storm, R, Mahout and plenty of other Yarn-compatible goodies (see the sketch below).

    It's also worth noting that Hortonworks and Cloudera may not be "taking off as hoped" because the branded big-iron players are finally in the ring. They hide the (rather hideous) complexity and integrate well with any existing systems you have with those vendors. Teradata, for instance, has a Hadoop/Aster integration that's impressive and turnkey. They bought Rainstor, and will soon have it integrated, and that's Spark-fast and hassle-free. IBM's BigInsights is very impressive if you have the means.

    So, no, Hadoop is in no danger of being replaced. The value proposition that my $4.2M cluster outperformed two $6M "big name" vendor-supported appliances is undeniable, but it's only that stark when your $'s have an M suffix. What will probably occur, though, is that we'll end up replacing every component in Hadoop with a faster one, and MapReduce will become a memory as things like Spark and Hive/Tez move away from that methodology.
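
    To make the Spark-on-YARN point above concrete, here is a minimal, purely illustrative PySpark sketch; the app name and HDFS paths are placeholders I made up, and the only assumption is a cluster where spark-submit is configured to talk to YARN.

    ```python
    # Illustrative only: a PySpark job meant to be submitted to a YARN cluster,
    # e.g.  spark-submit --master yarn --deploy-mode cluster wordcount.py
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("yarn-wordcount-sketch")
    sc = SparkContext(conf=conf)

    # Read straight out of HDFS, the same storage layer the rest of the
    # Hadoop ecosystem (Hive, HBase bulk loads, Flume sinks, ...) sits on.
    lines = sc.textFile("hdfs:///data/sample/input.txt")

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Write results back to HDFS rather than collecting them to the driver.
    counts.saveAsTextFile("hdfs:///data/sample/wordcount-output")

    sc.stop()
    ```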
         

    • Re: (Score:1, Funny)

      Yarn on Hadoop where you can run Hbase, Cassandra, MongoDB, Rainstor, Flume, Storm, R, Mahout and plenty of other Yarn-compatible goodies.

      It's also worth noting that Hortonworks and Cloudera

      I know R. My wife has a Yarn store. WTF are those other things?

      • I've heard of MongoDB. It's Web Scale!! [youtube.com]

      • Funny I was thinking they were all children's books. cloudera, horton hears a works, etc.

      • by sfcat ( 872532 )

        I know R. My wife has a Yarn store. WTF are those other things?

        It's a distributed exec for Java processes. That's really it. It has crappy monitoring built in that's unnecessary given SNMP, but they built it in anyway because... well, I don't know why.

    • I agree that the problem is that most companies don't know how to run it and it's left to bigger organizations that 1) have the expertise in house and 2) actually need the added complexity.
      Understanding which pieces of the ecosystem you need, and how to deploy and run them in a production environment, can be daunting, not to mention all the different possibilities of which cloud provider to use, which services, etc.

      Cloudera and Hortonworks are capitalizing on this by basically helping sort out this complexity.

      • by Rich0 ( 548339 ) on Wednesday May 13, 2015 @10:22PM (#49686773) Homepage

        I agree that the problem is that most companies don't know how to run it

        I think a bigger problem is that most companies don't even know what big data actually is. It is a big buzzword. I hear managers talking about it all the time. Half the time they're talking about some database table with a few hundred thousand records in it. Other times they're talking about some repository full of documents or binary files that might be terabytes in size, but it is just random stuff. They don't actually have questions in mind that they want to answer, and ultimately that is what tools like Hadoop are about.

        I've heard "big data" applied to problems that are basically just file shares or the like.

        Then if a company really does have a problem where Hadoop and such is useful, they want to buy some product off the shelf that solves that particular problem, and usually such products don't exist. Or they want to hire a bunch of random rent-a-coders and have them solve the problem, and they go about solving it with single-threaded solutions written in .net or whatever the commodity solution in use is at the company.

        Sure, your Facebooks and Googles and Netflixs and Amazons know what they're doing. Your average GE or Exxon or Pfizer generally doesn't do that level of comp sci.

        • by jbolden ( 176878 )

          You are overestimating the difficulty at this point. This is not compsci anymore and hasn't been for many, many years. It isn't even hard administration. It is probably easier to get a big data system running in 2015 than it was to use Oracle in 1995.

          As far as your examples go, you went way too big. GE is a huge DevOps shop, they know what Big Data is. Exxon has massive supercomputing datasets. I would bet they were doing big data long before it got cool. Pfizer has an IT department that is some of everythin

          • by Rich0 ( 548339 )

            You are overestimating the difficulty at this point. This is not compsci anymore and hasn't been for many, many years. It isn't even hard administration. It is probably easier to get a big data system running in 2015 than it was to use Oracle in 1995.

            I think you're misunderstanding my point.

            Sure, it is easy to install Hadoop, and run it.

            The hard part is figuring out WHAT to run on it.

            • by jbolden ( 176878 )

              That's easy, the big 5:

              1) Datasets too big to use an RDBMS
              2) 360 view of customers (CRM consolidation, sales systems consolidation...)
              3) Security data from network security devices.
              4) Stream in huge amounts of operational data (GPS on employees, physical sensors, machine health...) and do integrated data analysis
              5) data warehouse consolidation

    • So you are basically saying that Hadoop will eventually fall into disuse but HDFS (the Hadoop file system) will linger on with new platforms built on top of it? Or do you believe that HDFS will also be replaced eventually?

  • by Culture20 ( 968837 ) on Wednesday May 13, 2015 @06:37PM (#49685743)
    I thought Spark worked from within Hadoop. Is that like using emacs to run vi?
    • Re:Rival? (Score:5, Informative)

      by Anonymous Coward on Wednesday May 13, 2015 @06:47PM (#49685793)

      They need to refer to the pieces of hadoop. HDFS is the storage piece and many things can interface to it; it isn't great but is often good enough, especially if you just have a couple local disks per node. YARN is the scheduler piece; it is mostly awful performance-wise but is fairly easy to use... long run it'll lose to something like mesos I think. MR is the map reduce piece that everyone thinks of when you say hadoop. Almost everything will run quicker in spark (still using a map/reduce methodology) than hadoop MR.

      As a side note, I don't know anyone who still writes MR jobs directly; they are all doing Pig or HiveQL.
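
      For contrast, this is roughly what "writing an MR job directly" looks like, using Hadoop Streaming so it can be sketched in Python rather than the native Java API; the jar name, HDFS paths, and file name are illustrative, not from the parent post.

      ```python
      #!/usr/bin/env python
      # wordcount_streaming.py -- a hand-rolled MapReduce word count for Hadoop
      # Streaming, the kind of job the parent says nobody writes by hand anymore.
      # Illustrative usage (jar and HDFS paths are placeholders):
      #   hadoop jar hadoop-streaming.jar \
      #       -files wordcount_streaming.py \
      #       -mapper "wordcount_streaming.py map" \
      #       -reducer "wordcount_streaming.py reduce" \
      #       -input /data/in -output /data/out
      # Local smoke test:
      #   cat input.txt | ./wordcount_streaming.py map | sort | ./wordcount_streaming.py reduce
      import sys


      def mapper(stream):
          # Emit one tab-separated "word<TAB>1" pair per token.
          for line in stream:
              for word in line.split():
                  print("%s\t1" % word)


      def reducer(stream):
          # Hadoop sorts mapper output by key, so identical words arrive adjacent.
          current_word, current_count = None, 0
          for line in stream:
              word, count = line.rstrip("\n").split("\t", 1)
              if word == current_word:
                  current_count += int(count)
              else:
                  if current_word is not None:
                      print("%s\t%d" % (current_word, current_count))
                  current_word, current_count = word, int(count)
          if current_word is not None:
              print("%s\t%d" % (current_word, current_count))


      if __name__ == "__main__":
          if len(sys.argv) > 1 and sys.argv[1] == "reduce":
              reducer(sys.stdin)
          else:
              mapper(sys.stdin)
      ```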

      • Re:Rival? (Score:4, Interesting)

        by careysub ( 976506 ) on Wednesday May 13, 2015 @07:15PM (#49685951)

        They need to refer to the pieces of hadoop. HDFS is the storage piece and many things can interface to it; it isn't great but is often good enough, especially if you just have a couple local disks per node. YARN is the scheduler piece; it is mostly awful performance-wise but is fairly easy to use... long run it'll lose to something like mesos I think.

        That's a good call. With Cloudera and HortonWorks both adding new components to the Hadoop stack, the number of components has exploded in the last year or two, and that can be a bad thing. The complexity of the whole ecosystem is getting horrendous: in the last year a typical configuration file has doubled from 250 or so configuration items to 500, almost all of them undocumented (unless you read the code, which scarcely qualifies as "documented"). For a practical deployment you are pretty much forced to use a commercial stack to get something up and running in a manageable fashion. And then there is the fact that the HDFS foundation is showing its age.

        MR is the map reduce piece that everyone thinks of when you say hadoop. Almost everything will run quicker in spark (still using a map/reduce methodology) than hadoop MR.

        Spark on Mesos is looking mighty awesome.

        As a side note, I don't know anyone who still writes MR jobs directly; they are all doing Pig or HiveQL.

        MapReduce is still viable for stable production jobs, but not in a dynamic requirements environment.

        Although HiveQL is alive and kicking, the complete replacement of Hive Server with Hive Server 2, while possibly an improvement in usability overall (I am not convinced), trashes your skill investment in the (now) obsolete Hive stack component. Maybe I am just grousing, but I start having reservations about technology planning in the data center when a key stack component changes so much in a relatively short period of time.

        • You are absolutely right about the complexity of the ecosystem, but in my experience every Java-based platform eventually evolves such complexity (it is like an XML fetish).

  • Is this a question for Hadoop employees or slashdot? If there's something better, why does it matter to anyone other than the company developing Hadoop if it's relevant?
    • by jbolden ( 176878 )

      Hadoop is open source. The companies building it are LinkedIn, Yahoo, Facebook and then the Hadoop vendors: Hortonworks (tightly tied to Microsoft), IBM, Cloudera (enterprise support vendor)...

    • by Ksevio ( 865461 )
      Hadoop is open source software so it's more significant if it's in decline than a closed commercial alternative.
  • by Luthair ( 847766 ) on Wednesday May 13, 2015 @06:56PM (#49685845)

    Is security really that big of a deal? Isn't the intent to run it on a private network to crunch numbers behind the scenes?

    We don't ask about the susceptibility of safety deposit boxes to crowbars and dynamite, they're inside a vault.

  • Did I trip into a time warp and come out a decade in the past?
    Who the fuck is actually talking about hadoop or map reduce in 2015? The same retards that were creaming their little cunts about it in 2005?

    Even when you ignore the joke that is Java, hadoop is unwieldy, unreliable shit if you actually care about storing and retrieving correct, synchronized data.
    If you're fine with throwing all of your data in a pot and getting some sort of result that looks mostly correct, then knock yourself out and use hadoop.

    • by Tablizer ( 95088 )

      If your data needs to be correct, define it and its relationships then use SQL. You will have to pay someone decent money to do this correctly.

      PHB's have to learn the hard way. They want it cheap, big, and now. Security & reliability issues are something they try to blame on somebody else using their well-honed spin skills.

    • by jbolden ( 176878 ) on Wednesday May 13, 2015 @11:03PM (#49686953) Homepage

      Hadoop didn't exist in 2005. The 1.0 release was December 2011; the earliest versions I know of were floating around in 2007.

      As for using SQL, Hadoop supports SQL (mostly). The problem Hadoop is built for is data sets that are too big for RDBMS engines to handle. It has nothing to do with developer skill; it has to do with the type of database engine and how the data is being handled.

      • Re: (Score:2, Funny)

        by Anonymous Coward

        Hadoop didn't exist in 2005.

        Unless you work in recruitment.

      • Hadoop was created in 2005 and named after a toy elephant. It was an open source implementation of some shit Google wrote some papers on.
        The "Apache Hadoop" branded package hit RTM in 2011. Apache only got involved because of all the retards mindlessly jumping onto it. Those retards jumped onto it because they were told it was based on Google's work.

        As for datasets being too big for RDBMS engines to handle, WTF are you talking about? MS SQL can handle all the data you throw at it and has complete cluste

        • Something fitting within the maximum supported size of a database does not mean that the performance of data manipulation in that database will meet the business criteria within the available budget.

    • Did I trip into a time warp and come out a decade in the past?
      Who the fuck is actually talking about hadoop or map reduce in 2015? The same retards that were creaming their little cunts about it in 2005?

      Even when you ignore the joke that is Java, hadoop is unwieldy, unreliable shit if you actually care about storing and retrieving correct, synchronized data.
      If you're fine with throwing all of your data in a pot and getting some sort of result that looks mostly correct, then knock yourself out and use hadoop.

      If your data needs to be correct, define it and its relationships then use SQL. You will have to pay someone decent money to do this correctly.

      None of these complaints seem to keep people from using Splunk.... unstructured data soup isn't going anywhere at any scale, we'll just call it different things.
      I can't even fathom a world where all the data we analyze in Splunk could have been fed into Oracle and turned into usable reports. All of our users would have to be Oracle DBAs.

    • by Anonymous Coward

      +2 Interesting? More like -5 Ignorant.

      RDBMSs are not a workable solution for the kinds of problems Big Data is trying to solve. You need something else. There is no such thing as a "simple" Big Data solution.

      The Java-based Big Data solutions are really the only ones that exist in the world, other than those that were developed in-house years ago by companies who had to deal with huge-scale problems in the past.

      So if your solution for Big Data is Oracle (RDBMS), you don't belong in this conversation.

      • If you're using a term like "Big Data", you don't belong in the fucking building.
        Relational databases are perfectly suited to extremely large and complex datasets. You just have to intelligently design your database. You can't just throw noise into a pot and expect useful results. Hadoop (map reduce) tries to do exactly this. If you care about correctness, completeness, and synchronization of data, it's trash.

  • by sfcat ( 872532 ) on Wednesday May 13, 2015 @09:56PM (#49686633)
    The problem with "big data" is that there are no vendor specs and the implementations are sometimes questionable. There is a provider that does a better job, SQLStream (http://www.sqlstream.com), which has a streaming DB controlled via SQL. In addition to normal tables, you have streams, which are relationally typed conduits through which data flows, and windows, which are time (and row) based groups of tuples that can be used in agg queries with all the standard SQL functions (there's also Java UDXes and MED support). Designing your middleware on top of a SQL engine is a much better design pattern than doing it all with hand-wired Java. All this and about 100x the throughput of a Hadoop program. Disclaimer: I'm an engineer at SQLStream.
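
    For readers unfamiliar with the streams-and-windows model, here is a rough conceptual sketch of a time-based window feeding an aggregate, written in plain Python purely for illustration; it is not SQLStream's actual API, and the tick fields are made up.

    ```python
    # Conceptual sketch only: a tumbling one-minute window over a stream of
    # (timestamp, symbol, price) tuples, computing an average price per symbol.
    # In a streaming SQL engine this would be a windowed GROUP BY over a stream;
    # here the idea is spelled out by hand.
    from collections import defaultdict

    WINDOW_SECONDS = 60


    def windowed_averages(events):
        """events: iterable of (epoch_seconds, symbol, price), in arrival order."""
        window_start = None
        sums = defaultdict(lambda: [0.0, 0])  # symbol -> [price_sum, count]

        for ts, symbol, price in events:
            if window_start is None:
                window_start = ts
            # When an event falls outside the current window, emit the window's
            # aggregates and start a new one (a "tumbling" window).
            if ts - window_start >= WINDOW_SECONDS:
                yield window_start, {s: total / n for s, (total, n) in sums.items()}
                window_start, sums = ts, defaultdict(lambda: [0.0, 0])
            sums[symbol][0] += price
            sums[symbol][1] += 1

        if sums:
            yield window_start, {s: total / n for s, (total, n) in sums.items()}


    # Two windows' worth of synthetic ticks.
    ticks = [(0, "ACME", 10.0), (5, "ACME", 12.0), (30, "XYZ", 3.0),
             (65, "ACME", 11.0), (80, "XYZ", 4.0)]
    for start, averages in windowed_averages(ticks):
        print(start, averages)
    ```
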
    • I read your post but I still have no idea what your 'streams' are, or why anyone would want to use them.
  • by rockmuelle ( 575982 ) on Wednesday May 13, 2015 @09:59PM (#49686651)

    A scripting language with a good math/stats library (e.g., NumPy/Pandas) and a decent RAID controller are all most people really need for most "big data" applications. If you need to scale a bit, add a few nodes (and put some RAM in them) and a job scheduler into the mix, and learn some basic data decomposition methods. Most big data analyses are embarrassingly parallel. If you really need 100+ TB of disk, set up Lustre or GPFS. Invest in some DDN storage (it's cheaper and faster than the HDFS system you'll build for Hadoop).

    Here's the break down of that claim in more computer sciencey terms: Almost all big data problems are simple counting problems with some stats thrown in. For more advanced clustering tasks, most math libraries have everything you need. Most "big data" sizes are under a few TB of data. Most big data problems are also I/O bound. Single nodes are actually pretty powerful and fast these days. 24 cores, 128 GB RAM, 15 TB of disk behind a RAID controller that can give you 400 MB/s data rates will cost you just barely 5 figures. This single node will outperform a standard 8 node Hadoop cluster. Why? Because the local, high density disks that HDFS encourages are slow as molasses (30 MB/s). And...

    Hadoop has a huge abstraction penalty for each record access. If you're doing minimal computation for each record, the cost of delivering the record dominates your runtime. In Hadoop, the cost is fairly high. If you're using a scripting language and reading right off the file system, your cost for each record is low. I've found Hadoop record access times to be about 20x slower than Python line read times from a text file, using the _same_ file system for Hadoop and Python (of course, Hadoop puts HDFS on top of it). In Big-O terms, the 'c' we usually leave out actually matters here - O(1*n) vs. O(20*n). 1 hour or 20 hours, you pick.
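
    As a concrete, purely illustrative version of the "scripting language reading right off the file system" approach described above, here is a sketch of a chunked Pandas counting job; the file name and column names are invented.

    ```python
    # A single-node "big data" counting job: stream a large CSV off the local
    # file system in chunks and aggregate counts, no cluster required.
    # events.csv and its columns (user_id, action) are placeholders.
    import pandas as pd

    counts = None
    for chunk in pd.read_csv("events.csv", chunksize=1000000):
        # Count actions per user within this chunk...
        partial = chunk.groupby(["user_id", "action"]).size()
        # ...and fold it into the running total (the classic map/reduce shape).
        counts = partial if counts is None else counts.add(partial, fill_value=0)

    # The sequential scan dominates the runtime; the arithmetic is trivial.
    print(counts.sort_values(ascending=False).head(20))
    ```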

    If you're really doing big data stuff, it helps to understand how data moves through your algorithms and architect things accordingly. Almost always, a few minutes of big-O thinking and some basic knowledge of your hardware will give you an approach that doesn't require Hadoop.

    tl;dr: Hadoop and Spark give people the illusion that their problems are bigger than they actually are. Simply understanding your data flow and algorithms can save you the hassle of using either.

    -Chris

    • by sfcat ( 872532 )

      Here's the break down of that claim in more computer sciencey terms: Almost all big data problems are simple counting problems with some stats thrown in. For more advanced clustering tasks, most math libraries have everything you need. Most "big data" sizes are under a few TB of data. Most big data problems are also I/O bound. Single nodes are actually pretty powerful and fast these days. 24 cores, 128 GB RAM, 15 TB of disk behind a RAID controller that can give you 400 MB/s data rates will cost you just barely 5 figures. This single node will outperform a standard 8 node Hadoop cluster. Why? Because the local, high density disks that HDFS encourages are slow as molasses (30 MB/s). And...

      Hadoop has a huge abstraction penalty for each record access. If you're doing minimal computation for each record, the cost of delivering the record dominates your runtime. In Hadoop, the cost is fairly high. If you're using a scripting language and reading right off the file system, your cost for each record is low. I've found Hadoop record access times to be about 20x slower than Python line read times from a text file, using the _same_ file system for Hadoop and Python (of course, Hadoop puts HDFS on top of it). In Big-O terms, the 'c' we usually leave out actually matters here - O(1*n) vs. O(20*n). 1 hour or 20 hours, you pick.

      Optimization is usually about creating a small inner loop at the expense of setup cost. You can see this in compilers/languages (creating an optimized binary vs a script interpreter), in databases (prepare vs execute), and in these types of big data systems. Hadoop can't and doesn't optimize its inner loop very well at all due to its basic programming interface. It stores each row in an array of Java objects. A better design would process buffers of data with non-copying access libraries to hide this ab
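
      As a quick illustrative aside (array size and timings arbitrary, nothing to do with Hadoop's actual internals), here is the inner-loop point in miniature: per-record Python object handling versus one vectorized call over the whole buffer.

      ```python
      # Same computation two ways: touching each record as a separate Python
      # object vs. one library call over the underlying buffer. The per-record
      # constant factor is what differs.
      import time

      import numpy as np

      values = np.random.rand(5000000)

      # Per-record loop: every element becomes a Python float, akin to a
      # framework handing your code one deserialized record object at a time.
      start = time.time()
      total = 0.0
      for v in values:
          total += v * v
      print("per-record loop: %.2fs" % (time.time() - start))

      # Buffered/vectorized: one call over the whole array, no per-record churn.
      start = time.time()
      total = float(np.dot(values, values))
      print("vectorized:      %.2fs" % (time.time() - start))
      ```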

  • How long has BSD been dying now? It's still around and making regular releases [openbsd.org]. For open source projects, popularity contests are much less important. With a massive existing user base, Hadoop will be actively maintained for a long time. So if you're already familiar with it and it serves the needs of your project, go right ahead.

  • by Required Snark ( 1702878 ) on Thursday May 14, 2015 @12:09AM (#49687219)
    Both Pointy Headed Bosses and Slashdot loooove talking about tools. As the posts generally show, both PHBs and Slashdotters have no clue about what Big Data is used for. It's all about the buzzwords and technology, not about use and utility.

    There are no references to any algorithms. Rank ordering? Nope. Social graph analytics? No. Netflix style recommendations? Uh-uh. Statistics? None.

    Without talking about data sets, algorithms and expected results, yammering about tools is meaningless. Hot air.

    But who cares, because you all get to call each other stupid, and try and prove that you are the biggest baddest tech weenie on the block. From here it seems that you don't even know where the block is. You don't even seem to know which direction you need to go to get to a street. (Like the implied car reference there?)

    I'm beyond unimpressed. It's obvious that no one has a clue what they are talking about. Go off and learn something, and then maybe you will be able to write a post that isn't a waste of time. Other than that, STFU and get off my lawn.

    • I agree. There is a distinct lack of discussion outlining where Hadoop shines versus an RDBMS and these other tools. I did some reading and it seems like a database system does better with data that is organized and has distinct relationships between data sets. Hadoop and parallel processing seem to work better for data that is highly unstructured and for which you need to delve deeply to find relationships and create ad hoc reports.

      Some have mentioned that one of the reasons for interest in Hadoops

      • Actually, the biggest problem with RDBMS and similar tools is the fact that you are expected to mutate data in place, and mash it into a structure that is optimized for this case. Most of the zoo of new tools are about supporting a world in which incoming writes are "facts" (ie: append-only, uncleaned, unprocessed, and never deleted), while all reads are transient "views" (from combinations of batch jobs and real-time event processing) that can be automatically recomputed (like database indexes).
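
        A toy sketch of that "append-only facts, recomputed views" idea in plain Python; the event fields are invented, and a real system would use a durable log plus batch/stream layers rather than an in-memory list.

        ```python
        # Append-only "facts" with derived, recomputable "views".
        # Nothing is ever updated in place; views are just functions of the log.
        facts = []  # in a real system: a Kafka topic, HDFS files, etc.


        def record_fact(event):
            """Append an immutable event; no cleaning, no mutation, no deletes."""
            facts.append(event)


        def view_page_counts(log):
            """A derived view: pageviews per URL, recomputed from the raw log."""
            counts = {}
            for event in log:
                if event["type"] == "pageview":
                    counts[event["url"]] = counts.get(event["url"], 0) + 1
            return counts


        record_fact({"type": "pageview", "url": "/home", "user": "a"})
        record_fact({"type": "pageview", "url": "/home", "user": "b"})
        record_fact({"type": "click", "url": "/home", "user": "a"})

        # The view can be thrown away and rebuilt at any time, like an index.
        print(view_page_counts(facts))  # {'/home': 2}
        ```
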
    • by Bob9113 ( 14996 )

      Both Pointy Headed Bosses and Slashdot loooove talking about tools. As the posts generally show, both PHBs and Slashdotters have no clue about what Big Data is used for. It's all about the buzzwords and technology, not about use and utility. There are no references to any algorithms.

      Heh. I've been doing big data since 2000. Fifteen years experience in a field that's five years old, I like to say. And let me say this: You nailed it. Your whole post, not just the part I quoted. I've used the tools, from Colt t

    • by Schnee ( 743890 )
      +1. Without analysis, big data is just a bunch of data
    • Except, if you are talking about a centralized database tool, you already know that the default design of "everybody write into the centralized SQL database" is a problem. Therefore, people talk about alternative tools; which are generally designed around a set of data structures and algorithms as the default cases. A lot of streaming based applications (ie: log aggregation) are a reasonable fit for relational databases except for the one gigantic table that is effectively a huge (replicated, distributed
  • From 2010 to early this year I was responsible for Big Data technical marketing at Microsoft; I recently joined AWS. I won't comment on any of the specifics for my current or former employer, but it's a fact that other nosql technologies have a higher adoption rate. It's clear that the traditional data warehouse had limitations, and that hadoop is not replacing the EDW. The largest companies are using proprietary technologies, not adopting hadoop. Hadoop 2.0 is much better, you should use it if you have the
  • Meaning the hype around big data has settled and it's back to business. I'd say there are less than 10 companies worldwide for whom big data actually might make sense. Others clean and aggregate their data in such a way that it's actually useful. ... I don't want my bank guessing my balance with big data statistics, I want them to know it. And so do most other people.

  • Betteridge's law of headlines finally proven wrong?

  • Has anyone considered Joyent's Manta [joyent.com] ?

    This is a distributed object storage with integrated compute.
    Data is stored on a cluster of SmartOS hosts.
    And processed directly on each host inside an OS container (a SmartOS zone), no data movement.

    Lots of APIs available: R, command-line, Python, Ruby, node.js, etc.

    Available on their cloud and as an on-premises commercial product, open-sourced [github.com] last November (simultaneously with SmartDataCenter).
