



Is It Time For NoSQL 2.0?
New submitter rescrv writes "Key-value stores (like Cassandra, Redis and DynamoDB) have been replacing traditional databases in many demanding web applications (e.g. Twitter, Google, Facebook, LinkedIn, and others). But for the most part, the differences between existing NoSQL systems come down to the choice of well-studied implementation techniques; in particular, they all provide a similar API that achieves high performance and scalability by limiting applications to simple operations like GET and PUT.
HyperDex, a new key-value store developed at Cornell, stands out in the NoSQL spectrum with its unique design. HyperDex employs a multi-dimensional hash function to enable efficient search operations; that is, objects may be retrieved without using the key (PDF) under which they are stored. Other systems employ indexing techniques to enable search, or enumerate all objects in the system. In contrast, HyperDex's design enables applications to retrieve search results directly from servers in the system. The results are impressive. Preliminary benchmark results on the project website show that HyperDex provides significant performance improvements over Cassandra and MongoDB. With its unique design and impressive performance, it seems fitting to ask: Is HyperDex the start of NoSQL 2.0?"
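A rough sketch of how a multi-dimensional hash could support search on non-key attributes. This is not HyperDex's actual implementation: each attribute is hashed to one coordinate, the tuple of coordinates names a region (which a real system would assign to a server), and a search that fixes only some attributes needs to visit only the regions consistent with those coordinates. The schema, the grid size `CELLS_PER_AXIS`, and all function names are assumptions made for illustration.

```python
import hashlib
from itertools import product

ATTRIBUTES = ("first", "last", "phone")  # hypothetical schema
CELLS_PER_AXIS = 4                        # hypothetical grid resolution


def axis_cell(value: str) -> int:
    """Hash one attribute value to a cell index along its axis."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % CELLS_PER_AXIS


def region_of(obj: dict) -> tuple:
    """Map a complete object to its region: one coordinate per attribute."""
    return tuple(axis_cell(obj[name]) for name in ATTRIBUTES)


def regions_for_search(partial: dict) -> list:
    """List every region that could hold an object matching `partial`."""
    axes = []
    for name in ATTRIBUTES:
        if name in partial:
            # A specified attribute pins this axis to a single cell.
            axes.append([axis_cell(partial[name])])
        else:
            # An unspecified attribute leaves the whole axis open.
            axes.append(range(CELLS_PER_AXIS))
    return list(product(*axes))
```

With three attributes and four cells per axis there are 64 regions in total, but a search on `last` alone has to consider only 16 of them, and a search that fixes all three attributes considers exactly one.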
Re:wake me in a few years (Score:5, Informative)
Decisions based on cute animals and straw-man arguments without any facts... You must be a manager!
Locality-sensitive hashing (Score:5, Informative)
This is a type of index, not a type of database. See locality-sensitive hashing. [wikipedia.org] It's an efficient way to find keys which are "near" the search key in some sense.
Such a mechanism could be provided in a key/value store or an SQL database. It's even possible to do it on top of an SQL database. [compgeom.com] It's more powerful in a database that can do joins, because you can ask questions with several approximate keys.
This is an area of active research. Many machine-learning algorithms are scaled up by locality-sensitive hashing, so they can work on big data.
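For a concrete picture, here is a minimal sketch of one well-known locality-sensitive hashing scheme, random-hyperplane hashing for cosine similarity. It is a swapped-in illustration, not the scheme HyperDex uses. Vectors that point in similar directions tend to share a signature and therefore a bucket. The function names and parameters are assumptions chosen for this example.

```python
import random


def make_hyperplanes(dim: int, bits: int, seed: int = 0) -> list:
    """Draw `bits` random hyperplanes (as normal vectors) in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def lsh_signature(vector, hyperplanes) -> tuple:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_buckets(vectors: dict, hyperplanes) -> dict:
    """Group keys by signature; near-duplicates tend to collide."""
    buckets = {}
    for key, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(key)
    return buckets
```

A lookup computes the signature of the query vector and compares it only against the keys in the matching bucket, instead of scanning everything. More bits make buckets smaller and more selective, at the cost of missing some near neighbours.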
Re:Berkeley DB? (Score:5, Informative)
Re:Keys and values? (Score:4, Informative)
Isn't that what XML is for? XML files are also compatible across systems.
XML is more useful for transferring data between systems. For storing data it kind of sucks, since there are no indexes (not the kind we need for fast lookups anyway) and it's extremely verbose.
Re:Why not both? (Score:5, Informative)
NoSQL is a terrible misnomer, in that the difference is far more than just "doesn't use SQL", and there are NoSQL systems that do actually support SQL. It's really just referring to data storage systems that aren't based on relations. That change in paradigm has its advantages (speed (in some cases), scalability, and flexibility) and disadvantages (speed (in some cases), lack of consistency, less restriction on bad programming). Of course, each NoSQL system tries to mitigate the disadvantages, and each RDBMS tries to prove itself better than all of NoSQL's advantages. It's a big fun party involving lots of mud-slinging.
Most NoSQL systems I've worked with are distributed hash tables, in a basic sense. Each value has a key, and that key determines where it's stored on a cluster. Values are not tied to any other values, so things like "foreign-key relations" are silly in a discussion of NoSQL. Rather, the algorithm to retrieve the data does all of the processing to connect data, using massive parallelization across a cluster to handle huge amounts of data at once.
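A toy sketch of "the key determines where it's stored", using in-process dictionaries to stand in for cluster nodes. The class name `TinyCluster` is hypothetical. Plain modulo placement is a simplifying assumption: it reshuffles most keys whenever the node count changes, which is why real systems usually prefer something like consistent hashing.

```python
import hashlib


class TinyCluster:
    """Each 'node' is just a dict; the key's hash picks the node."""

    def __init__(self, node_count: int):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key: str) -> int:
        # A stable hash, so every client computes the same placement.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key: str, value) -> None:
        self.nodes[self._node_for(key)][key] = value

    def get(self, key: str, default=None):
        return self.nodes[self._node_for(key)].get(key, default)
```

Note that `get` never has to ask more than one node, and that nothing in the store knows how one value relates to another. Any such connection lives in the application code.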
With a traditional RDBMS, the application must fit its data to the schema completely before any data can be stored. This, of course, means that all data in the database can be assumed to be complete. You won't find references that don't exist, which makes queries straightforward.
With NoSQL, the database is treated as a more flexible bucket. Data is dumped in with a key, with little concern for fitting the design of the application's model. This, of course, means a bit more planning at design time, but the data can be arranged to better fit whatever it actually represents. Some details are present, and some aren't, but that's okay. The retrieval algorithm (typically a MapReduce program) should check for the existence of whatever data it needs, and handle errors accordingly. Those MapReduce programs are far more complicated than a simple SQL query, but the database's backend is conceptually simpler as an abstract key/value store. Key/value stores have been around for decades, and studied extensively. They can be made more fault-tolerant and scalable than RDBMS shards, but lack the support for large set-based comparisons.
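To illustrate the point about retrieval code checking for missing data, here is a single-process sketch of a MapReduce-style job. A real framework would spread the map and reduce phases across a cluster. The record layout, the `city` field, and every function name are assumptions for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every (key, value) record, yielding (k, v) pairs."""
    for key, value in records:
        yield from mapper(key, value)


def reduce_phase(pairs, reducer):
    """Group the mapped pairs by key, then reduce each group to one result."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return {k: reducer(k, vs) for k, vs in grouped.items()}


def city_mapper(key, value):
    # Schemaless data: "city" may be absent, so check before using it.
    city = value.get("city")
    if city is not None:
        yield city, 1


def count_reducer(key, values):
    return sum(values)


def users_per_city(records):
    return reduce_phase(map_phase(records, city_mapper), count_reducer)
```

The SQL equivalent would be a one-line `GROUP BY`, which is the trade-off described above: the query logic gets longer, while the storage layer stays a plain key/value store.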
The comparison to the BASIC-vs-C battles is appropriate. Both BASIC and C serve their purposes well (education and system programming, respectively), but neither should be used where the other is better suited. NoSQL and RDBMSs also both have their places.
Re:Why not both? (Score:4, Informative)
Most NoSQL databases have indexes, and the indexes can be searched to find the key(s) you need. As an example, straight from the MongoDB examples:
In other words "find all the objects which do not contain 'red' in the field 'colors'". The ObjectId that is returned happens to be the key.
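A minimal in-memory sketch of that kind of query in plain Python, so it runs without a MongoDB server. The sample data layout and the helper `find_without_color` are hypothetical. In MongoDB itself, a filter of roughly the form `{"colors": {"$ne": "red"}}` expresses the same condition.

```python
def find_without_color(collection: dict, color: str) -> list:
    """Return the keys of documents whose 'colors' field does not contain `color`.

    Documents with no 'colors' field at all also match, since they
    cannot contain the color.
    """
    return [
        key
        for key, doc in collection.items()
        if color not in doc.get("colors", [])
    ]
```

This version scans every document. An index on `colors` is what lets a database answer the same question without the full scan.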
Re:Why not both? (Score:5, Informative)
NoSQL 1.0 is usually not much more than a hash-accessed flat-table database. GDBM, QDBM and BerkeleyDB are all hash-accessed flat-table databases. The refinements mentioned as being added to NoSQL databases (such as searchable indexes) are simply sequential indexes that associate some indexed parameters with the hash value.
NoSQL generally works by you pushing an item into the database and getting one or more hash values back. You want the item back, you give the database the hash values and you get the item. Object-oriented and object-based NoSQL both work by allowing objects to point to other objects. This gives you inheritance. (Basically you have a hash value that points to another record, where the structure of that other record is fixed rather than chosen at run-time via a join statement.)
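A small sketch of the push-and-get-a-hash-back pattern, with one object pointing at another through a stored hash value. Deriving the handle from the item's content is an assumption made to keep the example short, since stores differ in how they generate keys. `HashStore` and `engine_ref` are hypothetical names.

```python
import hashlib
import json


class HashStore:
    """Push an item in, get a hash back; present the hash to get the item."""

    def __init__(self):
        self._items = {}

    def push(self, item: dict) -> str:
        # Serialize deterministically so equal items get equal handles.
        payload = json.dumps(item, sort_keys=True).encode("utf-8")
        handle = hashlib.sha256(payload).hexdigest()
        self._items[handle] = item
        return handle

    def fetch(self, handle: str) -> dict:
        return self._items[handle]
```

Storing `{"name": "car", "engine_ref": engine_handle}` fixes the link between the two records at write time. Following it is a second `fetch`, not a join decided at query time.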
Basically, database theory describes all the various forms of database you can have: flat-file, hierarchical, network, relational, object/relational, semi-structured, associative, entity-attribute-value, transactional and star (aka data warehouse). A description of some of these can be found here [unixspace.com].
This describes how the data is actually laid out, but does NOT necessarily describe how the data is accessed.
Database theory also describes the following underlying methods of accessing data: sequential, indexed, hash. Any combination of these is permitted, so you can have an index that points into sections of a database that are then searched sequentially for example. Or you can have indexes that point to other indexes that in turn point to a hash value. And so on.
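As one example of combining access methods, here is a sketch of an index that points into sections of sorted data, each of which is then searched sequentially. The class name, the block size, and the `(key, value)` record layout are assumptions for illustration.

```python
from bisect import bisect_right


class SparseIndexedFile:
    """Sorted records split into blocks, with an index of each block's first key."""

    def __init__(self, sorted_records, block_size: int = 4):
        # sorted_records: list of (key, value) tuples, already sorted by key.
        self.blocks = [
            sorted_records[i:i + block_size]
            for i in range(0, len(sorted_records), block_size)
        ]
        self.first_keys = [block[0][0] for block in self.blocks]

    def lookup(self, key):
        # Indexed step: find the last block whose first key is <= key.
        pos = bisect_right(self.first_keys, key) - 1
        if pos < 0:
            return None
        # Sequential step: scan only that one block.
        for k, v in self.blocks[pos]:
            if k == key:
                return v
        return None
```

The index holds one entry per block rather than one per record, so it stays small, and the sequential scan is bounded by the block size.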
SQL is just a meta-language that allows you to apply a restricted form of set theory on the underlying access methods. There were arguments at the time SQL appeared that it should allow all of set theory - and those arguments still go on, with some SQL alternatives using actual set theory notation as opposed to SQL notation.
NoSQL, in some cases, is just direct access to hash tables for directly accessing items. In other cases, it's a lightweight abstraction layer.
In the example advertised in the summary, an object is referenced through a set of indexes. If you have a partial set of indexes, you reference multiple objects but they will be related in some way. There is nothing X.0 about it, it's just a NoSQL database that uses a network database topology rather than a flat-file topology. It is nothing new.
I recognize that marketspeak is what sells things, that calling the systems by what they actually are would not be nearly as impressive to managers. Managers do not, as a rule, read Slashdot. Geeks and Nerds read Slashdot. Geeks and Nerds know Database Theory. (Well, if they don't, they damn well should -- either that, or they can use Google to look the terms up.) The two additions to database theory in the past 30 years have been the Object-Relational and Object-Oriented models.
Re:Why not both? (Score:5, Informative)
A hierarchy IS a relationship. In hierarchical databases, child segments and parent segments were the main kind of relationship used.
All relational databases did was allow the relationships to be more freely defined.
Further to that, a key / value pair is also a relationship, in that the key symbolically represents the data. That's why it is correct to call them NoSQL databases: They forgo the complexity of a general query language. In doing so, they also lose the ability to inherently store anything except the most basic relationship: the key / value lookup.
Re:Wow! That's some neat Progress! (Score:5, Informative)
300 - 400%? Lol you're doing it wrong. Billions of rows? So what? Easily handled by SQL.
CERN has a database with trillions of rows in a traditional Oracle RDBMS. I saw a presentation on it at Oracle OpenWorld this year by a guy from CERN.
Yahoo also has trillion-row databases [computerworld.com], on PostgreSQL.