
 



Data Storage Technology

World's Largest Databases Ranked 356

prostoalex writes "Winter Corp. has summarized its findings of the annual TopTen competition, where the world's largest and most hard-working (in terms of load) databases are ranked. The results are in, and this year the contestants were ranked on size, data volume, number of rows and peak workload. I wrote up a brief summary of the top three winners in each category for those too lazy to browse the interactive WinterCorp chart."
This discussion has been archived. No new comments can be posted.

  • Re:No, it's 30,000GB (Score:4, Informative)

    by Cutie Pi ( 588366 ) on Friday December 12, 2003 @09:13AM (#7699710)
    You're off by 3 orders of magnitude. The largest is 30TB.
  • Re:Google (Score:5, Informative)

    by tinrib ( 632120 ) <david.rainsford@org> on Friday December 12, 2003 @09:14AM (#7699717)
    Doesn't Google use 'big files' rather than a database for storing all its data?

    See http://www.cs.rochester.edu/sosp2003/papers/p125-ghemawat.pdf [rochester.edu], which describes the Google filesystem.
  • Re:Google (Score:5, Informative)

    by lewp ( 95638 ) on Friday December 12, 2003 @09:21AM (#7699767) Journal
    Even if Google qualified, which it probably doesn't due to the methods it uses for its data storage, if I read the article properly the database vendors are responsible for naming the participants.

    Since Google's stuff seems to be developed in-house, they don't have a major database vendor to nominate them.
  • Re:Hang on ... (Score:2, Informative)

    by Mr. Dop ( 708162 ) on Friday December 12, 2003 @09:22AM (#7699778)
    Nope, you don't even qualify:

    In order to qualify for the TopTen program consideration, any commercial production database implementation was required to feature a minimum of 500GB of data for Microsoft Corp.'s Windows and NT platforms and 1TB of data for all other platforms.

  • by Peridriga ( 308995 ) on Friday December 12, 2003 @09:22AM (#7699779)
    Well... if you actually read the article it clearly states that 29.2 is not the largest...

    You can find the link to the article yourself but

    1. AT&T @ 94.3TB
    2. Amazon @ 34.2TB
  • Re:Google (Score:5, Informative)

    by stripmarkup ( 629598 ) on Friday December 12, 2003 @09:25AM (#7699806) Homepage
    It seems that they are comparing relational databases. Search engines use proprietary databases which, among other things, do not allow for live insertion of records, SQL commands, etc. As for data volume, Google (or Yahoo or MSN, for that matter) is probably in the ballpark. The average HTML page is around 10k. Google probably stores at least 10^9 raw web pages in its cache (that's 10 TB alone) plus a lot of meta information about links to and from many others.
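    The 10 TB figure above is simple multiplication. A minimal sketch of it, assuming "10k" means 10,000 bytes and a terabyte is 10^12 bytes; both inputs are the comment's guesses, not measured values:

```python
# Back-of-envelope estimate of raw cached page volume.
# Both inputs are guesses taken from the comment, not measurements.
AVG_PAGE_BYTES = 10 * 1000   # "around 10k" per HTML page (decimal kB assumed)
PAGE_COUNT = 10**9           # "at least 10^9" raw pages in the cache


def raw_cache_terabytes(page_count: int, avg_page_bytes: int) -> float:
    """Return total size in decimal terabytes (10^12 bytes)."""
    return page_count * avg_page_bytes / 10**12


print(raw_cache_terabytes(PAGE_COUNT, AVG_PAGE_BYTES))  # 10.0
```

    Link metadata would come on top of this, so the estimate is a floor rather than a total.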
  • by MS ( 18681 ) on Friday December 12, 2003 @09:27AM (#7699822)
    Read all, to get the facts:

    Lastly, in the Windows OLTP category HP servers were used by 7 of 10 organizations, and Microsoft SQL Server was the DBMS choice for seven respondents.

    Neither Windows NT nor MS SQL Server is generally a choice for the top databases. In fact, to make it onto this list, a Windows database only had to be half as big as databases on other platforms:

    In order to qualify for the TopTen program consideration, any commercial production database implementation was required to feature a minimum of 500 GB of data for Microsoft Corp.'s Windows and NT platforms and 1 TB of data for all other platforms

    :-)
    ms

  • by sql*kitten ( 1359 ) * on Friday December 12, 2003 @09:27AM (#7699828)
    I have none, nada, zip experience in big databases.

    S'okay, I have plenty :-)

    But it surprised me that the peak workloads were measured in 100s of concurrent queries. If I had to make a wild guess, I would have guessed 10s of thousands. My blessed ignorance destroyed.

    You would typically see tens of thousands (or more) of concurrent connections to a middleware layer - like Tuxedo - which would then multiplex them down to hundreds of connections to the database. This is because there is a lot of latency in establishing a connection; in fact, logging in often takes an order of magnitude longer than running an actual query, yet few users submit transactions nonstop. So there is no sense in maintaining tens of thousands of expensive user contexts on the DB server, and there is no sense in requiring intermittent (relatively speaking) users to log out after a short idle period. Middleware does nothing but manage concurrent user contexts, and it can do so very efficiently. A database can't, because it tries to preallocate as much context as it can, and that doesn't match real-world usage patterns. Anyway, database vendors concentrate on their SQL engines and leave middleware vendors to manage the rest.

    Of course, if you are a big database vendor, you probably also sell middleware, but there's no-one who tries to bundle the two into one, any more than you'd want a web server to have its own filesystem.
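    The multiplexing idea can be sketched in a few lines. This is a toy illustration only, not how Tuxedo or any real middleware is implemented: `FakeConnection`, `Multiplexer` and `simulate` are hypothetical names, and the "database" is a stand-in object.

```python
import queue
import threading


class FakeConnection:
    """Stand-in for an expensive, already-logged-in database session."""

    def __init__(self, conn_id: int) -> None:
        self.conn_id = conn_id
        self.queries_run = 0

    def run(self, sql: str) -> str:
        self.queries_run += 1
        return f"conn {self.conn_id}: {sql}"


class Multiplexer:
    """Shares a small, fixed set of connections among many clients."""

    def __init__(self, pool_size: int) -> None:
        self.connections = [FakeConnection(i) for i in range(pool_size)]
        self._idle = queue.Queue()
        for conn in self.connections:
            self._idle.put(conn)

    def execute(self, sql: str) -> str:
        conn = self._idle.get()      # blocks until a connection is free
        try:
            return conn.run(sql)
        finally:
            self._idle.put(conn)     # hand it back for the next client


def simulate(clients: int, pool_size: int) -> int:
    """Run one query per client thread; return total queries executed."""
    mux = Multiplexer(pool_size)
    threads = [
        threading.Thread(target=mux.execute, args=(f"SELECT {i}",))
        for i in range(clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(c.queries_run for c in mux.connections)


print(simulate(clients=200, pool_size=4))  # 200
```

    Each connection is logged in once and reused, so the login latency is paid `pool_size` times instead of once per client.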
  • by Agent 00p ( 568873 ) on Friday December 12, 2003 @09:35AM (#7699879) Journal
    They don't have to put all their data into one database, though ...,
  • Re:No IMS? (Score:5, Informative)

    by John Harrison ( 223649 ) <johnharrison@[ ]il.com ['gma' in gap]> on Friday December 12, 2003 @09:35AM (#7699881) Homepage Journal
    Google is your friend. [google.com]

    IMS is the database that was used to keep track of things for the moonshot. It is an IBM product. It is hierarchical as opposed to relational. Because of this it can do certain things very quickly, though in general it isn't as flexible as say DB2. Because it has been around so long, applications where having a DB was really important tend to have bought IMS a long time ago and developed systems around it. If your system is old enough, large enough and still works well for you there is no need to migrate to relational. Most of the world's financial transactions pass through an IMS system at some point. It is very stable and has uptimes that measure in years if not decades by now.

    Because of this I am surprised that it is not on the list. There are really big IMS databases out there that run a lot of transactions. Because it isn't relational there is some bigotry against it and it is ignored in the popular press.

  • SMP? (Score:5, Informative)

    by paulbd ( 118132 ) on Friday December 12, 2003 @09:36AM (#7699893) Homepage
    Does anybody believe that the "SMP" used in reference to the France Telecom DB means "symbol manipulation program" rather than "symmetric multiprocessing"? How are we supposed to take seriously a study (or at least a report about the study) where they just look up acronyms with no understanding?
  • by Anonymous Coward on Friday December 12, 2003 @09:38AM (#7699906)
    but they're not running a big Oracle or IBM style database, they're using a content management and static file system.
  • Re:SQL Server? (Score:5, Informative)

    by azaris ( 699901 ) on Friday December 12, 2003 @09:43AM (#7699943) Journal

    Typical Microsoft calling their product something generic that should apply to any SQL server. Almost like calling a product .. Windows.

    It was originally called Sybase SQL Server but was later picked up by MS, who adopted the name. Typical /. objectivity.

  • by kiwimate ( 458274 ) on Friday December 12, 2003 @09:50AM (#7699974) Journal
    At least they don't try to hide it in three point text -- it's right there on the main page. But, anyway...if you want to see another (MS) view, look here [microsoft.com].

    By the way, I must just grumble at the lack of knowledge some people have on SQL Server. I sat in a meeting a few weeks ago with our Oracle-centric architects who decided that, as SQL Server is being used more and more extensively in our company, they'd better understand something about it. They started asking us various questions which rather puzzled me until I thought I knew what the problem was. "You do realize that SQL Server uses transaction logs, don't you? And that it implements transactional integrity, so, for example, will roll back an incomplete transaction?" Blank stares. "Really? Huh, we just assumed it wouldn't have those features because it's not a real database." Well thanks, guys, for doing your homework and being Oracle defensive on the basis of a good solid knowledge of the issues. At least SQL Server doesn't store internal passwords in a table that I can easily run a SELECT query on. Yes, I know they're encrypted -- but SQL*Plus is quite happy to allow me to copy and paste the encrypted password into the authentication dialog and accept that as a valid logon.
  • Re:Google (Score:2, Informative)

    by KarmaPolice ( 212543 ) on Friday December 12, 2003 @10:13AM (#7700135) Homepage
    What about visa/mastercard/american express?

    IMHO some of them didn't want to be in that list.


    If you look at "database size", number 4 is listed as anonymous. They probably aren't too interested in telling everyone what database and platform they are using to store very critical data.
  • by Anonymous Coward on Friday December 12, 2003 @10:18AM (#7700175)
    The size of the database isn't all that interesting. What is more important from a maintenance and reliability perspective is size in relation to average and peak loads. Who cares if you have 3TB of data in MS SQL Server if it takes you 10x longer to run the same query than it would on Teradata or Oracle? For small databases, who cares; any of the major databases can handle several GB of data without any problems. But there is a huge difference between Teradata, Oracle, Sybase, DB2 and MS SQL Server. From first-hand experience, SQL Server can't handle concurrent queries worth shit. You have to run your queries in an async fashion and have the clients pick up the results later on. Compared to DB2, Sybase and Oracle, MS SQL Server's scalability under heavy concurrent load, without some middleware in between, blows.

    Obviously, you would be crazy not to use some middleware, but things aren't as simple as any of the PR guys claim. Running queries asynchronously creates a different set of problems and complicates the entire architecture. If you look at the biggest installations, they all use middleware and most of them use Tuxedo. This includes most, if not all, MS SQL Server deployments. OLEDB can't handle that kind of load and neither can standard COM+. Just read the full disclosures for TPC. You'll see all the MS SQL Server tests wrapped Tuxedo with COM+. As much as Microsoft likes to slam EJB and Tuxedo for being too expensive, you can't scale SQL Server without using Tuxedo for really heavy deployments.

  • Re:Google (Score:5, Informative)

    by Wastl ( 809 ) on Friday December 12, 2003 @10:35AM (#7700299) Homepage
    The term "database" is rather imprecise.

    One might see a database as merely a "big file" with mechanisms to access and modify it consistently (and surely, Google has some means to ensure consistency). A big file does not disqualify for the term "database" just because it is not produced by one of {Oracle, MS-SQL, ...} or cannot be queried by the language SQL.

    It is also possible to consider the Web to be a database (of Web sites). Or an XML, BibTeX, dbm, whatsoever file.

    Sebastian

  • by Anonymous Coward on Friday December 12, 2003 @10:53AM (#7700486)
    We have databases in our organization (Star Schema, Red Brick) where the fact tables literally have billions of rows. I'm sure there are many other organizations (especially government entities) that have huge databases not on this "list". For those interested in operating at this scale, other interesting hardware/software data mining solutions in the same vein as Teradata are Netezza Corp's database appliances.
  • Re:AmEx (Score:3, Informative)

    by hrieke ( 126185 ) on Friday December 12, 2003 @11:25AM (#7700876) Homepage
    I used to work for a company called Epsilon Data Management[1], in Burlington MA. They've been bought since I left them a while ago, but they were the keeper of the AmEx customer transaction database for data mining and direct marketing (junk mail and phone calls).
    Big. 7 data silos big. Each silo holds 50k tapes, each tape was 30GB, and it usually took 4 days to load.

    [1] Epsilon was originally an AmEx division, which was spun off to keep other customers happy (banks and other CC companies).
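    For a sense of scale, the silo figures above multiply out as follows. A rough sketch, assuming every tape slot is full, tapes hold exactly 30 GB uncompressed, and units are decimal (1 TB = 1000 GB); none of those assumptions is stated in the comment:

```python
# Upper-bound capacity from the figures quoted in the comment.
SILOS = 7
TAPES_PER_SILO = 50_000
GB_PER_TAPE = 30


def total_capacity_tb(silos: int, tapes_per_silo: int, gb_per_tape: int) -> float:
    """Return total tape capacity in decimal terabytes."""
    return silos * tapes_per_silo * gb_per_tape / 1000


print(total_capacity_tb(SILOS, TAPES_PER_SILO, GB_PER_TAPE))  # 10500.0
```

    That is about 10.5 PB of raw tape, an upper bound rather than the size of the stored data.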
  • Re:My porn database (Score:2, Informative)

    by Oopsz ( 127422 ) on Friday December 12, 2003 @11:29AM (#7700927) Homepage
    How about automatically categorize, find, and download?

    It exists, and it's open source. Welcome to the wonderful world of porn-get [lesbian.mine.nu].
  • Re:SMP? (Score:3, Informative)

    by RapaNui ( 242132 ) on Friday December 12, 2003 @11:32AM (#7700975)
    Yup.

    Methinks the character who wrote the article came across the term 'SMP', went to FOLDOC or The Jargon File, and whaddya know - the first hit returns 'Symbol Manipulation Program - Stephen Wolfram's yadda yadda yadda'.

  • by jgerry ( 14280 ) * <jason...gerry@@@gmail...com> on Friday December 12, 2003 @11:48AM (#7701199) Homepage
    How do they backup a database that is 94.3 TB?

    I support very large Oracle databases for a living (very large meaning > 1TB), databases that must be up 24/7. Backups are done in a number of different ways:

    1) Disk syncs, block by block, between disk subsystems at disparate locations, to retain multiple copies of a database in different locations. They can be synced to more than one location too, so you can have as many copies of the database as you want. Your main database is the only "hot" database, the others can be brought up and recovered if needed. We mainly use EMC disk subsystems to do this, the process is called BCV (can't remember what that stands for right now)

    2) Real-time replication. One-to-one or one-to-many. All databases are "hot" at all times. This can be great for load balancing too, since you can have multiple systems online at the same time. Very difficult to maintain and monitor.

    Large databases just can't be put to tape anymore. Even if you did, it would take days or weeks to recover them if they failed. Disk to disk is about the only way to provide backups for really large databases.
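    The "days or weeks" claim is easy to sanity-check. A minimal sketch, assuming a sustained per-drive tape throughput of 30 MB/s; that number is a placeholder for illustration and does not come from the thread, and `restore_days` is a hypothetical helper:

```python
SECONDS_PER_DAY = 86_400


def restore_days(size_tb: float, mb_per_s_per_drive: float, drives: int = 1) -> float:
    """Days needed to stream size_tb (decimal TB) back from tape,
    assuming all drives run in parallel at a constant rate."""
    if mb_per_s_per_drive <= 0 or drives < 1:
        raise ValueError("throughput and drive count must be positive")
    total_bytes = size_tb * 10**12
    bytes_per_second = mb_per_s_per_drive * 10**6 * drives
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


# The 94.3 TB database from the question, with the assumed 30 MB/s drives.
print(round(restore_days(94.3, 30, drives=1), 1))   # 36.4
print(round(restore_days(94.3, 30, drives=10), 1))  # 3.6
```

    This ignores tape mounts, seeks and log replay, so real recovery would take longer than the streaming time alone.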
  • by BigGerman ( 541312 ) on Friday December 12, 2003 @12:27PM (#7701711)
    To add to that,
    Standby databases are popular: in the Oracle scenario, the archived log files from your hot production database are constantly and automatically applied to a cold standby database in a different location, and if something happens to the primary it takes very little time to bring the standby up.
    Also, Oracle hot backup is by nature incremental: you can do, say, one tablespace per night and don't have to do the whole database at the same time (while backing up all the archived log files). I have seen sites where the last cold backup was done something like 4 or 5 years ago.
  • Re:Google (Score:3, Informative)

    by MattRog ( 527508 ) on Friday December 12, 2003 @12:37PM (#7701839)
    A database is any collection of data. A database management system (which is what most people erroneously call a database) is a system of programs (say Oracle/MS SQL) to maintain the data in a database.
  • by Lovepump ( 58591 ) on Friday December 12, 2003 @01:02PM (#7702194)
    BCV - Business Contingency Volume I think. We call it Snap backup'ing.

    When we dump data, it gets dumped to a VTS (that's a Virtual Tape System, which is a whopping collection of disk, or DASD, pretending to be loads of cartridges). Once the data is on the VTS, it then makes its way to a selection of real MagStar drives which sit behind the VTS system.

    Works quite nicely.
  • by AnyLoveIsGoodLove ( 194208 ) on Friday December 12, 2003 @01:48PM (#7702775)
    As someone mentioned: a Business Continuance Volume is a local copy within the storage array. Sync times depend on the data change rate (dirty tracks). The host does not see any performance degradation. Copies are consistent from the app level down, if done right.

    SRDF = Symmetrix Remote Data Facility. It is a BCV copy across a link (network, fiber, DS3/1, OCs etc... fill in the blank). Again, it only copies changed tracks....

    Good stuff; this is how most of the financials recovered from 9/11 so quickly...

    The databases are then put to tape using the copies. When the DB exceeds a 24-hour backup time, you use multiple copies in rotation. Usually there's a regulatory reason to go to tape; otherwise people just use disk.
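    The rotation rule can be written down directly. A small sketch under a simplifying assumption: a new backup starts every cycle (24 hours by default) and a copy stays tied up until its tape run finishes. `copies_in_rotation` is a hypothetical helper, not an EMC tool:

```python
import math


def copies_in_rotation(backup_hours: float, cycle_hours: float = 24.0) -> int:
    """Number of disk copies needed when a new backup starts every cycle
    while earlier copies are still being written to tape."""
    if backup_hours <= 0 or cycle_hours <= 0:
        raise ValueError("durations must be positive")
    return max(1, math.ceil(backup_hours / cycle_hours))


print(copies_in_rotation(20))  # 1
print(copies_in_rotation(36))  # 2
print(copies_in_rotation(60))  # 3
```

    A real site would likely keep one more copy than this minimum so a failed tape run does not stall the rotation.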
