Hardware

Proposal For Open-Source Benchmarks 118

nd writes: "Van Smith from Tom's Hardware has written a proposal that calls for open-source benchmarking. He talks about the need to increase the objectivity of benchmarking. The proposal is basically to develop a suite of open-source benchmarking tools and new methodologies. It's a rather dramatic column, in which he discusses Transmeta, bias towards Intel, and other things. " Well, once you get through the initial umpteen pages of preamble, the generically named A Modest Proposal is the actual point. Interesting idea - but I shall weep for the passing of bogo-MIPs as the definitive measure of system performance. *grin*
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    Guy 1: raw INTEGER for his web server
    Guy 2: raw FPU/3DNOW and some INTEGER for Quake
    Guy 3: raw SSE for some scientific software
    Guy 4: raw MMX to watch German pay-tv illegally
    Guy 5: raw [insert CPU unit here]

    I buy my CPUs not based on benchmark numbers.
    I buy them because they have fast units that
    I like. (Athlon FPU, etc.)
  • by Anonymous Coward
    I've been using my own system for benchmarking cross-platform for close to three years now. I only use bogo-MIPs as a funny word I like to say. Say it a few times, you'll see what I mean. There's nothing quite like funny words. My favorite is "futon". I don't know why, but I think it's very funny. "I burned the futon" would make a good band name.

    But I digress. My benchmark system is built around the theory that packets are stuffed by weight not by volume. Kinda like potato chips. They always have that little disclaimer on the bottom of the bag. I don't really like potato chips much, but I generally like the green ones. The extra oil is tasty, and they're a little bit chewier. I like chewy foods. I also like green foods. I love broccoli.

    But I digress. So any random amount of data passed from one process to another (or one computer to another, or what have you) can be measured by a) the size in bits and b) the usefulness of data. For example, if a random packet contains information that just alerts another program that it's there, then it wouldn't really be all that important. Like people, I believe that a program should be able to operate independently. That way if the dependent processes die, the program doesn't instantly die. If no one I'm working with holds up their end of the project, then I'm going to be set behind writing extra code or configuring servers I hadn't intended to. If I just planned on that from the get-go, then it wouldn't be so much of an issue. Time is money, I've heard it said, and unfortunately I'm running out. Of both actually. It's not fun. Yeah, I thought it was pretty cool to be able to buy a house in cash, but now I'm on the ramen noodle diet for a couple months until I can build my savings back up. Which wouldn't be so bad if I could afford some dried squid to toss in the noodles. That's MIGHTY tasty. I'm left to doing the best I can with what spices I'm not out of. And garlic. I don't care HOW broke I am, I can always afford fresh garlic.

    But I digress. If you set up a system to compare the relative importance of a given data packet, weighed against its size, and multiplied by the time it takes to complete an operation, you can get a good measure of how efficient your system is. The coding is relatively simple. You need something that will run every possibility against the system in order to see where it performs well, and whether or not it gives priority to more important packets, regardless of size. Size is NOT everything. And no, I'm not as endowed as my ex-girlfriend told you, no one is that big. She's just trying to get you interested in me. Nothing personal, I mean, you're a nice person and all, and you have a really pretty smile, but you're just not my type. I don't think that you'd really like me so much once you got to know me. I don't know why the ex is so quick to push me off on someone. It's not like I'd even DREAM of touching her with a stolen dick. Oh, was that out loud.

    But I digress. My benchmarking tool is open sourced and available at . . . . I'm bored. I don't want to do this anymore. Sigh.
  • by Anonymous Coward
    Ah, I notice Transmeta mentioned prominently in the article summary.

    Those are the folks who say that "new benchmarks are needed." I imagine when your product doesn't excel with the present benchmark, you're inclined to want to change it.

    As Transmeta is spending a lot of time hyping their chips for portable use, maybe they should also scrutinize their plan to champion a server OS for use on their chips. Is the additional load of a Timesharing-type system really warranted, any more than desktop or server-grade benchmarks? Why tie up resources with all the clap-trap associated with a multi-user OS (i.e. group and user attributes in the filesystem are a real waste on a handheld device)?

    Come up with new benchmarks if need be.

    But also throw away quaint ideas from the 70's like 'everything ultimately devolves into being a teletype' and 'this is a TimeSharing system, supporting twenty-five users.'

    Probably wouldn't be a bad idea to think about scrapping stuff like termcap, either. Handheld portable devices should obviously still run sendmail and the NNTP server of your choice, however. Heck, here where I work we have a Palm-Pilot running that "L-whatever" OS duct taped to every printer, because we're just darned fond of LPD spools and other quaint stuff from the 70's.

    Hey, it's a new era of computing.

  • by Anonymous Coward
    I'm surprised no one has brought this up. Sites offering up benchmarks typically give USELESS information. All they ever give us is the arithmetic mean. No matter how much they tweak the collection process (throwing out the highest and lowest scores, for instance), we are still only getting the mean. I'd really like to see more comprehensive data, such as the median and mode scores plus the range of scores (see the small sketch after this comment). Fluke results can EASILY throw the mean out of whack, for one. Also, just having the mean gives you no way of knowing if scores are changing over time (say, caching bringing down execution time after multiple consecutive runs). Ideally, I'd like to be able to get the raw data from these benchmarks as well as the interpretation. As it is, we basically only get interpretations from these reviews and have no way of verifying the validity of said interpretation.

    Open methodology is at LEAST as important as an open-source benchmarking tool. Knowing how the numbers are being generated is definitely very important. It is also EXTREMELY important to know how those numbers are being massaged.

    As for the issue of a vendor modifying the benchmark to skew results, two things. First, who's to say they don't already, by influencing how the closed benchmark is written (admittedly less of a problem with benchmarks intended to be cross-platform)? Secondly, I think this can be solved with licensing. Require that the source code, including modifications, be distributed with the benchmark. And require the source to be posted when you post numbers generated with this benchmark (i.e., license the damned thing so people/companies cannot secretly modify it and post whacked numbers without letting us see the modifications).

    I think the idea of forcing these things into the open is an EXCELLENT idea. Even if the article (at Tom's, that is) is mostly an arrogant polemic with very little of real substance...
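
    A small illustration of the fluke problem, in C (the run times below are invented numbers, purely for illustration): one bad run drags the mean well away from the median, which is why the median and range are worth reporting alongside it.

    /* mean_vs_median.c -- illustrative only; the run times are made up.
     * Compile: cc -o mean_vs_median mean_vs_median.c
     */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double da = *(const double *)a, db = *(const double *)b;
        return (da > db) - (da < db);
    }

    int main(void)
    {
        /* Nine "normal" runs plus one fluke (say, a cron job fired mid-run). */
        double runs[] = { 41.2, 40.9, 41.5, 41.0, 41.3, 40.8, 41.1, 41.4, 41.2, 95.0 };
        size_t n = sizeof runs / sizeof runs[0];
        double sum = 0.0;

        for (size_t i = 0; i < n; i++)
            sum += runs[i];

        qsort(runs, n, sizeof runs[0], cmp_double);
        double median = (n % 2) ? runs[n / 2]
                                : (runs[n / 2 - 1] + runs[n / 2]) / 2.0;

        printf("mean   = %.2f s\n", sum / n);      /* ~46.5 s, dragged up by the fluke */
        printf("median = %.2f s\n", median);       /* 41.2 s, close to a typical run   */
        printf("range  = %.2f .. %.2f s\n", runs[0], runs[n - 1]);
        return 0;
    }
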
  • by Anonymous Coward
    Quake is a dangerous benchmark, because Carmack is too good... Few apps are ever optimized the way the Quakes are. The core code is small, and will fit in caches easily. Example: the Coppermine has dethroned the Athlon in Q3A because of its matchless cache, but in CAD and scientific apps the Athlon still rules due to raw FPU horsepower. See Anandtech for benchmarks that include both types of apps on 1 GHz CPUs.
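
    A rough sketch of the cache effect, for the curious (the sizes, the access pattern, and the use of clock() are arbitrary choices, and the size of the gap varies wildly between machines): the same number of accesses slows down once the working set no longer fits in cache, even though the arithmetic is identical.

    /* cache_sketch.c -- toy illustration of cache-resident vs. memory-bound work.
     * Sizes are arbitrary; adjust for your machine. Compile: cc -O2 cache_sketch.c
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Walk 'len' ints (len must be a power of two) until 'total' accesses are done. */
    static double walk(const int *buf, size_t len, size_t total)
    {
        volatile int sink = 0;               /* keeps the loop from being optimized away */
        clock_t t0 = clock();
        for (size_t i = 0; i < total; i++)
            sink += buf[i & (len - 1)];
        clock_t t1 = clock();
        (void)sink;
        return (double)(t1 - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        size_t small = 16 * 1024 / sizeof(int);        /* ~16 KB: fits in L1       */
        size_t large = 16 * 1024 * 1024 / sizeof(int); /* ~16 MB: blows out caches */
        size_t total = 256UL * 1024 * 1024;            /* same work in both cases  */

        int *buf = malloc(large * sizeof(int));
        if (!buf) return 1;
        for (size_t i = 0; i < large; i++)
            buf[i] = (int)i;

        printf("small working set: %.2f s\n", walk(buf, small, total));
        printf("large working set: %.2f s\n", walk(buf, large, total));
        free(buf);
        return 0;
    }
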

  • by Anonymous Coward

    SPEC is the best benchmarking suite I've seen, but it has shortcomings which I've been itching to fix with my own benchmarking suite. But it's a huge project, and I haven't had time to do more than poke at it a little. I'll put up what notes I have on my web site here [flyingcroc.net] when I have the time, but they're not there right now.

    A brief summary: I think SPEC has the right idea, in that a benchmark should consist of a suite of real-life applications which are only allowed to be optimized in limited ways (to accurately represent how applications are optimized in the real world), that the components should be like the applications the target audience is interested in running, and that distinctions should be made between applications which stress different parts of the system. I think the target audience could be broadened considerably by selecting a slightly different set of applications, and I think that in addition to an int and fp sub-suite (which stress only the CPU and memory subsystems, to a large degree), there should be a third sub-suite which uses applications with more holistic demands on the system -- system calls and filesystem. I think that the purpose of a benchmark should be to enable the user of the system to predict how well the system is going to perform, and filesystem performance often has a large impact on real-life performance. For better or for worse, this will make the choice of OS and disk subsystem a much more important factor in determining the results of the benchmark, explosively increasing the number of reference systems necessary to generate useful results, but if such a level of complexity is necessary to accurately portray reality, then that is the level of complexity that we should have.

    Making the benchmark open source adds a new level of agenda to the benchmarking effort. It makes it in the best interests of hardware vendors to see better optimizations in the compilers used, and if the level of authenticity assigned to the benchmark report is partially dependent on the openness of the compiler used, then that could mean more corporate contributions to open-source compilers. It would also help avoid the use of bogus compilers which are only really useful for compiling SPEC components, which is a big problem with SPEC right now.

    -- Guges

  • by Anonymous Coward on Friday April 14, 2000 @07:54AM (#1131893)

    You know I'm getting somewhat sick of the whole open source thing. At first I thought it was a Good Thing, a way to allow people to collaborate on code and to keep it from being stolen. But gradually I am becoming more and more cynical about it - not so much the concept, but more the zealotry that surrounds it.

    Just look at the title of the article linked in this story - "A Call to Arms - A Proposal for Open-Source Benchmarks". WTF? Why is this a call to arms? Isn't this just a bit rabid for what is, after all, just an article about benchmarks? Benchmarks may be important, but they're not worth getting worked up over.

    And then the first page of the article is a rambling piece of tabloid "cyber"-journalism far worse than even Jon Katz has ever managed. Why is this diatribe necessary? Surely we all know what open-source is, and we all realise that the net has changed a lot of things. No, it's the same thing I see again and again - the zealotry of the open source proponent who feels the need for grand rhetoric and buzzword-filled arguments.

    There is an ideology behind open source, and a good one, but it has been taken too far. Richard Stallman is not the best person to represent such a diverse group of people - his radical politics and hatred of commercialism make him quick with the denunciation of anything he disagrees with, like the name Linux - after all, he'd rather it was "GNU/Linux" or even worse "Lignux". This kind of ideological zeal is certainly putting me off the idea, and others I'm sure too, but there seems to be a never-ending parade of people willing to subscribe to his beliefs and zealotry.

    Anyway, what I'd like to see is a return to what open source is about - writing good, free code for the use of all. There's no need for flaming attacks on closed-source software or whatever - that shouldn't be the point of open source, and is just a waste of time better spent coding. Unfortunately /. seems to provoke this kind of hysteria, but even with this I'll still read it :)

    If you disagree, feel free to reply. Nicely :)

  • by Anonymous Coward on Friday April 14, 2000 @07:55AM (#1131894)
    Tom: "Open Source Babble Transmeta Crusoe Linux Ramble Internet Cyber-World Paradigm Revolution"
    Slashdot Multitudes: Yay! (clapclapclapclap)

    Jon Katz: "Open Source Babble Transmeta Crusoe Linux Ramble Internet Cyber-World Paradigm Revolution"
    Slashdot Multitudes: Windbag! Parasite! Media Whore!(boooo, hissssss)
  • Lignux?!? How could anybody forget the ever-popular "Gnulix"? Ahh, the old-school /. trolls, hehe...
  • Just look at the title of the article linked in this story - "A Call to Arms - A Proposal for Open-Source Benchmarks". WTF? Why is this a call to arms?

    Because most of the section names in the article, "A Call To Arms" included, are the names of episodes of Babylon 5.

    they'd be really embarrassed to sell a coupla million worth of mainframe with a benchmark figure lower than that of an Alpha costing a couple of grand.
    Even though mainframe processors have been getting pretty fast, I think this clashes with your earlier statement about mainframes being I/O beasts (which is completely accurate). What embarrassment is there in the fact that a system doesn't do well in a scenario completely different from what it was designed for? That'd be like complaining that the diesel locomotive you just bought sucks on 0-60 performance.

    Besides, any IBM salesdroid worth his commission would mention the RS/6000 line.
    __

  • by pb ( 1020 ) on Friday April 14, 2000 @11:02AM (#1131898)
    Anyone remember the Benchmarking HOWTO?

    There are *lots* of open-source benchmarks, and of course we can make new and better ones, and get a test suite together.

    For starters, the LBT (Linux Benchmarking Toolkit):
    Run the BYTEmarks (and the old UNIX ones too, they're funny), Whetstone, XBench... oh, and compile a stock kernel (and don't fiddle with the options, 2.0.0 was recommended then.)

    Personally, I'd also suggest bonnie; it's a good benchmark for disk performance, but you'd have to have a range of options here (you're testing disk performance and cache, so you'd really want a large file size too, just to be fair - 2*RAM?).
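
    A crude sketch of why the file has to be big - not bonnie itself, just the idea (FILE_MB is a placeholder you'd set to roughly twice your RAM; the block size and the use of time() are arbitrary): if the whole file fits in the page cache, you end up timing memory, not the disk.

    /* disk_write_sketch.c -- crude sequential-write timing, in the spirit of bonnie.
     * Set FILE_MB to ~2x physical RAM so the page cache can't hide the disk.
     * Compile: cc -O2 disk_write_sketch.c
     */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define FILE_MB   512                /* placeholder: use ~2x your RAM for a real run */
    #define BLOCK_SZ  (1024 * 1024)

    int main(void)
    {
        static char block[BLOCK_SZ];
        memset(block, 0xAB, sizeof block);

        int fd = open("bench.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        time_t t0 = time(NULL);
        for (int i = 0; i < FILE_MB; i++) {
            if (write(fd, block, sizeof block) != (ssize_t)sizeof block) {
                perror("write");
                return 1;
            }
        }
        fsync(fd);                       /* force the data out to the platters */
        time_t t1 = time(NULL);

        close(fd);
        unlink("bench.tmp");

        double secs = (double)(t1 - t0);
        if (secs < 1.0) secs = 1.0;      /* time() is coarse; avoid divide-by-zero */
        printf("wrote %d MB in ~%.0f s (~%.1f MB/s)\n", FILE_MB, secs, FILE_MB / secs);
        return 0;
    }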

    Also, when RedHat boots up, it has those RAID checksumming tests, those are good. They test different implementations of the same algorithm, so they say a lot about the individual chip. (whether it likes MMX, works well with different optimizations, and whatnot)
    ---
    pb Reply or e-mail; don't vaguely moderate [152.7.41.11].
  • dmesg | grep -i bogo
    Calibrating delay loop... 897.84 BogoMIPS

    ...coming from a K6-2 450. Now everyone has fast boxes these days... what's the LOWEST BogoMIPS you've ever seen, and what CPU was it runnin'? :)

    -----
    If Bill Gates had a nickel for every time Windows crashed...

  • Until I decided to look it up just there, I had no idea what bogo-MIPS was. Enlightenment can be found either here [kinetic.org] or here [enlightenment.org].

    -- Michael

  • Umm, Tom's Hardware is not all that reputable, especially after that fiasco a while back about his involvement with NVIDIA. I think he's apologized, but the suspicion is still there. Frankly, I don't really trust ANY website with accurate benchmarks - I trust my own judgement after I read all of the benchmarks on all sites.

    Tom and his hardware aside, I think that open benchmarking tools are a good idea. However, we might see a different set of problems, in that if the hardware company knows exactly what code is going to be executed to benchmark their product, they can optimize/cheat for that code.
  • uh no, try again:
    http://www.spec.org/cgi-bin/order/ [spec.org]
    those don't look like open-source prices to me...

    --Siva

    Keyboard not found.
  • MIPS = Millions of Instructions Per Second.

    Talking about "a MIP" or "bogo-MIPs" is absolutely idiotic.
  • I've wanted for a while to devise a standard benchmark suite, distribute it to users, and let them upload their results and system configuration info to a server. Then we could have a web site which could recommend alternate configurations and small upgrades that could make a big difference, and list what the best systems for certain applications and in certain price ranges are. Damn would that be cool. If anyone else is interested, please email me.
  • Perhaps I'm ignorant but I thought FPS for Quake 3 in high quality 1600x1200 was the standard :)

    "I can only show you Linux... you're the one who has to read the man pages."

  • Yup, I'm aware that mainframes are relatively slow. They do, however, generally run under a much higher 'load' than a typical midrange system.
    --
  • by IntlHarvester ( 11985 ) on Friday April 14, 2000 @01:08PM (#1131907) Journal
    Back in the old days, Cadillac shipped cars with 472 and 500 cubic inch engines (about 8 liters in modern terms). These things put out nearly 400 HP and buttloads of torque. With the exception of some muscle cars and the Corvette, Cadillacs were the fastest cars GM built.

    But, nowhere in their advertising did they mention the size of the engine or the amount of power or anything about "performance". Back in those days everyone just knew Cadillacs had plenty of power. I suspect it's the same with IBM and their mainframes - just too much reputation to even advertise.
    --
  • by BrianH ( 13460 ) on Friday April 14, 2000 @04:12PM (#1131908)
    If a benchmark could be written that would accurately simulate real-world applications, then I'd say let them optimize their hardware/drivers for it. If the benchmark is good enough, then any optimizations made for the benchmark should also cause a performance increase in your genuine applications. Of course, therein lies the trick. Can you make a benchmark that realistic?
  • I think that one possible use for open-source benchmarks is benchmarks which _can_ be tweaked for individual processors.

    To explain: define a set of tasks (this could include some of the same set of tasks as some of the current synthetic benchmarks), but define them by the algorithm that must be used rather than the implementation. Then write a C/whatever standard version that implements that algorithm as well as possible to use as a base. _Then_ allow the proponents of particular platforms to modify a version of the code (possibly using #ifdefs or whatever to keep it in one code base) as long as they use the same algorithm.

    One possible test (I'm only using it as an example, not suggesting it) would be to calculate a certain portion of the Mandelbrot Set down to a depth of 10000 and put the results in an array of a certain structure, where it must be done using brute force with a precision of at least 40 binary significant digits (i.e. 64-bit longs or doubles) ... edge following not allowed. Part of doing the whole benchmark is doing the test n times, where the position of the result array keeps moving. With that, we'd start with some base code that does the job fairly well, then people can add #ifdef PPC_G3, #ifdef AMD_K6_2 and write pieces of code (using assembler if they like) to speed it up for their favourite architecture. A little bit of competition could be fun :-). (A skeleton of this layout appears at the end of this comment.)

    The current distributed.net RC5-64 effort could be considered an example of such a benchmark - processor tweaks are fine as long as you solve the problem.

    Open source can be used to prevent cheating, in that it can be seen that everyone is following the correct algorithm (or by strict review by a trusted organization, as in the case of RC5-64). It also means that people can look over the tweaks for other platforms and see if any of them are applicable.

    The rationale for this approach:
    (1) change the rules so that what is currently 'cheating' becomes part of the process - it becomes very difficult to cheat.

    (2) A lot of 'real world' applications like Photoshop and Quake are presumably using these sorts of tweaks for their inner loops anyway, so this is mirrored by allowing the same tweaks in the tests.

    This idea has several downsides:
    (1) it can only provide synthetic benchmarks, and on fairly small examples (so optimizing it for particular architectures doesn't require huge resources)

    (2) it only tests the speed that can be got using assembler ... how good the compilers are doesn't really get factored in.

    (3) It requires each platform to have some advocates good enough and willing to put time into optimizing code so every platform gets a fair go.

    (4) because the tests are so small, it needs a moderately large number of individual benchmarks - for instance RC5-64 on its own is useless since it doesn't test memory speed, and PowerPC and x86 architectures have the huge advantage of having rotate instructions.

    (5) rather than give a single number (which is what people tend to want), resulting benchmarks would give a set of results for various aspects of the chip - this would make the results of more interest to technically oriented people.

    I'd be willing to put a little work into PowerPC G3 and possibly G4(Altivec) optimization in such a project.

    A more extreme version of this idea is to allow algorithm optimization too ... like do the Mandelbrot example (allowing edge following etc.) as fast as you can as long as the precision of the results is up to standard. I think that this would require too much time on part of the optimization writers though.
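
    A skeleton of how such a Mandelbrot test might be laid out, with the per-architecture tweaks isolated behind #ifdefs as described above. The region, grid size, and macro names are all made up for illustration, and only the portable reference path is filled in.

    /* mandel_bench.c -- sketch of a fixed-algorithm, per-arch-tweakable benchmark.
     * The rule: every port must do brute-force escape-time iteration in double
     * precision; only the implementation of the inner loop may change.
     */
    #include <stdio.h>
    #include <time.h>

    #define WIDTH     512
    #define HEIGHT    512
    #define MAX_ITER  10000                 /* the "depth of 10000" from above */

    static int iterations[HEIGHT][WIDTH];   /* the result array the spec requires */

    static int escape_time(double cr, double ci)
    {
        double zr = 0.0, zi = 0.0;
        int n = 0;
        while (n < MAX_ITER && zr * zr + zi * zi <= 4.0) {
            double t = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = t;
            n++;
        }
        return n;
    }

    int main(void)
    {
        clock_t t0 = clock();

    #if defined(BENCH_AMD_K6_2)
        /* hypothetical: a hand-tuned 3DNow! inner loop would go here */
    #elif defined(BENCH_PPC_G3)
        /* hypothetical: a PowerPC-tuned inner loop would go here */
    #else
        /* portable reference implementation */
        for (int y = 0; y < HEIGHT; y++)
            for (int x = 0; x < WIDTH; x++)
                iterations[y][x] = escape_time(-2.0 + 3.0 * x / WIDTH,
                                               -1.5 + 3.0 * y / HEIGHT);
    #endif

        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%dx%d, max %d iterations: %.2f s (spot check %d)\n",
               WIDTH, HEIGHT, MAX_ITER, secs, iterations[HEIGHT / 2][WIDTH / 2]);
        return 0;
    }
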
  • I personally prefer the original A Modest Proposal [gutenberg.org].

    (...gee, where did all my Karma go?)

  • How about something which measures how fast Linux stocks are becoming worthless? You could maybe plot it against the frequency of ESR articles at Slashdot in which he tells all of you how rich he is.

    Notable Linux milestones today:

    • Andover.net's market cap drops below $100 million. Stock approaches single-digit value levels and hits its all-time low price after losing 28% of its value (so far).
    • VA Linux stock drops below its all-time low price as it loses another 15% today. After hitting 320 in December, the stock has consistently lost over 2 points a day, now sitting down 90% in 4 short months.
    • ESR's paper worth drops below $5 million (currently at $4.5 million) as VA Linux stock continues its crash. After his ever-so-humble article in December telling us how he was worth $42 million, one can only hope for his sanity that LNUX isn't a penny stock by the time ESR can actually sell any of his stocks in June.
    • Caldera drops to a mere one-eighth of a point above its all-time low price as it drops 14% of its value today.

    Short 'em to the floor, that's what I always say! :)

    Cheers,
    ZicoKnows@hotmail.com


  • Our plasma simulation group has several simulation codes which would be pretty good as part of an open-source floating-point benchmark suite--*provided* this benchmark suite is distributed under the GPL or Berkeley license.

    We considered giving our codes to SPEC, but SPEC wants to be able to *sell* their benchmark suite for $500 a copy. This caused us legal headaches, so rather than deal with them we didn't try to participate in SPECfp2000.

    We can offer C and C++ codes which exercise the FPU and memory subsystem heavily: they tend to be cache friendly though.

    PeterM
  • by Bowie J. Poag ( 16898 ) on Friday April 14, 2000 @07:39AM (#1131913) Homepage
    In an industry where hard disk capacities are still measured in 1,000,000 bytes per megabyte, and 19" monitors are still 17.9" viewable, what makes you think that any company would adopt a benchmarking standard that was actually impartial to their product? The whole point of benchmarking your own product is to give the marketing department something to crow about. So, logically, they gear their hardware (and choose their benchmarks) accordingly.

    Sure, it's a great thing for the rest of us, because we don't have anything we're trying to sell. Just don't expect anyone on the outside to hop on the bandwagon.

    Yours In Science,

    Bowie J. Poag
    Project Founder, PROPAGANDA For Linux (http://metalab.unc.edu/propaganda [unc.edu])
  • Quake has been GPL'd, why not use FPS at 640x480 in software mode? The machine gets a good workout and you test many features at a time. Use a standardized config and demo (crusher2?) and a known version (is it in CVS?) and you have a standard, cross-platform benchmark that is at least fun to watch. Works for me.
  • The 486dx2-66 beside me here is 33.18 bogoMIPS. It's running my webserver and a few other network services. (Heck, even X Windows isn't too bad on it anymore since I upgraded from 16M to 24M, but that's not usually running.)
  • Nothing's stopping them from doing that now with the current benchmarking tools.

    Having the source code for it will only make this trick slightly easier (less reverse engineering needed). Besides, if information leaked out that actual HARDWARE cheated on benchmarks, they would be under a LOT of criticism and I suspect they'd be caught rather quickly.
  • Oh where to start..

    Wasn't it the original PowerVR chipset that did this? Or was it one of the early Riva128 ones.. I'm thinking back, way back.. But I do recall there being a big stink over how if you renamed a popular benchmark's .exe file, its results dropped about 30% on a card whose name currently eludes me.

    It's no secret that companies cheat on benchmarks. Heck, ATI released an entire set of drivers (the Rage Pro TURBO drivers) that made a few benchmarks faster and a few real games SLOWER. Was there criticism and what not? Lord no. For some reason, it was expected in the 3D accelerator market. I'll paraphrase Brian Hook: "2 years ago, if you went into a trade show with vague specs and no real product, you'd be laughed out. Now it's a way of doing business."
  • Is this a benchmark proposal or an episode synopsis for Babylon 5? If it's the former, good luck getting it accepted, if it's the latter, it needs to work on the episode order.
  • What defines a benchmark? Is it not a measurement of the performance of one aspect of a system? Benchmarks should be open-sourced, and the community that uses the system(s) at large should define what the tests (torturous as they should be) actually test. That will determine the difference between fluff and actual fact.

    Of course, it's also Standard Operating Procedure to optimize products to perform well on Benchmarks specifically (I hear stories about compilers that seek out "Whetstones" or "Dhrystones" and will substitute hand-optimized machine code for 'em rather than just compile the code).

    Bottom line is, you can't trust 3rd party benchmarks. You need to test a system for your specific application. This, though, is prohibitively expensive for most applications. So you gotta rely on benchmarks.

    Therefore, make your benchmark as close to real-world use as possible! Especially if you're open-sourcing it. Then, optimizing for the benchmark is actually optimizing for real-world use.

    (The problem with this, of course, is that your real-world use may be dramatically different than mine. If I'm rendering 3D graphics, I have different needs than someone running, say, a web server. So this then requires a family of benchmarks, reflecting real-world usage in different domains of endeavor.)

  • Pretty well all Java VMs have varying speeds and degrees of optimization for a given CPU. If Java performance is important to you, then by all means, use a Java benchmark. Otherwise, you've got an open-source program whose performance is significantly affected by a non-essential program.
  • There was a reference to such a beast in Brave GNU World #11 [fsf.org]. However at that time it was still vaporware, and I can find no other mention of it.

  • the reason it was titled "A Call to Arms" was the Babylon 5 theme throughout the article. Every paragraph header was a B5 episode title. I don't think it was a real call to arms.

    --jeff

    "We've got a blind date with Destiny...and it looks like she's ordered the lobster."
    -The Shoveler, "Mystery Men"
  • Lowest Bogo-MIPS I've ever seen was .54 on a new 386/16sx ;) Now that's speed ;)
  • by Silverpike ( 31189 ) on Friday April 14, 2000 @09:43AM (#1131924)
    Ol' Tom has a good point. Sysmark really isn't the right solution for comparing processors. What he proposes is a realistic, achievable goal, but you have to define the playing field first.

    The Good:

    There already is a great benchmark for processors, and it's called SPEC [spec.org]. Yes, it's not open source, but it's really quite reliable for comparing CPUs of any architecture. As slashdot user "cweber [mailto]" pointed out in his post, they have been doing this for 11 years, and they periodically revise their benchmark suite to stress CPUs more uniformly.

    The open-source method. This is really good to ensure that there are no cheaters at the benchmark level.

    Tom's interesting ideas [tomshardware.com] on Crusoe. This stems from the fact that SPECmarks don't quite approximate the real usage that Crusoe depends on to use its hotspot optimizations. However, we are interested in the raw sustained speed of the processor (in this case), not the speed of the OS or its task-swap latency. Tough problems to solve.

    Open-source means that the benchmark code will be able to take advantage of the best compiler available for the target CPU (see comment at end).

    The Bad:

    Anyone who has done benchmarks knows that even small variations in system config can have strange or harmful effects on the benchmark results. This open-source effort is going to have to have a database of hardware configs in order for this to be useful.

    The Ugly:

    Vendors are going to oppose this (at least not support it). Why? Because plain and simple they have an interest in promoting the most favorable statistics possible about their products. They want to keep feeding you "polygon fill rates" and "texels per second" because their card may not stand up in a direct test program comparison. Plus, they are just dying to convince you that they have new BogusMarketingAcronym (tm) technology and their competitor does not. Nevermind that SSE and 3Dnow do pretty much the same thing -- companies have an interest in differentiating themselves as much as possible.

    If this benchmark actually takes off (and gets widely accepted), we might get cheaters at the firmware or hardware level. This has happened before -- although which company it was and which benchmark they cheated I can't remember. I can't find it on the net or remember to save my life (sigh)...

    I also need to say something to the people who think a processor should be judged independently of a compiler. This is just plain dumb. Why? Because a processor and its compiler are a team. You can't use one without the other. When a chip is designed, there is a direct information dependence between the chip architects and the compiler writers. They are designed as a pair (ideally), and they should be tested as such. If a given compiler has great optimizations, then great! That means the compiler understands its target really well. It is a win for both the CPU and the compiler for pulling it off. This compiler is going to do the same kinds of optimizations when vendors use it to write programs, so that helps the comparison between benchmark code and apps.

    However, I can see the need to compare not only the best compiler, but GCC as well, because of its broad acceptance. But if you are serious about performance, and want to get every ounce of juice out of your chip, you use the vendor-provided compilers, not GCC. Don't get me wrong, GCC is great for compliance and portability, but it usually doesn't compare well with vendor compilers for generated code speed (with the possible exception of IA-32).

    Ars Technica also published [arstechnica.com], a while back, some good information regarding CPU benchmarks. Check it out if you are interested in SPEC or CPU benchmarks in general.

  • The only good benchmark is the one when you test a specific application you are interested in running on your particular hardware when it sits on your network running under your normal workload.

    Nuff said.
  • by bgarcia ( 33222 ) on Friday April 14, 2000 @09:44AM (#1131926) Homepage Journal
    He has a point about... wait a sec... Jon, is that you?
  • We already have standard benchmarks and benchmarking standards with SPEC. It's just not really open source. But as another poster detailed here, a SPEC equivalent should be pretty straightforward to implement as open source. Just who or what mechanism would make sure that no one cheats is another issue, though.
  • You would need a benchmarking standard, not just a standard benchmark. And that standard needs to be protected by a license or something to ensure that everyone plays by the rules.

    The benchmarking standard should state that code changes are not allowed, and it should detail how the benchmark is to be run, how it is to be reported, and where.
  • I know, SPEC isn't open source in the strict sense, but it IS a broadly accepted benchmark suite of which source is available, and it has served us well for the past 11 years.

    As well as a generic benchmark can serve, anyway. There is of course no substitute for checking out a box with your own apps and workload.
  • That is why open source benchmarks are a good idea -- not only does it allow people to improve on the code directly, but it lets people see exactly what is going on behind the scenes.

    That is a very bad idea, indeed. If the code base of the benchmark changes at all, none of the numbers are comparable between releases. This is exactly why tightly controlled benchmarks like SPEC have been successful. SPEC only changes every few years, there are clear rules about what you can and cannot do while compiling and running the benchmarks, and there are rules about how to report the resulting numbers.

    Insofar as one can trust generic benchmarks, SPEC has held up nicely and allowed us to superficially compare systems from Unix vendors with different CPUs, different architectures, and different OSes. Even with infrequent updates, the transition from one version of SPEC to the next gets in the way sometimes. I can only imagine how bad a true open-source solution without additional rules would be.

  • One important reason for open-source benchmarks is that people are able to evaluate the benchmarks themselves. You shouldn't blindly assume that a benchmark is worth anything only because it's published by a famous magazine, company or organization.

    I have produced a couple articles on Java performance (JavaLobby & JavaPro Aug'99) and unfortunately, some popular Java benchmarks exhibit various flaws (Caffeine, jBYTEmark, JMark). Even good benchmarks (Volano) are sometimes hard to analyse because I have no access to the sources and I don't know exactly what the benchmark is doing. Statistics without context are useless in the best case, and dishonest in the worst.
  • What happens when someone alters the source code? I mean, that's what open source is about. So somebody fscks around with the source code so that it works better with their product than anyone else's. It was still an "Open Benchmark", right? I think this would be even worse than the current situation.

    -----------

    "You can't shake the Devil's hand and say you're only kidding."

  • not only that, it also tests the stability of the system. i have been running the quake 3 timedemo on my linux system continuously for the last month and the computer hasn't crashed yet! let's see you do that with windoze!

    (=


    _______________________________________________
    There is no statute of limitation on stupidity.
  • ATI has done it before. They took the ATI Rage Pro and optimized the drivers on it... creating the ATI Rage Pro Turbo video card, which is the exact same card as the Rage Pro, but with a different driver which is optimized for Winbench 3D. It got much higher scores in Winbench 3D but no performance increase in real-world apps like Quake2. See this link on Tom's website [tomshardware.com] for more info.


    _______________________________________________
    There is no statute of limitation on stupidity.
  • I am currently running on a 386SX25 that gets a grand total of 3.74 bogomips.
    Surprisingly it works pretty well for day to day stuff but takes almost 5 hours to compile a kernel.
  • oh yeah.. riight. mainframe processors are sloow. they aren't your top-of-the-line number crunchers.. most machines now will put them to shame. the mainframe has always been great on I/O, but actually use one and you'll see that they don't have the performance of (say) a dual Alpha mainboard. IBM rates 'em in CPW or some other weird figure which gives users no information... they'd be really embarrassed to sell a coupla million worth of mainframe with a benchmark figure lower than that of an Alpha costing a couple of grand.
  • put it up on freshmeat as a group under the GPL. someone will find a use for it - whether as a benchmark suite or not.
  • ~0.80 Linux trying to boot under Amiga PCTask v4.0 :)
  • Look, it would be very nice to have some freely publishable benchmarks. Most benchmarks today include a license that says that you may not publish the results of the benchmark.

    The real problem is crappy benchmarks that don't measure real life performance. Take databases. Every database has a unique set of data with a unique DBA who tunes it in a unique way. It may not even be possible to build a truly neutral benchmark to accurately reflect real life performance.

    Also consider the fact that manufacturers will build in tuning tweaks to specifically perform better at some benchmark or another.

    If you are going to build benchmarks make sure they take all of this into account.
  • " Nevermind that SSE and 3Dnow do pretty much the same thing -- companies have an interest in differentiating themselves as much as possible. "

    Disclaimer: I have only seen the 3dnow and SSE instruction sets - I haven't used them. I have used the MMX instruction set.

    They are similar, yes. But 3dNow is slower - it's 64 bit (instead of 128 bit) and it doesn't have some of the nice instructions that SSE has. Of course, SSE was able to learn from 3dnow's mistakes. There is a difference - it's not just marketing.

    "But if you are serious about performance, and want to get every once of juice out of your chip, you use the vendor provided compilers, not GCC."

    Um - most users don't compile their own software. Most high-performance software is either:
    1. a game. Games tend to use assembly language for optimization.
    2. A scientific application. Scientific apps are usually expensive, and you can usually convince a company to let you benchmark it on their hardware before you buy the hardware.

  • Anyone who hasn't heard of or used the HINT benchmark but is interested in benchmarking... should go there NOW:

    http://www.scl.ameslab.gov/HINT/ [ameslab.gov]

    This benchmark measures performance along multiple dimensions (i.e. performance with respect to the size of the problem, the number of processors available, and the size of the cache). It's numerically intensive, so it gives a good idea of what scientific programming performance would be like, not, say, graphics performance...

  • 49.87 BogoMIPS i486dx2-100 btw
    My Pentium-90 says 36.04 BogoMIPS; my dual P75 says 29.90. Maybe I should upgrade to a 486. B-)
  • (i.e. group and user attributes in the filesystem are a real waste on a handheld device)?
    If I borrow your handheld, do you want me to be able to do the same stuff you can, or to be restricted to a guest account?

    What if an organization maintains a pool of handhelds, and you grab a different one every day? Or for each task?

    Even if you never loan it out and it's yours forever, having different users for administrative tasks is a Good Thing; you don't want to be root all the time.

  • I'm working on an FTP benchmark; see www.kegel.com/dkftpbench [kegel.com]
  • Oh come on folks!

    For goodness' sakes. I can't possibly be the first person to notice this.

    Didn't anyone else notice that the titles of all the sections in the article were titles from episodes of Babylon 5? I mean come on, "the Geometry of Shadows". "Babylon Squared"!!!!!!!

    Geeze. Am I the only sci-fi geek left around here?

    Absimiliard
    ------------------------------------------------ -------
    All sigs are lame, but I've got the lamest sig of all!
  • I've been reading computer magazines for years, and I am always amazed at the way that hardware companies will report benchmark results. Instead of explaining it in a simple, easy-to-understand manner, they will play on the ignorance of the average consumer, and throw terms like "triangles per second", "refresh rate", "maximum colours", and "frames per second" around, hoping to sound good!

    Several times, I have read ads for hardware that proclaim that they are faster than the competitor. They even have pretty bar graphs! And of course, their bar is much longer. However, here is a great place to repeat the long-unbelieved-until-now phrase "size doesn't matter."

    What is stupefyingly bizarre is that I can turn the page, find the competitor's ad, and they proclaim exactly the same thing! It's mind boggling, to say the least.

    That is why open source benchmarks are a good idea -- not only does it allow people to improve on the code directly, but it lets people see exactly what is going on behind the scenes.

    Or maybe we shouldn't let companies do benchmarks for their own products. After all, you can make stats say pretty much whatever you want them to. Or maybe we should just ignore ads completely. (I know that I don't trust magazine ads for sure. Rather, I go and find MULTIPLE reviews of hardware, both online and word-of-mouth, to get an accurate picture.)


    ,-----.----...---..--..-....-
    ' CitizenC
    ' WebMaster, PlanetQ3F [q3f.net]
    `-----.----...---..--..-....-
  • That isn't just some generic name, it's a reference to Jonathan Swift, author of (among other things) Gulliver's Travels. He wrote an extremely funny essay titled A Modest Proposal [upenn.edu], which has not at all modest recommendations about how best to feed Ireland. I wouldn't be surprised if that's what these authors had in mind when they chose the title of their paper...



  • ...this coming from the same people who used a leaked copy of the Quake3 IHV to conduct 'exclusive' benchmarks on video cards last year...

    Talk about the left hand not knowing what the right is doing...the problem lies not with benchmarks, which are never objective, but biased review sites (such as ones that bash 3dfx for years while running nVidia's advertisements on their frontpage) that can't (or won't) put them in the proper perspective.


    telnet://bbs.ufies.org
    Trade Wars Lives
  • % dmesg | grep Bogo
    Calibrating delay loop.. ok - 49.87 BogoMIPS
    (i486dx2-100, btw.) Got a friend who claims to have a Linux caching secondary DNS server on a 386 -- see if I can get him to give up the bogomips figure...
  • 56.0, Pentium 133. It's actually the highest I've ever seen, but I'm figurin' since I am the only open-sourcer who HASN'T made $350 million off an IPO in the past few months, my computing resources are an order of magnitude lower than everyone else's and it's safe for me to go for the highest =)
  • I know the people making benchmarking programs have not. Several give reports that are over the PHYSICAL limits of the hardware under test.

    Good benchmarking should start with a statement of the max and min from the specs published by the manufacturers of the sub-components and protocols.

    The next step is to avoid silly results such as an L1-cached instruction repeated to death. Generate a constant stream of interrupts, both soft and hard, from all sources... now try your benchmarking process.

    If you need proof of how idiotic benchmarking programs can be: Norton Utilities (DOS) would give a different SI CPU score based on idle movements of the user's mouse.

    A harassed box tells no lies. I can calculate how fast the instruction load times are from the spec sheet; what I need to know is: is it running in spec?
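
    A rough user-space approximation of the "harassed box" idea, assuming a POSIX system (the 1 ms timer period and the busy-work are arbitrary): run the same timed loop once quietly, then again while an interval timer keeps delivering signals, and compare the two.

    /* harassed.c -- time the same work with and without a stream of SIGALRM
     * "interrupts" landing on the process. Purely illustrative.
     * Compile: cc -O2 harassed.c
     */
    #include <stdio.h>
    #include <string.h>
    #include <signal.h>
    #include <sys/time.h>

    static volatile sig_atomic_t ticks = 0;

    static void on_alarm(int sig)
    {
        (void)sig;
        ticks++;
    }

    static double busy_work(void)
    {
        volatile double x = 0.0;            /* volatile so the loop really runs */
        for (long i = 1; i <= 50L * 1000 * 1000; i++)
            x += 1.0 / (double)i;
        return x;
    }

    static double timed_run(void)
    {
        struct timeval a, b;
        gettimeofday(&a, NULL);
        busy_work();
        gettimeofday(&b, NULL);
        return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
    }

    int main(void)
    {
        printf("quiet run:    %.3f s\n", timed_run());

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_alarm;
        sigaction(SIGALRM, &sa, NULL);

        struct itimerval it;
        it.it_interval.tv_sec = 0;  it.it_interval.tv_usec = 1000;   /* every 1 ms */
        it.it_value.tv_sec = 0;     it.it_value.tv_usec = 1000;
        setitimer(ITIMER_REAL, &it, NULL);

        printf("harassed run: %.3f s (%ld signals delivered)\n", timed_run(), (long)ticks);
        return 0;
    }
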
  • by zerodvyd ( 73333 ) on Friday April 14, 2000 @07:47AM (#1131952)
    to be truly objective, the actual benchmark code should be written in a cross platform capacity. I question the reliability of benchmarking software in general, go ahead and call me a skeptic or whatnot...but I stand by that claim. What defines a benchmark? Is it not a measurement of the performance of one aspect of a system? Benchmarks should be open sourced, the community that uses the system(s) at large should define what the tests (torturous as they should be) actually test. That will determine the difference between fluff and actual fact.

    ...just as long as they keep the BogoMIPS around I'm okay with it :) lol

    zerodvyd
  • The lowest correct value I have ever seen was 4.81 BogoMIPS on a 386SX-18. (AMD Elan, hardware-controlled variable clock, 2-18 MHz.) I did however run into the odd failure on a buggy Sunnylab MediaGX board that caused the BogoMIPS to be reported as 0.01 on one pass (after a lengthy hang) and 172.xx on the next with 2.2.4.

  • The ultimate example is IBM. Ask an IBM rep how fast the new mainframe model is. S/he'll try to buy you off with a relative performance index, or will tell you that it is X% faster than last year's, or twice as fast as their model with 1/2 as many processors, or that it is directly comparable with Sun's model. No mention of actual performance, no 'We're running eight PowerPC processors at X MHz, and each delivers a raw X FLOPS'. Sure, they won't stop you from publishing your own benchmarks, but they're not forthcoming either..
  • Lincoln did much the same thing until the late sixties.. Cadillac started claiming specs after the song 'Little Nash Rambler' hit the AM airwaves in 1956 or so (Rambler pulls alongside a proud Caddy owner topped out at 120 and asks how to get out of second gear)

    Unlike comparing the Ford 460 to the GM 427, now IBM uses commodity processors for most of its machines; you can directly compare, by virtue of the speed and number of processors (minus the OS and microcode fudge factor), an IBM mini to an SMP PIII, or an Altivec-enabled Mac, or an Alpha. Big Blue's mainframes still use somewhat in-house powerplants, but knowing that the 2001 390 is 1.25 times faster than the last revision isn't going to help you make a purchasing decision between it and a small Alpha cluster..
  • My point exactly (except of course signail11 made it first :). Open-source programs already constitute a great portion of some very standard benchmarks, the SpecINT and SpecFP. The thing is, people can still put in weird optimizations to make execution seem unrealistically fast. Like if there's an empty loop designed to, I don't know, test instruction dispatch speed, you can just optimize your compiler to skip empty loops: bam! 50% speed increase. (A tiny demonstration of this follows this comment.)

    What's really needed are some benchmarks which test realistic computer usage. The SPECs try to do this, or did, but it can be argued that a lot of those tasks aren't so common for today's users. How about a benchmark that tests how fast a PC can open Office 98, access 3 menu options, type a bit, save, type some more, define a macro, and exit? I haven't seen one of those. But it really would help some folks out.
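
    A tiny demonstration of the empty-loop problem (compiler behavior varies, so treat the timings as illustrative): an optimizer is entitled to throw away a loop whose result is never used, which is exactly how such a "benchmark" can be won in the compiler rather than in the silicon.

    /* empty_loop.c -- why trivial loops make bad benchmarks.
     * Try: cc -O0 empty_loop.c   vs.   cc -O2 empty_loop.c
     * At -O2 most compilers delete the first loop outright (its result is never
     * used), so the "benchmark" reports a near-zero time that says nothing about
     * the CPU.
     */
    #include <stdio.h>
    #include <time.h>

    #define N (500UL * 1000 * 1000)

    int main(void)
    {
        clock_t t0, t1;

        /* Loop whose work is dead: a legal target for dead-code elimination. */
        t0 = clock();
        unsigned long dead = 0;
        for (unsigned long i = 0; i < N; i++)
            dead += i;
        t1 = clock();
        printf("dead loop: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        /* Same loop, but the result is observable, so it really has to run. */
        t0 = clock();
        volatile unsigned long live = 0;
        for (unsigned long i = 0; i < N; i++)
            live += i;
        t1 = clock();
        printf("live loop: %.3f s (result %lu)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC, (unsigned long)live);
        return 0;
    }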

  • is that you daenion?

    -----------------------

  • I think I see what this poster is stabbing at... that if the benchmarking tools are open then hardware/driver/application developers will design their products to do well on benchmarks, and not necessarily to perform well overall... this is already a huge deal in the hardware 3D market, where the only number that REALLY matters is the Q3A timedemo score... which is ridiculous since this number measures very little other than the performance of texture-mapped triangle drawing...

    While I don't see this as a justification for trusting closed benchmarking tools, which could be doing far more sinister things (if (manufacturer != Microsoft) { busywait(); } ), it is definitely a problem that the tools themselves will have to address by being comprehensive enough to ensure that it's prohibitively difficult to optimize for that particular benchmarker.
  • Who will use it? Whoever wins, of course.

    Let's say I'm running AMD, and in the impartial benchmarks, our chips beat equivalent-speed Intel chips by 31%. I'd be a fool _not_ to use those benchmarks in my ads--and crow about how I win using impartial, open benchmarks.

    In the same way, if Adobe starts to worry about GIMP and releases a Linux version of Photoshop, and it blows away GIMP on the same machine using impartial, open benchmarks, they'll use those benchmarks in their ads.

    It's like when the PowerPC G4 did so ridiculously well on Byte's benchmarks: Apple's ads made very clear that the benchmarks came from Byte, and were impartial (hell, if anything, Byte was biased _against_ Apple).

    In the end, I'm not sure open benchmarks will improve things all that much. In the examples above, Intel can just make sure that the new PIII Argon improves over the PIII Xeon in the specific areas that will matter.

    And if there are too many benchmark suites, it just gets confusing. AMD can advertise that they win on benchmarks A, B, and C, and Intel that they win on D, E, and F--and then who do you believe?
  • The only way that'll give you any information is if you actually research all the benchmarks.

    Imagine you're comparing, say, the IBM and Motorola next versions of the PowerPC. They both advertise benchmarks where they clobber each other. In Motorola's case, they're using lots of stuff that takes advantage of Altivec; in IBM's, they're using lots of stuff that takes advantage of parallelism with multiple cores. If you go read the benchmarks and what they were designed for, then maybe you have enough information to figure out that for your usage, two chips with Altivec is better for you than a single double-core chip without Altivec, or vice-versa. But you could probably figure that out without even looking at benchmarks....

    But what is a non-biased test going to do in a situation like that? How much is the "right" amount of FPU usage for the test? Or 128-bit vector processing? It really depends on what you do.

    I don't think you can do better than specific application tests. If I spend most of my time waiting for a certain set of GIMP filters to do their thing, what could possibly be a better test for me than one that exercises those filters on the types of images that I use most often?

    So an ideal test would have to contain a huge variety of application subtests, and hopefully everyone could look at the particular subtests they care about.
  • But these _are_ benchmarks. Anything that you can test on multiple machines to compare them is a benchmark. And they can be "faked" in the same way as any benchmark--by improving your hardware, drivers, software, etc. for these particular uses.

    However, if this is the kind of thing you do all the time, then it's to your benefit if someone tweaks their hardware to improve it.

    Apple, with help from Adobe, has gone out of their way to make often-benchmarked Photoshop tests run faster on their machines. But this same effort also makes my actual work with Photoshop go faster--so I have no complaints.

    Unfortunately, the end result of optimizing one task may be that other tasks, like starting up Netscape or scrolling in Word, aren't as fast as they could be. Maybe that doesn't matter to you; maybe it does.

    The compilation test is a little more problematic. It's very easy to make a compiler run faster at the cost of making the ultimate compiled code slower, or even less accurate. That's obviously something you wouldn't be happy with.

  • ROTFFL! I really, really wish I hadn't blown my last moderator point today.
    ---
  • We all know that benchmarks can be looked at in many different ways depending on what you want to do with them. (For an excellent discussion on the topic, check out Raj Jain's book on Statistics and Modeling.) But the reality is simple: there needs to be *some way* to compare products. And an OSS solution would fit the bill nicely. At least with an OSS benchmarking tool, there is no mystery as to how numbers were derived and better yet, they give buyers a chance to do comparisons themselves.

    I'll pick on a popular tool as an example: WebBench. When you talk to folks who do hardware/software that is web related, you inevitably hear about WebBench. ("We get XXX connections/sec with WebBench.") For as terrible as WebBench is, it does provide a standard tool for testing. It's free (as in beer), and anyone can download it from ZDLabs. In other words, I can set up a testbed and compare my numbers with the manufacturers' numbers. If mine are substantially lower, I can go back and ask for an explanation. This gives me power as the buyer. As a developer, it gives me a way of figuring out whether or not my product can compare against another.

    But there are a lot of little problems with WebBench that really make it suck. For one, it only runs under Win98. This means I have to physically be at the test location to work with it. Another big problem is when I want to change the parameters of the test, I have to go restart all of my clients -- very tedious. And of course, my favorite -- try running WebBench for an overnight test. It'll crash after a few hours. =( If there was an OSS version of WebBench I could at least see why/where it's crashing. I could also benefit from a lot of other people who are using it to put contributions back to make it more efficient, feature rich, and stable. And the best part -- there are no doubts about how I tweaked the software to make it work better for my config.

    Does this mean all benchmarks will be fair? Hell no -- you still can't cure clever sales folks who know how to graph things just the right way to exaggerate their benefits over another product. But at least there will be less voodoo behind how their benchmarks were derived.

  • 'a modest proposal' is not a generic name. it is named after a short satirical writing by jonathan swift ("gulliver's travels," etc.), in which a person proposes to solve social ills by eating children...
  • I'm not a number cruncher or CAD aficionado, but one of the best benchmarks I've personally found to jibe with the 'feel' of a system is the Quake demos. If it gives me 100 frames per second in Q3, then it isn't going to slow down with anything else I throw at it either. 'Course, my system philosophy is CPU MHz, tons o' memory, tons o' cache, a kickass video card with tons o' memory, and a fast hard drive. (My current gaming system is a bit behind the times, but still outperforms a new $2500 Intel machine!) K6-3, 2MB L3 cache, 128MB RAM, and a TNT2 rules for me. And yes, this is a Socket 7. No, I don't buy top-of-the-line hardware. I wait for the price to drop first. I've got $600 in this one.
  • I actually get two totally different values for bogomips with my computer. Either 1359.87 or 679.94, which surprisingly is exactly half of the other one.

    The first one I get with a 2.3.99pre3 kernel and the second one with a 2.3.41 kernel. I have a 650 Athlon running on a 105 MHz bus, thus effectively running at 682.5 MHz. Superbypass is also enabled.

    So what does this say about the reliability of BogoMIPS, or can someone explain a little more?

  • AMD K6-III:
    dmesg | grep -i bogo
    Calibrating delay loop... 897.84 BogoMIPS
    I guess that extra cache doesn't mean extra BogoMIPS.
  • My job is to implement and/or alter open source or off-the-shelf software for various new services my company provides. Choice of software and platform is always *very* difficult, and, as you could imagine, performance and reliability play a large part in that decision.

    I would like to see a set of non-proprietary, transparent, applied benchmarks developed for important software packages, so that I could say "I will use that package, because..." without having to rely on rumour, guesswork, or sales literature -- or, worse still, watch the good projects get handed over to the M$ development people...

    I'm not saying it would be easy to develop well, and it still has huge capacity to be abused by those trying to credit their product (or discredit others'). But if someone can get it right, and gain the trust of the commercial world, it could also do the open source movement big favours in the corporate world. We *shall* have Apache on that server, we *will* use Samba, we *can* have OpenLDAP...

    Big ifs. I'll be quiet now.

  • Standard open source benchmarks are a fine idea (and I wouldn't call this article a "proposal", merely a "demand"), but the problem with this article is not that it is about open source, but that it was written by Van Smith, who is -- yes -- the Jon Katz of Tom's Hardware. Remember, this is the guy who proclaimed a couple of months ago that Intel would soon be exiting the microprocessor business. Right.

    The more clued-in Slashdotters have Jon Katz filters installed on the page, but we also need to figure out a way to get a filter on any link to a Van Smith article -- he truly is a moron, and one of the least competent "cyber-journalists" working today.
  • is translating theory into the practical. With the exception of some *very* use-specific benchmarks that I've seen, everything else has always been a very poor approximation of what someone *thinks* happens when a computer is actually used.
    As the old saying goes, "There are lies, damn lies, and statistics," and benchmarks are the most advanced form of statistics. Draw your own conclusions...
  • You weren't kiddin'! Skip past the fluff and start reading here [tomshardware.com].

    To eliminate hardware bias, write all the benchmark code to the lowest common denominator, perhaps Knuth's MIX/MMIX [stanford.edu] architecture. If you want to know how much your particular hardware is being under-optimized, run the benchmarks under HP's Dynamo [slashdot.org] or the equivalent.

  • Open-source benchmarks? Next thing you know I'll be open-sourcing my plans for world domination.

    Seriously, there are exactly two benchmarks that really make any difference:

    1. The general-purpose benchmark (e.g. SPEC) which gives a pretty good indication of how fast an architecture is relative to other architectures, and
    2. The specific benchmark, or "How much wall clock time will it take my application to run on this machine?"

    Of these two benchmarks, the second is obviously the more important. (Surprisingly, the Quake FPS freaks aren't far off from the truth here.)
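    And the tooling for that second kind can be almost trivial. As a sketch (mine, under the assumption that wall-clock time of your own workload is what you actually care about), here's a bare-bones time(1)-style wrapper:

        /* Bare-bones wall-clock wrapper: run whatever command you pass on the
         * command line and report how long it took.  A stripped-down time(1). */
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <sys/time.h>

        int main(int argc, char **argv) {
            if (argc < 2) {
                fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
                return 1;
            }

            struct timeval start, end;
            gettimeofday(&start, NULL);

            pid_t pid = fork();
            if (pid == 0) {
                execvp(argv[1], &argv[1]);
                perror("execvp");
                _exit(127);
            }
            waitpid(pid, NULL, 0);

            gettimeofday(&end, NULL);
            double elapsed = (end.tv_sec - start.tv_sec)
                           + (end.tv_usec - start.tv_usec) / 1e6;
            printf("wall clock: %.3f s\n", elapsed);
            return 0;
        }

    Point it at your own application with your own data set and you have benchmark #2.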

    Anyone that really wants to know about benchmarking should read the relevant papers or at least read The Art of Computer Systems Performance Analysis [barnesandnoble.com] by Raj Jain (and no, I'm not a shill for Barnes & Noble but that's one spot where you can get the book).

    If Van Smith got paid for that article, he should be forced to eat it... byte by byte (sorry, couldn't resist :-). -Brazilian

  • by xant ( 99438 ) on Friday April 14, 2000 @09:36AM (#1131973) Homepage
    Well . . . kind of the POINT of this whole exercise is to take the ability to perform referenceable benchmarks out of the hands of the interested parties (those who make money from them). Closed-source, commercial benchmarks are inherently flawed for some of the same reasons closed-source, commercial security is flawed. The difference is that those interested in finding and exploiting these flaws aren't crackers, but hardware companies.

    So to answer your question: Tom's Hardware, and other reputable benchmarking authorities, would use it. TH has rapidly become one of the highest-integrity, best-respected hardware/computing sites around, even (indeed especially) for the Windows crowd. (After all, Win32 is still the dominant gaming platform.) If such a thing as open benching became popular, then commercial entities would be FORCED to use the open benchmarks or be accused of marketing skewed numbers, whether those accusations had merit or not.

  • Check out the Polygraph team's page for an open source benchmark that has been a great success.

    "IRCache Polygraph Homepage" [ircache.net]

    For web cache benchmarking they are the gold standard, and everyone in the industry (even the much-maligned IBM) shows up to the IRCache events and uses Polygraph results in their ad copy.

    If you do it right, open source benchmarks can become the standard in an industry...in fact, I think if you do it right, they will become the standard in an industry. Good companies want a level playing field. Bad companies will be weeded out eventually...

    And that's all I have to say about that.

  • I don't know why so many people believe so strongly in processor benchmarks. Knowing how many integer operations a second a processor can do is nice, but it doesn't give you an overall picture. I have seen VERY few _good_ overall system benchmarking tools... maybe this will lead to more.
    You would think getting an unbiased test would be fairly easy... or perhaps it should be up to the manufacturers to make the benchmarks... that way they could bend the truth as much as possible... but if everyone is bending the truth, then it will all even out. :-)

    Yup.
  • I seem to remember that one of the older computer magazines (might have been Byte) used XScheme as the basis of their benchmarks. It was a readily available language, with source code, that had been ported to a wide number of platforms. Since it was interpreted, it tended to run slowly enough for meaningful timings. Could not something similar be done with Java -- specifically, one of the open source implementations? Write a set of benchmarks in Java using the subset of Java that currently works on the open source VMs and then try them on different platforms. You could even use the compilation time as a benchmark (using jikes maybe). It would allow cross-platform testing on at least semi-real-world applications.

    jim
  • Where do you get this from? For web server applications, I would want insane memory bandwidth, good SMP capabilities, and efficient cache handling/IO subsystems. For scientific applications, I wouldn't usually consider an x86 processor except in very specific cases, SSE or no SSE: the FP register stack just about kills parallelism, and SSE only offers 32 bits of precision. I really don't want to save 3 days of computation time only to spend 3 weeks using numerical analysis to hunt down pesky instabilities. To decode PPV signals, standard integer math is more than adequate (I believe it uses variable line-based rotations and offset phase shifting). Spec, TPC, and other benchmarks are *very* useful when buying high-end computer systems. Of course, in the end, what counts is performance on *your* application; that's why aggregate system benchmarks (WinBench 2000, etc.) are sometimes useful for home/office users who want a gauge of performance on typical tasks. Then again, if a user is going to be sending email or writing 2-page memos, the computer he/she is using won't matter much.
  • This comment deals with scientific programming; YMMV wrt game or graphics benchmarks.
    I *want* compiler writers and microarchitecture designers to optimize for reasonably well-designed benchmarks, such as Spec. I *want* compilers to recognize critical code fragments, idioms, and kernels in the Spec benchmarks and emit perfectly scheduled code. I don't care if the compiler can't make the optimization in the general case; when I write scientific code, I take care to use the standard style of writing certain common transformations, such as dot products, so that compilers that target SpecFP (Compaq's ccc and SGI's compilers are excellent in this regard) pattern-match the code and produce good code (see the sketch below). I want microarchitecture designers to include elements that make their chips run Spec fast, since if a benchmark in Spec runs quickly and my computational task is similar, it will most likely benefit from any architecture changes as well. Thus, selecting good benchmarks for a suite is utterly critical if the benchmark number is to have any value at all; moreover, there are many incidental benefits to selecting benchmarks that represent commonly used tasks or programs.
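    For illustration (my sketch, not the poster's code), the "standard style" in question is just the plain textbook form of kernels like DAXPY and the dot product -- written this way, a Spec-targeting compiler has the best chance of pattern-matching the loop and emitting its best scheduled code:

        /* Canonical DAXPY and dot-product loops, written in the conventional
         * form that optimizing compilers are most likely to recognize. */
        void daxpy(int n, double a, const double *x, double *y) {
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        double ddot(int n, const double *x, const double *y) {
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                sum += x[i] * y[i];
            return sum;
        }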
  • That wasn't quite my point. Open source programs constitute a minority of the SpecINT and SpecFP suites; each of the individual benchmarks is designed to be representative of the workload of a typical *scientific workstation*. Spec cares not about your 3D video card, your CD-ROM transfer rates, or your 3D sound card. There are no empty loops in the Spec benchmarks, and instruction dispatch speed is not tested as a discrete benchmark; it simply factors into the overall score. As for compilers optimizing specially (via pattern matching) for Spec idioms or code fragments, all the better luck to them! If the compiler can, say, sense the standard form for DAXPY or another common Spec kernel and emit inline hand-scheduled code for the fragment, I will be all the happier, since I can use the same fragment in my code and tempt the compiler into emitting specially optimized asm (this works especially well on the good compilers: Intel's reference compiler, Compaq's ccc, SGI's sgi-perflib/compiler suite).
    With regard to your comment about realistic computer usage: the tests you suggest complete in such a minute amount of time on any modern computer that they are _not worth testing_! It's essentially "fast enough" for any possible user; let's face it, the CEO's secretary does not need a P-III 800 or an Athlon, to say nothing of an Origin 2000, Starfire, RS/6000, or AlphaCluster.
  • by Signail11 ( 123143 ) on Friday April 14, 2000 @08:29AM (#1131980)
    I suggest basing an open-source benchmark suite on the existing Spec benchmarks, as most of the code (or functionally equivalent code) is relatively freely available. Of the 12 SpecINT 2000 benchmarks, 5 (gzip, gcc, crafty, perlbmk, and bzip2) already exist as open-source programs. The combinatorial optimization benchmark's (181.mcf) code is also on the Internet at www.zib.de, free for academic use; I'm sure someone could make a cleanroom implementation of something similar. 175.vpr (a place-and-route program) can be found at http://www.eecg.toronto.edu/~vaughn/vpr/vpr.html. 197.parser is essentially a CS student's problem about parsing and extracting strings. 252.eon is a raytracer (we can use POVRay instead). 254.gap is a general-purpose math library (Victor Shoup's NTL library exercises most of the same functions). 255.vortex is a standard RDBMS; MySQL or an equivalent could be used here. 300.twolf seems rather similar to 175.vpr; as circuit design is really far removed from my field, I'll leave this one to someone else.
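    If a suite like that came together, it would also need an agreed way to roll the per-benchmark results into one number. Spec's convention is to take, for each benchmark, the ratio of a reference machine's runtime to the measured runtime, and then report the geometric mean of those ratios. A sketch of that aggregation (the ratios below are made-up placeholders, not real measurements):

        /* Spec-style aggregation: geometric mean of reference/measured
         * runtime ratios.  Compile with -lm. */
        #include <stdio.h>
        #include <math.h>

        int main(void) {
            /* reference runtime divided by measured runtime, per benchmark */
            double ratio[] = { 1.8, 2.3, 0.9, 1.4, 2.0 };
            int n = sizeof ratio / sizeof ratio[0];

            double log_sum = 0.0;
            for (int i = 0; i < n; i++)
                log_sum += log(ratio[i]);

            printf("composite score: %.2f\n", exp(log_sum / n));
            return 0;
        }

    The nice property of the geometric mean is that no single benchmark can dominate the composite, which matters if the suite mixes workloads of very different lengths.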
  • I saw 1.34 BogoMIPS on my screaming 386 without any cache :) It makes a wonderful heater since I plugged the processor in backwards. I figure it's doing more good this way.
  • Don't correct me if I'm wrong!

    QuakeIII is the most 31337 benchmark in existence. Don't you guys realize what Carmack was really doing when he did a Linux port of QIII?

    You need nothing other than QIII for reliable benchmarking of a system... and any other benchmarks just don't matter! 'Nuff said...

    :)
  • ...I do want a way to compare different processors/operating systems/video cards/etc. objectively without having to obtain a 3rd party's tools and pay for them... (Which I think you have to do with SPECint, right?)

    It would be great to have tools like that, and create a repository of the results.
  • Wouldn't it be a bit more helpful to have some benchmarks where no one knew what instructions were used to build them? Otherwise you could optimize hardware for the benchmarks at the expense of the rest of the workload. And an open source benchmark would be constantly improving, so a reading from one year could be completely different from one the next year, making it tough to compare new technology to what you already have.
  • As soon as hardware vendors learn they can make their hardware look helluva fast by watching for the sequence of instructions present in the open source benchmark program, they will just take their crappy, overpriced old ATI Rage 2MB video cards and have them wait for the '3Dbenchmark.start();' command, at which point they will report '3Dbenchmark.finished("only .000000001 second!");' This is not good.

    Unrelated note: A Modest Proposal was an essay written by Jonathan Swift that proposed that poor people should sell their babies for food. It was satirical and shocking, but most of all, very entertaining.

    "Assume the worst about people, and you'll generally be correct"

  • The only benchmarks that I've ever found useful are the ones that measure the performance of commonly used applications (for example all the great stuff on video cards at Tom's Hardware Page [tomshardware.com]). That's all I really care about. Benchmarks of arbitrary performance are only useful to most home users for whacking off to.

    Most companies that develop systems that might require some form of benchmarks are likely going to have to develop their own with prototypes of their application; I can't see anything arbitrary being very helpful in predicting how a particular system will perform in comparison with any other system.
