Facebook VP Slams Intel's, AMD's Chip Performance Claims
narramissic writes "In an interview on stage at GigaOm's Structure conference in San Francisco on Thursday, Jonathan Heiliger, Facebook's VP of technical operations, told Om Malik that the latest generations of server processors from Intel and AMD don't deliver the performance gains that 'they're touting in the press.' 'And we're, literally in real time right now, trying to figure out why that is,' Heiliger said. He also had some harsh words for server makers: 'You guys don't get it,' Heiliger said. 'To build servers for companies like Facebook, and Amazon, and other people who are operating fairly homogeneous applications, the servers have to be cheap, and they have to be super power-efficient.' Heiliger added that Google has done a great job designing and building its own servers for this kind of use."
You're Computin' for a Shootin' Mister (Score:5, Insightful)
You guys don't get it
Is it possible to take out a massive life insurance policy on Jonathan Heiliger?
To build servers for companies like Facebook, and Amazon, and other people who are operating fairly homogeneous applications, the servers have to be cheap, and they have to be super power-efficient.
I assure you, despite your misconception that the world revolves around you, everyone has those requirements, from the people who build supercomputers right down to the netbook I am typing on while watching Gurren Lagann.
Can we get like a panel of hardware engineers to have a discussion with this guy and can I get some popcorn?
WTF? (Score:1, Insightful)
Maybe the dude should have benchmarked before committing. How does he scope his projects, with brochures?
Hm... (Score:4, Insightful)
To build servers for companies like Facebook, and Amazon, and other people who are operating fairly homogeneous applications, the servers have to be cheap, and they have to be super power-efficient.
Hm, let's see... perhaps because Facebook and Amazon are niche markets? The average server isn't even going to need all that computing horsepower, and power efficiency is simply a drop in the bucket on most companies' electrical bills. The average server is going to be much more I/O intensive than CPU intensive unless you do cluster computing or render a lot of stuff. The average server, such as a web server or a file server, doesn't use that much CPU, and usually you are running 1-3 servers, not the hundreds that Facebook or Amazon would run.
And really, why is a VP complaining about this stuff? Is it that he can't afford either custom solutions or more servers?
Facebook's application is poorly coded (Score:4, Insightful)
I have heard from some reliable sources that Facebook and Twitter's backend applications are poorly written.
Are Intel's and AMD's claims overblown? Sure. What hardware manufacturer doesn't cherry-pick performance claims?
But I don't care what sort of hardware you throw at crap code, you are always going to get crap performance.
Something about his arguement doesn't work (Score:5, Insightful)
1) Facebook & Amazon need cheap, power efficient systems
2) Intel and AMD aren't measuring up with processors to power these systems
3) However, Google has systems appropriate for this use (presumably using Intel or AMD processors)
If that's his argument, then it would seem that the real conclusion is that Facebook can't build systems as good as Google's, even though they are using the same processor technology.
Re:Hm... (Score:3, Insightful)
Re:Something about his arguement doesn't work (Score:2, Insightful)
In addition, there seems to be something else wrong with his argument:
"To build servers for companies like Facebook, and Amazon, and other people who are operating fairly homogeneous applications, the servers have to be cheap"
Which he later follows up with the following insight
"There's a pretty simple answer for scaling infrastructure. It's, 'Don't be cheap,'"
so which one is it?
Re:WTF? (Score:5, Insightful)
Sun.... (Score:3, Insightful)
Re:You're Computin' for a Shootin' Mister (Score:3, Insightful)
Re:Facebook's application is poorly coded (Score:3, Insightful)
Sounds like a bunch of excuses to me (Score:5, Insightful)
Assuming that a solution was properly engineered, this should not have been a surprise.
Cheap, power-efficient, high-performance. Pick two.
Re:Facebook's application is poorly coded (Score:3, Insightful)
Not if your code is not tuned for this new "advanced hardware". Surely there are new compile flags to consider, and if you are not tuning your code for the new processor features it could very well be slower than before.
Re:You're Computin' for a Shootin' Mister (Score:3, Insightful)
And why have a PSU for each unit? Why not just run 12V power rails to each server and do the AC/DC conversion on a larger transformer further down the line, with larger batteries providing the backup for clusters of servers? After all, no PSU is cheaper than a PSU with just the 5V taken out.
Re:Facebook's application is poorly coded (Score:4, Insightful)
Developers have been known to trade off performance for development ease. Frequently the result is crap code. Yes, it performs like crap on both sets of processors. But if the application is CPU-limited (rather than IO or memory or...), then throwing faster CPUs at it ought to make it proportionally faster, no? Obviously they thought the previous performance was acceptable, is it unreasonable to think that buying CPUs marketed as 50% faster should give a 50% performance increase? Clearly crap code will still run like crap, but you ought to be able to throw more CPU power at it and get 150% of crap performance.
Rub a lamp, Heiliger (Score:5, Insightful)
NEWSFLASH! Customers are tightwads.
Performance/Reliability/Price.
Pick any two, Heiliger.
Re:WTF? (Score:5, Insightful)
I think we read different articles. He's not saying he didn't plan well enough, he's saying that Intel and AMD promise that the Gen Y processor is 35% faster than the Gen X processor, and he's not seeing anywhere near 35% in real-world performance. The 35% is a made-up number, but it doesn't matter what number they claim. He's probably correct. Manufacturers pull this stuff all the time. Look at the recent articles on battery life claims for notebooks. AMD came out and called BS on the whole thing and basically said if you guys don't stop lying through your teeth, the FTC is going to regulate us. From the perspective you are taking, that would mean we have to call AMD incompetent for not understanding how batteries work and not properly selecting them.
Manufacturers ALWAYS overstate claims in computer-related products. CRT actual inches vs viewable inches (thank you, LCDs, for finally being honest... about inches anyway... brightness and contrast however...). Computer speaker wattage being 1/2 or 1/4 of what's claimed. Power supply efficiency or wattage not measuring up to claims... you name it. He's calling out what he sees as bogus claims based on real-world experience in a demanding environment, the type of environment where one is always looking for better performance. I think we should get some more information before declaring him to be the problem, as I'm sure he has a whole team of people working on this issue.
What I'd like to see from him is some numbers. On this Intel (or AMD) rig, we get so many operations per hour/minute/whatever. On this new Intel (or AMD) rig, which they claim is 20% faster than the previous rig, we're only seeing this number of operations per hour, which amounts to only a 7% gain, so we're seeing 13 percentage points less than they are claiming. Again, numbers made up for example's sake. I'd also be very interested in what a typical rig of theirs looks like... X processor, Y RAM, what type of storage system it is connected to, etc. I think such numbers are vital to understanding the issues at hand. We all know that vendors will run the benchmarks that make their stuff look the best, and that is often not reflective of real-world performance. If I were Intel/AMD I'd be chiming in right about now, opening a dialog with Facebook and looking to see what the issues are. Facebook is a big customer with huge name recognition, and you want to be able to use them in your marketing as an example of your solution delivering the promised performance. I'm going to assume (I know, I know) that they are already working with the server vendor to try and see what's going on here.
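To make the kind of comparison I'm asking for concrete, here is a minimal sketch in Python. The function names and the throughput figures are made up for illustration (they just reproduce the 20% claimed vs 7% observed example above); nothing here comes from Facebook, Intel or AMD.

```python
def measured_gain(old_ops_per_hour, new_ops_per_hour):
    """Fractional throughput gain of the new rig over the old rig."""
    return new_ops_per_hour / old_ops_per_hour - 1.0


def shortfall_points(claimed_gain, old_ops_per_hour, new_ops_per_hour):
    """Percentage points by which the measured gain misses the vendor claim."""
    gain = measured_gain(old_ops_per_hour, new_ops_per_hour)
    return (claimed_gain - gain) * 100.0


# Hypothetical numbers: the vendor claims +20%, the benchmark shows +7%.
old_ops = 1_000_000
new_ops = 1_070_000

print(f"measured gain: {measured_gain(old_ops, new_ops):.1%}")
print(f"shortfall: {shortfall_points(0.20, old_ops, new_ops):.1f} points")
```

The point of keeping "gain" and "shortfall" separate is that a 7% gain against a 20% claim is a miss of 13 percentage points, not "13% slower".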
Re:WTF? (Score:2, Insightful)
Re:PHP (Score:3, Insightful)
Exactly. All these interpreted languages, even with some special tricks, will have serious scalability issues. At some point you have to look at the application and ask some serious questions.
so what about google then? (Score:3, Insightful)
I'm bemused that he implies the problems with his servers are due to Intel and AMD not delivering with their chips, yet at the same time he admires Google for how good a job they do in building out their machines.
He must be aware that Google uses Intel and AMD chips.
his reasoning just doesn't square.
Re:You're Computin' for a Shootin' Mister (Score:2, Insightful)
Well, I can tell you that I do not want cheap, shitty hardware with no features as servers.
This is all fine for companies like Facebook and Google that are in the primary business of running IT, and that wrote software which accommodates the shitty hardware they use.
However, other applications like standard business IT require highly resilient, highly manageable hardware which offers many features, stable parts supplies and management possibilities, and is built upon sturdy hardware that can withstand non-datacenter conditions of cooling and dust.
Re:WTF? (Score:5, Insightful)
I think we read different articles. He's not saying he didn't plan well enough, he's saying that Intel and AMD promise that Gen Y processor is 35% faster than Gen X processor, and he's not seeing anywhere near 35% in real world performance.
If the application was purely CPU bound, and Y wasn't giving me 35% more than X, I'd complain.
However, if it's a complex system like almost everything else, why would they expect their application to get 35% faster when there's probably 6 or 8 critical subsystems that could all be bottlenecks as well?
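A back-of-the-envelope way to see this is an Amdahl-style estimate: only the CPU-bound share of each request gets faster. The sketch below is an illustration under that assumption, with made-up CPU shares; the function name is mine, and real systems have overlapping bottlenecks that this simple model ignores.

```python
def overall_speedup(cpu_fraction, cpu_speedup):
    """Amdahl-style estimate of whole-application speedup.

    cpu_fraction: share of request time that is CPU-bound (0.0 to 1.0).
    cpu_speedup:  how much faster the new CPU is (1.35 means 35% faster).
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    # Time not spent on the CPU (disk, network, memory stalls) is unchanged.
    new_time = (1.0 - cpu_fraction) + cpu_fraction / cpu_speedup
    return 1.0 / new_time


# Hypothetical CPU shares for an application given a "35% faster" processor.
for fraction in (1.0, 0.5, 0.2):
    gain = overall_speedup(fraction, 1.35) - 1.0
    print(f"CPU-bound share {fraction:.0%}: overall gain {gain:.1%}")
```

Only the purely CPU-bound case gets the full 35%; once half the time goes to other subsystems, the overall gain drops to well under half of the advertised figure.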
Re:Facebook's application is poorly coded (Score:3, Insightful)
Facebook is written in PHP
There's your problem right there... ;)
Re:Something about his arguement doesn't work (Score:3, Insightful)
If that's his argument, then it would seem that the real conclusion is that Facebook can't build systems as good as Google's, even though they are using the same processor technology.
Google does have approximately 30x as many employees as Facebook, so it's not implausible that they've got a much greater ability to build in-house custom tech.
not just the CPU it's overall system performance (Score:3, Insightful)
This isn't just about the CPU, it's about overall system performance.
Despite improvements in CPU performance, memory and IO performance are lagging behind.
A modern SATA drive delivers about 90MB/sec (peak sequential read).
Some RAID controllers can do about 600-800MB/sec (peak sequential read).
An average AM2 (K10 core, 65nm) gets about 34,849MB/sec from L1, 12,169MB/sec from L2, 6,371MB/sec from L3, and 2,741MB/sec from DDR2-800 at 5-5-5-12.
Obviously Opterons scale a lot better, since they each have an onboard memory controller and additional HT links, which greatly increases bandwidth as you add more CPUs. However, adding more cores on the same die, which have to share a single memory controller, can cause starvation.
Another major issue is software parallelization. Writing parallel code is still a difficult problem. If your software doesn't parallelize well, it doesn't matter if you have 8, 16 or even 32 cores on a single die.
If you had an equal number of CPU cores and memory controllers you could achieve much better performance, however your relatively very slow storage subsystems would still be a major bottleneck.
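As a rough illustration of how wide that gap is, here is a minimal sketch that uses only the peak figures quoted above. The dictionary and function names are mine, the RAID entry takes the low end of the quoted 600-800MB/sec range, and the 1GB working set is an arbitrary assumption. Random access on a real disk would look far worse than these sequential numbers.

```python
# Peak sequential-read bandwidth in MB/sec, taken from the figures quoted above.
BANDWIDTH_MB_PER_SEC = {
    "L1 cache": 34_849,
    "L2 cache": 12_169,
    "L3 cache": 6_371,
    "DDR2-800 RAM": 2_741,
    "RAID controller": 600,  # low end of the quoted 600-800 range
    "single SATA drive": 90,
}


def seconds_to_read(megabytes, tier):
    """Time to stream `megabytes` from a tier at its peak bandwidth."""
    return megabytes / BANDWIDTH_MB_PER_SEC[tier]


def times_slower(tier, baseline="DDR2-800 RAM"):
    """How many times slower `tier` is than `baseline`."""
    return BANDWIDTH_MB_PER_SEC[baseline] / BANDWIDTH_MB_PER_SEC[tier]


working_set_mb = 1024  # hypothetical 1GB working set
for tier in BANDWIDTH_MB_PER_SEC:
    secs = seconds_to_read(working_set_mb, tier)
    print(f"{tier:>18}: {secs:8.3f} s, {times_slower(tier):6.2f}x slower than RAM")
```

On these figures a single SATA drive is roughly 30 times slower than main memory, so adding cores does nothing for a workload that is waiting on storage.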
PHP "extension" (Score:5, Insightful)
I once did a large project in which I took a large, slow site in PHP (it was pretty complicated, it was a CRM with a lot of custom business logic) and rewrote all the core functionality from PHP to C / C++, and made it a "module" of PHP. The rewriting was mostly simple translation -- literally removing all dollar signs, adding some types, and attempting to compile, and just fixing the compile errors until it would build. Then going back through it with a fine-tooth comb to track down all the memory leaks.
The speed increase from doing that is pretty surprising. Simple loops that do a bit of math or something speed up by 100 times, and a loop that creates and destroys an object within the loop will be 100,000 times faster. This is without actually trying to write fast C/C++ code, or avoiding creating and deleting the same thing over and over in a loop -- just pure dumb translation of the code.
At that point, the web site guys can keep tweaking and changing the web page in PHP just like before; but they load that module in the php.ini and then they have a basic library of stuff, like login_user() or get_user_balance() and etc, that are really fast and do all the heavy lifting.
I would be surprised if Facebook has not already done this. How to do it is well documented in several books, and there are lots of PHP modules written in C/C++ to look at for examples.
I suspect that Facebook's VP is right that AMD and Intel exaggerate their claims, but it is also generally true that most computer programs are more IO bound than you expect. This is not a reason to avoid something like I describe above; once you have the more complete control of programming in C, IO issues may be easier to find and address.
He also mentions that the servers offered by Dell and others aren't very power efficient or practical for him, and he mentions Google designing their own servers. Nothing Google did was really rocket science, from what we know, and Facebook probably doesn't have to go as far as they did to get a reasonable benefit. It's not that hard to set up motherboards to run without a case, booting off the network with no hard drive attached.
Re:You're Computin' for a Shootin' Mister (Score:2, Insightful)
So let me get this straight, the Vice President of a web company is criticizing the hardware guys in two of the world's biggest chip makers?
He's not criticising their technical know-how, he's criticising them for not knowing what their web company customers want.
Since he himself is one of those customers, it's not too unlikely that he knows what he's talking about.
Re:You're Computin' for a Shootin' Mister (Score:5, Insightful)
When you need the cheapest, most power-efficient servers you can find, to the point where you criticize your suppliers publicly, you're not willing to pay for the most expensive cables out there.
Besides, all the seal clubbers are buying those up.
Re:Facebook's application is poorly coded (Score:3, Insightful)
throwing proportionally faster CPUs at *good* code should make it proportionally faster.
Crap code.... probably not. For example, I once had to improve the performance of an app. The app spent most of its time context switching from one thread to another: more time was taken up stopping a thread, switching to another, refilling the cache lines, and so on than was spent processing the data! Think what a faster processor would do here - the CPU would process the little bit of data it was given faster, thus providing much more CPU time for context switching.
Similarly with other aspects of modern code - relatively little of it is spent spinning CPU cycles. I'd say more was spent dealing with memory IO (as there is a lot of RAM used nowadays, getting that data to and from the CPU is, relatively speaking, slow as treacle) so it wouldn't matter if you could crunch the data faster if you still had to wait for it to be delivered to you.
Then we put more stuff on the network, and connect to it via Web services and the like, and the amount of CPU power required gets less and less relevant.
I'd say the single best thing you can do to get good performance, and therefore energy efficiency and cheapness of resources, is to write efficient code that requires few resources itself. Even if it takes you longer to do the job, tough on you - there's just you as a programmer but millions of users, so the extra time spent developing at a lower level (instead of pointy-clicking in the IDE) is time well spent.
If Facebook's code could be made 10% more efficient, they'd require 10% fewer servers, with all the reduced energy bill that entails. But the Facebook chap doesn't care about that - that'd cost him programmer time, and that costs short-term money! Far better for him to whinge that Intel and AMD aren't fixing his shit for him instead.
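To put the server-count arithmetic in concrete terms, here is a minimal sketch. It assumes the fleet is purely CPU-limited and that "10% more efficient" means each request costs 10% less CPU time; the request rate, core count and per-request cost are invented for illustration and are not Facebook's numbers.

```python
import math


def servers_needed(requests_per_sec, cpu_ms_per_request, cores_per_server):
    """Servers required if CPU time is the only limit (a simplifying assumption)."""
    cpu_ms_available_per_server = cores_per_server * 1000  # per wall-clock second
    cpu_ms_demanded = requests_per_sec * cpu_ms_per_request
    return math.ceil(cpu_ms_demanded / cpu_ms_available_per_server)


# Hypothetical load: 1,000,000 requests/sec on 8-core servers.
before = servers_needed(1_000_000, cpu_ms_per_request=10, cores_per_server=8)
after = servers_needed(1_000_000, cpu_ms_per_request=9, cores_per_server=8)

print(f"before: {before} servers, after: {after} servers")
print(f"saved: {before - after} servers ({(before - after) / before:.0%})")
```

With these made-up inputs, shaving 10% off the CPU cost per request removes 10% of the servers, which is the saving the programmer time has to be weighed against.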
Re:You're Computin' for a Shootin' Mister (Score:3, Insightful)
Re:Facebook's application is poorly coded (Score:2, Insightful)
Not "unreasonable", but possibly naive and inexperienced, depending on the details.
Crap code that bottlenecks a CPU often will not scale as well as good code. It involves bad synchronization, other contention, spinning loops, and memory bandwidth limits. It is often NUMA-unfriendly. It often interacts poorly with other resources, such as I/O.
Re:You're Computin' for a Shootin' Mister (Score:3, Insightful)
Re:Hm... (Score:3, Insightful)
Kind of depends on how you read 'niche.' Yes, there is a relatively small number of companies (customers) that have such requirements, but if each of them has a massive, massive number of servers, then I wouldn't call that niche any more, because it still represents a large turnover.
Re:You're Computin' for a Shootin' Mister (Score:3, Insightful)
"Google wants to see reference specs that include options for bare motherboards to slide right into your basic 42 unit rack with IO, disk and power all pulled out to the raw basics so Google can decide how to manage the bits rather than having stock OEM boards with such limited options."
Sounds a lot like a VME backplane...
Re:You're Computin' for a Shootin' Mister (Score:3, Insightful)
that guy is an ass.
the latest generations of server processors from Intel and AMD don't deliver the performance gains that 'they're touting in the press'
then
Google has done a great job designing and building its own servers for this kind of use
I wonder who makes the server processors for Google's servers. Hmmm.....
Re:You're Computin' for a Shootin' Mister (Score:3, Insightful)
Most battery UPSes upconvert the 12VDC to 120VAC to provide a standard power supply that you can plug anything into. That's because most of them run off standard boat or motorcycle 12V batteries which you can get at your local car parts store. Diesel or gasoline UPSes are electric generators and usually cost a *lot* more. They make sense for keeping an office building powered, but not for keeping just a computer or thirty up. And that's above and beyond the power losses from transmitting 12V over a distance that you mention.
I can see right away why it'd be cheaper to simply design a system to run off 12V directly and convert to 5V internally, and to have the battery right in that system. First, you don't have to pay for the electronics in a UPS which convert the battery's 12VDC to 120VAC. Second, you don't lose energy as heat from powering those electronics, from spinning the fans to keep them cool, or in transmission. A much higher proportion of the battery's power gets used to actually power the computer. The electronics which do the conversion from 12VDC to 5VDC are *much* cheaper, and less power intensive, than electronics that can increase the voltage, let alone convert it to alternating current.
Think of it this way: it's basically a laptop, only without the keyboard, screen, video card, and with 8 memory slots and dual CPUs, and provisioning for two 3.5" hard drives. The system runs directly off the battery, and the power supply just charges the battery.
Also, with a battery in each system, adding computers to the matrix doesn't reduce the length of time that you get on battery. I have a media center PC that's connected to a UPS, for example. The UPS is just running the computer and the satellite receiver. In that configuration, it lasts about 1h without mains. If I were to plug the TV into it, it'd last about 25m. While you're operating on a *much* larger scale, the same would hold true for a centralized UPS. Each system you add reduces the overall effectiveness of the UPS by reducing the amount of time it can power the works without mains. By putting the battery directly on the server, you can add computers without diminishing this capacity. Your computing capacity in the event of power interruption scales up linearly, rather than hitting diminishing returns and a theoretical maximum limit.
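The scaling argument can be sketched with a very simple model in which runtime is battery capacity divided by load. That linear model is an assumption (real batteries deliver less usable energy at higher discharge rates, and conversion losses are ignored), and the capacities and wattages below are invented for illustration.

```python
def shared_ups_runtime_min(ups_capacity_wh, server_load_w, servers):
    """Minutes of runtime when one central UPS feeds every server."""
    return ups_capacity_wh / (server_load_w * servers) * 60.0


def per_server_runtime_min(battery_wh, server_load_w):
    """Minutes of runtime when each server carries its own battery."""
    return battery_wh / server_load_w * 60.0


# Hypothetical figures: 200W servers, a 2000Wh central UPS, 50Wh onboard batteries.
for servers in (10, 20, 40):
    shared = shared_ups_runtime_min(2000, 200, servers)
    onboard = per_server_runtime_min(50, 200)
    print(f"{servers:3d} servers: shared UPS {shared:5.1f} min, onboard battery {onboard:5.1f} min")
```

In this model the shared UPS runtime halves every time the server count doubles, while the onboard-battery runtime stays flat no matter how many machines are added.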
Re:Well I suppose... (Score:3, Insightful)
Really? What sort of test was it?
We took a Java application off an E6900 using 35% of 48 1.35GHz US-IV cores. We put it on a T5240 with 16 1.4GHz cores and saw it use only 14% of the machine, with improved user response time.
We also ran a database benchmark for some tests we were running between some AIX and Linux boxes, and threw it against a T5240 running Oracle 11g for comparison. Because it was predominantly a single-threaded operation it ran slower than the 2.2GHz Power5 LPAR, but the overall difference was about the same ratio as the difference in clock speeds. The thing to note was that the machine was only a few percent utilised, so we could have run another 16 or so instances and coped easily.
These machines are workhorses. Granted, you need the right workload but highly parallel/highly transactional work like java web applications or web serving absolutely fly.
Re:You're Computin' for a Shootin' Mister (Score:3, Insightful)
Wrong.
He is criticizing, in the bits in TFS, two groups:
1) The marketing guys in two of the world's biggest chip makers (he's not complaining that the chips are flawed from an engineering perspective, he is complaining about the claims made for the chips' performance, which apparently conflict with Facebook's experience in testing them), and
2) The people setting the design goals (not, again, the engineers) at the companies making servers, complaining that they are doing a bad job of what he sees as a major need (which is, of course, also the particular thing that Facebook needs), and that Google does a better job of building servers for that need (a complaint which would be more effective at changing behavior at server manufacturers if it was followed up by Facebook going to Google to get Google to build them servers.)
Why? His complaints aren't directed at engineers.