Facebook VP Slams Intel's, AMD's Chip Performance Claims
narramissic writes "In an interview on stage at GigaOm's Structure conference in San Francisco on Thursday, Jonathan Heiliger, Facebook's VP of technical operations, told Om Malik that the latest generations of server processors from Intel and AMD don't deliver the performance gains that 'they're touting in the press.' 'And we're, literally in real time right now, trying to figure out why that is,' Heiliger said. He also had some harsh words for server makers: 'You guys don't get it,' Heiliger said. 'To build servers for companies like Facebook, and Amazon, and other people who are operating fairly homogeneous applications, the servers have to be cheap, and they have to be super power-efficient.' Heiliger added that Google has done a great job designing and building its own servers for this kind of use."
Re:You're Computin' for a Shootin' Mister (Score:4, Informative)
Can we get like a panel of hardware engineers to have a discussion with this guy and can I get some popcorn?
Slashdotters might want to take a look at the details of the Google servers [cnet.com] to see what Heiliger is looking for. There's also a video tour. [youtube.com]
Re:Hm... (Score:3, Informative)
Hm, let's see... perhaps because Facebook and Amazon are niche markets?
-Maybe-. Even if they are a niche market, they're a big enough one to hold the attention of the big chipmakers.
A traditional business model might use large orders, especially advance orders, to offset or defray the cost of setting up a production line or facility, and get most of the profit from smaller sales. Or they may choose only to do production runs for large, inherently profitable orders. Even in a firing-from-the-hip model, large customers cost less per unit in marketing and sales than do smaller ones, very much so when compared to the general public. And of course there's plenty of wiggle room between extremes. So depending on the diversity of the market and the choice of business model, big customers range from important to desirable. Naturally, in a niche market large customers have a greater importance, since smaller sales are fewer.
Presumably, AMD and Intel are selling servers to the likes of Amazon and Facebook 'cause they think it's profitable. If it is a niche market, keeping those guys happy is paramount to profitability.
(I don't think the server farm market is really a niche, tho'. But I dunno; I don't keep up with such things.)
And really, why is a VP complaining about this stuff? Is it that he can't afford custom solutions, or that he doesn't want to spend the money buying more servers?
Well, because we asked. Well, not "we" as such, but someone asked him and he answered. It sounds like he was answering honestly and openly. I've no problem with that.
Re:Well I suppose... (Score:3, Informative)
The two points are somewhat independent of each other. The second, I suspect, is due to there being no standard for power-efficient servers. Google did it by running single-voltage power supplies. A standard around something like this would be useful, and not just for servers, I suspect.
Re:Facebook's application is poorly coded (Score:5, Informative)
Facebook is written in PHP; there are no compile flags.
Apache and the PHP engine have plenty of compile flags, not to mention whatever the database is.
Re:WTF? (Score:2, Informative)
I have some sympathy for this guy. Some years ago, I built a fileserver using the best SATA RAID (hardware RAID) cards I could find (~$300) from major manufacturers, and enterprise disks (specified for use in RAID systems).
Performance absolutely sucked. The cards were fast enough if I tried to read/write single large files, but when reading/writing large numbers of small files, they were very slow. The first manufacturer's card was appallingly slow. I replaced it with another manufacturer's card and performance was merely slow.
I followed all the manufacturers' recommendations, and I communicated with one manufacturer on a Linux RAID mailing list, but was never able to get anything remotely like acceptable performance. For comparison, I later built a fileserver around an old (sub-1GHz) PC using software RAID and was able to get at least the same performance.
I was only building one machine, so I did not have the luxury of benchmarking it.
Re:Hm... (Score:5, Informative)
As someone who designs and deploys large storage environments for a living, I call BS. While the current generation of HBAs are 8Gb FibreChannel, I would say that the "average server" (as you put it) could happily live on a 1Gb HBA. Recall that almost all servers, or at least those you care about, have DUAL HBA connections to their respective storage. So that's actually 2Gb of storage connectivity. Sure, there are servers which have multiple HBAs, or drive their HBAs harder, such as database servers or backup/media servers. Most servers today are deployed with dual 4Gb HBAs, as the 8Gb SFPs/optics are still quite pricey, and you cannot, in all seriousness, purchase 1 or 2Gb FC HBAs anymore.
Even as we deploy VMware-based servers, the VMware servers themselves tend to be more memory/CPU strapped than I/O strapped.
It would be very rare, almost impossible, for a server to be driving line-rate HBAs with plenty of headroom still left in the CPU. Even basic test tools like IOmeter require significant CPU usage to drive an HBA to capacity, and that's when writing/reading all zeros: the tool doesn't actually need to do anything with the data, as a database server would if it were requesting 2Gb/s from a disk array and then had to join/sort/add/whatever the tables retrieved.
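To put rough numbers on that: the 1/2/4/8Gb FC generations use 8b/10b encoding, so each nominal "Gb" works out to roughly 100 MB/s of usable payload per direction. A quick sketch of the arithmetic (nominal figures, not measurements):

```python
# Rough usable bandwidth for FibreChannel links: 8b/10b encoding means
# roughly 100 MB/s of payload per nominal "Gb" of line rate.
FC_MB_PER_S = {1: 100, 2: 200, 4: 400, 8: 800}  # nominal, per direction

def dual_hba_bandwidth(gbits_per_hba: int, hbas: int = 2) -> int:
    """Aggregate nominal bandwidth (MB/s) across multipathed HBAs."""
    return FC_MB_PER_S[gbits_per_hba] * hbas

# A server with dual 4Gb HBAs has ~800 MB/s of nominal storage
# bandwidth -- far more than most "average" workloads actually drive.
print(dual_hba_bandwidth(4))  # 800
```

Real-world numbers will be lower once protocol overhead and array latency are factored in, but the order of magnitude is the point.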
Re:You're Computin' for a Shootin' Mister (Score:4, Informative)
I think they run AC to the row or rack of servers, then have just one super-efficient PSU powering all the servers in a rack rather than 42 separate power supplies (plus UL enclosures, connectors, extension cords, etc., etc.).
Essentially Google builds "rack-sized" blade centers... or is at least catching up to what IBM and HP are doing, but on a bigger scale: full racks or multiple racks managed at once rather than just one chassis.
I do agree that server makers aren't thinking "big enough" with things like their blade lines. Google wants to see reference specs that include options for bare motherboards to slide right into your basic 42U rack, with I/O, disk and power all pulled out to the raw basics, so Google can decide how to manage the bits rather than having stock OEM boards with such limited options. Google wants to manage a "rack" as a single machine and optimize power and parts across 40 servers as one group, not 40 separate little systems.
Surely that's obvious (Score:4, Informative)
They collect a large amount of data on people and mine that for marketing information to turn around and target those same users.
It's the same model as Google's.
Re:Well I suppose... (Score:2, Informative)
>None of these offer much better performance. None.
There are IBM and Sun systems that are in an entirely different league, in terms of IO and memory bandwidth, than any Intel- or AMD-flavored system.
Depends on 'headroom' of other subsystems. (Score:2, Informative)
Not necessarily, no.
It's all about how CPU limited the workload is.
You might be running a program that's CPU-limited on one processor, then upgrade the processor and suddenly discover that instead of being CPU-bound, you're now memory-bound. Or I/O-bound. Or whatever.
Point is, just because you've hit the wall in terms of CPU doesn't mean you'll get a 50% improvement with a 50% increase in CPU ... you'll only get that if all the rest of the server's subsystems have 50% headroom to spare. And in most cases they don't: one of them will hit the performance wall before you return to being CPU-bound with the shiny new processor.
There are exceptions to this -- renderfarms, for instance, or some distributed HPC stuff -- where you really can reasonably expect to get 50% more performance out of 50% more CPU, but they're exceptions, not the rule.
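The bottleneck-shifting argument can be sketched with a toy model: treat the server as a set of subsystems, each with some capacity, where overall throughput is capped by the slowest one. The capacities below are illustrative numbers, not measurements:

```python
def throughput(capacities: dict) -> float:
    """A workload runs only as fast as its most saturated subsystem."""
    return min(capacities.values())

# Illustrative relative capacities for one workload on one server.
before = {"cpu": 100, "memory": 120, "io": 110}
after = dict(before, cpu=before["cpu"] * 1.5)  # 50% faster CPU

print(throughput(before))  # 100: CPU-bound
print(throughput(after))   # 110: now I/O-bound, so only a 10% gain
```

A 50% CPU upgrade bought 10% here, because memory and I/O had nowhere near 50% headroom. The renderfarm exception is the case where the other subsystems' capacities are effectively unbounded relative to the CPU.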
Strange... (Score:5, Informative)
Since when do we listen to manufacturers' claims? You take the new hardware, stress-test it with your custom software, record the results, and plan servers accordingly. How hard is it, really, to commission a server design that meets your needs and then QA some prototypes?
Re:WTF? (Score:1, Informative)
It may be that the procs themselves are performing close to the advertised improvements (not sure where the 35% improvement figure comes from for both Intel and AMD); it's just that bottlenecks elsewhere are stopping that from being seen.
For example, if memory bandwidth is important watch out for the Nehalem memory clockrate dropoff...
http://blog.scottlowe.org/2009/05/11/introduction-to-nehalem-memory/
BTW, the recent Opteron/Xeon improvements are mostly about the number of cores in one socket at the same/similar clock speed and the same power use, so IF the code multithreads well, it should see most of those gains.
So if performance isn't adequate, should you buy more since the app scales so well? :^)
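The "IF the code multithreads well" caveat is just Amdahl's law. A sketch of how much of a core-count doubling actually shows up, for a few assumed parallel fractions (illustrative, not measured):

```python
def amdahl(cores: int, parallel_fraction: float) -> float:
    """Speedup over one core when only parallel_fraction of the
    work can spread across cores (Amdahl's law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# Gain from going 4 -> 8 cores at the same clock:
for p in (1.0, 0.9, 0.5):
    gain = amdahl(8, p) / amdahl(4, p)
    print(f"parallel fraction {p:.0%}: {gain:.2f}x")
# Perfectly parallel code sees the full 2.00x from doubling cores;
# code that is only 50% parallel sees roughly 1.11x.
```

Which is exactly why a per-socket "35% improvement" headline can evaporate on code with a meaningful serial fraction.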
I call BS too ...
Re:WTF? (Score:1, Informative)
Agreed, and make sure your OS is tuned/set up for rapid I/O ... http://en.wikipedia.org/wiki/Anticipatory_scheduling or the 2.6 standard http://en.wikipedia.org/wiki/CFQ. Maybe the system was using the deadline scheduler on an old 2.6 or 2.4 branch.
Facebook might need OS and network engineering skills in-house to optimize their server hardware/software setups. It seems like they've got performance issues at every level and are trying to cram in a bigger CPU, hoping it will scale with the problem.
Re:You're Computin' for a Shootin' Mister (Score:5, Informative)
No, they don't [cnet.com]. They use motherboards built to their own specification that require only 12V power. This power is supplied by the server's own PSU, which takes 240V input. The PSU hooks into a 12V sealed lead acid battery, implementing UPS functionality (there is no centralized UPS).
I think it's a very elegant design.
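The appeal is easy to sketch in terms of conversion losses: power-path efficiencies multiply, so skipping the double-conversion UPS and the multi-rail PSU saves at every stage. The percentages below are illustrative assumptions, not Google's published figures:

```python
def delivered_fraction(*stage_efficiencies: float) -> float:
    """Fraction of wall power reaching the board after a chain of
    conversion stages (stage efficiencies multiply)."""
    out = 1.0
    for eff in stage_efficiencies:
        out *= eff
    return out

# Conventional path: double-conversion UPS (~90% assumed) feeding a
# typical multi-rail PSU (~80% assumed).
conventional = delivered_fraction(0.90, 0.80)

# Google-style path: one efficient 12V-only PSU (~92% assumed); the
# battery sits on the DC side, so there's no UPS conversion stage.
google_style = delivered_fraction(0.92)

print(f"{conventional:.0%} vs {google_style:.0%} of wall power delivered")
```

Under those assumptions the single-conversion design delivers roughly 20 points more of the wall power to the motherboard, which at datacenter scale is an enormous amount of electricity and cooling.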
Re:Facebook's application is poorly coded (Score:5, Informative)
# hdparm -Tt /dev/sdc
/dev/sdc:
 Timing cached reads:   5120 MB in  2.00 seconds = 2562.04 MB/sec
 Timing buffered disk reads:   84 MB in  3.02 seconds =  27.77 MB/sec
# hdparm -i /dev/sdc | grep Model
 Model=ST3200822A, FwRev=3.01, SerialNo=xxxxxx
# hdparm -Tt /dev/sda
/dev/sda:
 Timing cached reads:   6078 MB in  1.99 seconds = 3052.95 MB/sec
 Timing buffered disk reads:  338 MB in  3.01 seconds = 112.22 MB/sec
# hdparm -i /dev/sda | grep Model
 Model=ST31000333AS, FwRev=SD1B, SerialNo=xxxxxx
It's not a full order of magnitude faster, but at 112MB/s the newer drive is still about four times faster. And these are both magnetic discs, not SSDs.
Re:Hm... (Score:3, Informative)
As someone who designs and deploys large storage environments for a living,
Then you should know that throughput is not the only (or - typically - the most important) measure of IO performance.
Typical computing tasks tend to be I/O bound - specifically, bound by random I/O performance. To a large degree this is due to the massive disparity in performance improvements between CPUs and storage.
Re:Facebook's application is poorly coded (Score:3, Informative)
That may be so. The new drive may indeed have four times the raw read throughput. But how much larger are they? Five times.
And even more tellingly, look at the seek performance. I looked up those two drives you mentioned. You'll find it's unchanged at 8.5ms. So we're seeking at the same speed, for more data.
In practice, then, in terms of throughput per provisioned GB we are about 24% worse off, and in terms of seek capacity per provisioned GB we are FIVE times worse off today!
To illustrate what I mean, based on those numbers above: slurping 10TB off an idealised JBOD array of those newer drives (ten in parallel) would take about 8,900 seconds; slurping 10TB off an idealised array of the older drives (fifty in parallel) would take only about 7,200 seconds. A similar (but far worse) story applies to random seek performance, especially for busy transaction systems.
One might challenge the exact figures, but it doesn't matter - the point is, drive size is an important gotcha in storage performance optimisation today, and it's because performance has not really kept pace with drive size. The issue is not offset by the bigger caches they're turning up with, although that helps for some workloads.
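For what it's worth, the arithmetic can be checked directly from the hdparm numbers above, taking capacities from the Seagate model numbers (ST3200822A is a 200GB drive, ST31000333AS is 1TB):

```python
old = {"capacity_gb": 200, "mb_per_s": 27.77}    # ST3200822A
new = {"capacity_gb": 1000, "mb_per_s": 112.22}  # ST31000333AS

def drain_time_s(drive: dict) -> float:
    """Seconds to stream a full drive end-to-end at its sequential rate."""
    return drive["capacity_gb"] * 1000 / drive["mb_per_s"]

# Per provisioned GB, the newer drive takes ~24% longer to stream out,
# even though its raw throughput is ~4x higher.
ratio = drain_time_s(new) / drain_time_s(old)
print(f"{ratio - 1:.0%} more time per provisioned GB")

# A fixed-size pool read in parallel drains at the speed of one drive:
print(round(drain_time_s(new)))  # ~8,900 s (e.g. ten 1TB drives)
print(round(drain_time_s(old)))  # ~7,200 s (e.g. fifty 200GB drives)
```

The seek story follows the same shape: at an unchanged ~8.5ms seek, five times the data per spindle means a fifth of the random-I/O capability per provisioned GB.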
We haven't talked dollars. The cost is important, but that's another dimension. Let's keep this to engineering chatter.
So what happens in shops that need really high performance? Well, if it's an application with lots of random reads but with hotspots, then cache will do nicely. But for raw random write performance - i.e. heavy transaction-processing applications - it's gotta be more 15K RPM spindles at lower capacity. Or go crazy and go solid state, but that's another party.
Re:Well I suppose... (Score:3, Informative)
POWER6 absolutely demolishes Nehalem. Period. 4.7GHz (clocked up to 6GHz internally), faster per cycle than any x86 processor currently on the market.
According to the SPEC CPU2006 benchmarks, a 3.33GHz Nehalem provides nearly identical performance to a 5GHz POWER6 (at 8 cores each).
Re:Well I suppose... (Score:1, Informative)
I think you mean T5140. T5240 is the 2U version (which can have up to 256GB RAM and 16x 300GB SAS drives).