Quad Core Intel Xeon 53xx Clovertown
by Johan De Gelas on December 27, 2006 5:00 AM EST. Posted in IT Computing
Render Servers
To get a better idea of how the different server platforms compare, we ran some rendering tests too. Most of our tests (MySQL, DB2, and SPECjbb2005) are very integer intensive, whereas render tests are floating point intensive. We start with a simple Cinebench 9.5 benchmark (on 32-bit Windows 2003), which is based on Maxon's Cinema 4D rendering engine.
Cinebench 9.5 (score; higher is better)
CPU | 1280x720
Quad Opteron 880 2.4 | 1720
Dual Quad Xeon E5345 2.33 | 1686
Dual DC Xeon 5160 3.0 | 1456
Quad Xeon E5345 2.33 | 1272
Quad DC Xeon 7130M 3.2 | 1169
Dual Opteron 880 2.4 | 1121
Dual DC Xeon 5060 3.73 | 1079
Dual DC Xeon 7130M 3.2 | 889
Four 2.4GHz Opteron cores are a bit slower than four 2.33GHz Xeon cores, but when we look at the eight-core scores the Opteron is a bit faster. Again, it seems that the Opteron system scales better.
Cinebench 9.5 (32 bit): per core performance
CPU | Quad core | Octal core | Scaling 4->8
Xeon 7130 3.2 GHz | 889 | 1272 | 43%
Xeon 5345 2.33 GHz | 1169 | 1686 | 44%
Opteron 880 2.4 GHz | 1121 | 1720 | 53%
Opteron 890 2.8 GHz | 1297 | 1990 | 53%
Xeon 5160 3 GHz | 1456 | N/A | N/A
Xeon scaling 2.33 -> 3 GHz | 25% | |
Opteron 880 vs. Quad core Xeon 2.33 GHz | -4% | 2% | 21%
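The scaling column can be reproduced from the two score columns. What follows is a minimal Python sketch rather than the tooling used for the review: it takes the scores exactly as printed in the table above, and the helper name `gain_percent` is made up for this illustration.

```python
def gain_percent(base, improved):
    """Relative gain of `improved` over `base` in percent (scores: higher is better)."""
    return (improved / base - 1.0) * 100.0


# (quad core score, octal core score), copied from the table as printed
scores = {
    "Xeon 7130 3.2 GHz": (889, 1272),
    "Xeon 5345 2.33 GHz": (1169, 1686),
    "Opteron 880 2.4 GHz": (1121, 1720),
    "Opteron 890 2.8 GHz": (1297, 1990),
}

# Gain from doubling the core count, per platform
scaling = {cpu: gain_percent(quad, octal) for cpu, (quad, octal) in scores.items()}

for cpu, pct in scaling.items():
    print(f"{cpu}: {pct:.0f}% gain going from 4 to 8 cores")

# Clock scaling row: four Xeon cores at 2.33 GHz vs. four Xeon 5160 cores at 3 GHz
clock_gain = gain_percent(1169, 1456)
print(f"Xeon 2.33 -> 3 GHz: {clock_gain:.0f}%")
```

A system that scaled perfectly would show a 100% gain when the core count doubles, so every platform in the table falls well short of that in this benchmark.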
Why do we analyze this in so much detail? Cinebench, like most renderers, couldn't care less about the memory subsystem. We tested our Clovertown system with two and with four memory channels, and the results were exactly the same. Therefore, we are pretty sure the slightly worse scaling of the Xeon E5345 is not a result of limited bandwidth or higher latency. There must be something else that limits scalability, and that something else is most likely cache coherency.
Cinebench 9.5 (32 bit): per socket performance
Comparison (dual socket) | Quad core Xeon advantage
Quad core Xeon 2.33 GHz vs. Xeon 5160 | 16%
Quad core Xeon 2.33 GHz vs. Opteron 880 | 50%
Quad core Xeon 2.33 GHz vs. Opteron 890 | 30%
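The per socket percentages appear to follow from the Cinebench scores of the two-socket configurations listed earlier. The sketch below is an illustration under that assumption; the name `advantage_percent` is hypothetical, and the Opteron 890 figure is assumed to be the four-core (two-socket) score from the per core table.

```python
def advantage_percent(score, reference):
    """How much higher `score` is than `reference`, in percent."""
    return (score / reference - 1.0) * 100.0


# Cinebench 9.5 scores for two-socket systems, taken from the tables above
XEON_E5345_DUAL = 1686  # two quad core Xeon E5345 (eight cores)

rivals = {
    "Xeon 5160": 1456,    # two dual core Xeons (four cores)
    "Opteron 880": 1121,  # two dual core Opterons (four cores)
    "Opteron 890": 1297,  # assumed: four-core score from the per core table
}

advantages = {
    name: advantage_percent(XEON_E5345_DUAL, score) for name, score in rivals.items()
}

for name, pct in advantages.items():
    print(f"Quad core Xeon 2.33 GHz vs. {name}: {pct:.0f}%")
```

Comparing per socket rather than per core puts eight Xeon cores against four cores on the other platforms, which is why the gaps are much larger here than in the per core table.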
Cinebench is popular because it is an easy benchmark to run, but 3dsmax is a very popular real-world application. We tested with 3dsmax version 9, which has been improved to work better with multi-core systems. We used the "architecture" scene, which has been our favorite benchmarking scene for years. All tests were done with 3dsmax's default scanline renderer with SSE enabled, and we rendered at HD 720p resolution. We measure the time it takes to render frames 20 to 22.
3DS Max 9 Architecture (render time; lower is better)
CPU | 1280x720
Quad Opteron 880 2.4 | 273
Dual Quad Xeon E5345 2.33 | 308
Dual DC Xeon 5160 3.0 | 309
Quad Xeon E5345 2.33 | 392
Dual DC Xeon 5060 3.73 | 419
Quad DC Xeon 7130M 3.2 | 443
Dual Opteron 880 2.4 | 454
This cannot be a coincidence anymore: a single Xeon E5345 leaves the dual Opteron 880 far behind, but a dual Xeon E5345 trails the quad Opteron. It is not only the application that matters; the dataset has an impact too. Take a look at the table below, where we rendered at both 720p and 480p resolution.
3DS Max 9 Architecture (render time; lower is better)
CPU | 720x480 | 1280x720
Quad Opteron 880 2.4 | 137 | 273
Dual Quad Xeon E5345 2.33 | 138 | 308
Dual DC Xeon 5160 3.0 | 133 | 309
Quad Xeon E5345 2.33 | 167 | 392
Dual DC Xeon 5060 3.73 | 188 | 419
Quad DC Xeon 7130M 3.2 | 201 | 443
Dual Opteron 880 2.4 | 196 | 454
Scaling Opteron 880 | 43% | 66%
Scaling Xeon E5345 | 21% | 27%
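Because these results are render times, the scaling rows are the ratio of the four-core time to the eight-core time, minus one. A minimal Python sketch that reproduces them from the table above (the helper name `speedup_percent` is an assumption made for this example):

```python
def speedup_percent(time_before, time_after):
    """Speedup in percent when render time drops from `time_before` to `time_after`."""
    return (time_before / time_after - 1.0) * 100.0


# (four-core render time, eight-core render time) per resolution, from the table above
render_times = {
    "Opteron 880": {"720x480": (196, 137), "1280x720": (454, 273)},
    "Xeon E5345": {"720x480": (167, 138), "1280x720": (392, 308)},
}

# Speedup from doubling the core count, per platform and resolution
scaling = {
    cpu: {res: speedup_percent(t4, t8) for res, (t4, t8) in by_res.items()}
    for cpu, by_res in render_times.items()
}

for cpu, by_res in scaling.items():
    for res, pct in by_res.items():
        print(f"{cpu} at {res}: {pct:.0f}% faster with eight cores")
```

On both platforms the gain from the extra four cores is noticeably smaller at 720x480 than at 1280x720, which is the resolution effect discussed in the text.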
As you can see, the resolution at which you normally render determines how much you benefit from eight cores. Using an octal core machine to render relatively low resolution movies is like driving a potent 8 cylinder engine in a crowded city: all the horsepower goes to waste as you accelerate for a short period and then hit the brakes when approaching a red light. The same is true for rendering: unless you are rendering a complex scene at high resolution, the multi-core engine can never show its full potential. Thanks to better scaling, the quad Opteron platform still has a small advantage.
3DSMax 9 (32 bit): per socket performance
Comparison (dual socket) | Quad core Xeon advantage
Quad core Xeon 2.33 GHz vs. Xeon 5160 | 0%
Quad core Xeon 2.33 GHz vs. Opteron 880 | 47%
Quad core Xeon 2.33 GHz vs. Opteron 890 | 27%
However, when it comes to price/performance, it is not the quad core Xeon or the Opteron that wins, but most likely the Xeon 5160. It is more flexible, as it will outperform the quad core Xeon in any scene that is not as complex as architecture and at resolutions lower than 720p. Only if your scenes use radiosity lighting do we see a clear advantage for the quad core Xeon: we noticed that it was up to 40% faster in such scenes.
15 Comments
Antinomy - Wednesday, March 7, 2007
A great review, very interesting. But there are a few things to mention. There is a mistake in the Cinebench results. In the overall table the uni Clovertown system got 1272 points, but in the next one (per core performance) it got 1169. The result was swapped with that of the Xeon 7130. And a comment about the scalability extrapolation: the 2.33 Clovertown can hardly be compared with the 3.0 dual Woodcrest because the systems are organized differently. These motherboards have two independent FSBs, which means that the two Woodcrests get twice the peak memory bandwidth. This must have some influence on the result. Also, the 4 channel memory mode provides a 5% increase in real bandwidth versus 2 channel, so we can't say that these applications do not suffer from a lack of memory bandwidth.
It would be interesting to see a test of a uni Woodcrest system and tests of Woodcrest systems (both uni and dual) at the same frequency as Clovertown. Kentsfield/Conroe systems (even though they aren't server parts) would also be nice to look at because of their more efficient use of memory bandwidth and FSB.
afuruhed - Thursday, December 28, 2006
We are getting more Clovertowns. There is a chart at pantor.com (http://www.pantor.com/software.html) that indicates that some applications benefit a lot. The FIX protocol (http://en.wikipedia.org/wiki/FIX_protocol) is a technical specification for electronic communication of trade-related messages (financial markets).
henriks - Thursday, December 28, 2006
Agree with other responses - good article!
Some comments on the jbb results page:
You state that JRockit is (only) available for x86-64 and Itanium. x86 and Sparc should be added to this list.
The JRockit configuration you're using enables a single-spaced GC. In that configuration, performance is tied to heap size (larger heap means fewer GC events). Increasing the heap size to 3 GB - as for the Sun benchmark results - would increase performance slightly but in particular give much better scalability when you increase the number of warehouses to large numbers.
It looks like you have not enabled large pages in the OS. Doing this would give a large performance boost and help scalability regardless of chip or JVM vendor.
Astute readers may note that your results are lower than the published results on www.spec.org. Apart from OS and possibly BIOS tuning, the reason is that the most recent results are using a newer JRockit version (not yet available for public download). This new version improves performance on this benchmark by 20-30% on x86 chips - Intel *and* AMD - with the largest positive effect on high-bin chips from the respective vendors. The effect on other Java applications varies from zero to a lot.
Cheers!
Henrik, JRockit team
dropadrop - Wednesday, December 27, 2006
Considering how much we just paid for some DL585s compared to DL380s, I think the performance is pretty impressive. There is still something the DL380s (and most other two socket servers) can't do, and that is hosting 64GB or more RAM.
I mainly take care of VMware servers, and there the amount of memory becomes a bottleneck long before the processors, at least in most setups. I don't think I'd have a lot of use for octal processors unless I got a minimum of 32GB of RAM, probably 64.
rowcroft - Thursday, December 28, 2006
I've run into the same challenge when planning for the quads. My take is that I'm getting dual quads for half the price of quad dual cores. With ESX 3's HA functionality I can group the host servers and get the 32GB of RAM with double the cores and have host based redundancy for critical VMs.
mino - Thursday, December 28, 2006
There is another thing the DL380 lacks: no drop-in analog to Barcelona on the horizon...
Justin Case - Wednesday, December 27, 2006
Finally a good article at AT, written by someone who knows what he's talking about. Meaningful benchmarks, meaningful comments, and conclusions that make sense. If only some Johanness could rub off on other AT writers...
hans007 - Wednesday, December 27, 2006
I think an alternative to, say, a dual dual core AMD, even as a server or workstation, is a quad core socket 775 CPU. I know the lower 3xxx series Xeons are made for this (and are exactly the same as Core 2 Duo), so you could do a comparison of 2 AMD dual cores vs. a single 775 quad with ECC DDR2 etc.
mino - Thursday, December 28, 2006
Check the QuadFX vs. Kentsfield reviews.
With ECC both results will be a bit lower but the comparison remains.
A small hint: NO ONE tested QuadFX as a DB server against Kentsfield....
Guess what: Quad FX is cheaper and would rule the roost on server-like tasks.
ltcommanderdata - Wednesday, December 27, 2006
Well it's nice to finally see a review of the 5145, although I was hoping for more detailed power consumption numbers. The performance benchmarks were very detailed though, which was great.
Thought I would point out a few errors I noticed as I was flipping through. First, on page 2, in the Cache2Cache Latency chart, the 201 for the Xeon DP 5060 that is placed in the "Same die, same package" row should be in the "Different die, same package" row. Dempsey uses a dual die approach like Presler and Clovertown, as opposed to a single die approach like Smithfield and Paxville DP. And on the last page, in the conclusion, you mentioned Clarksboro having "four DIBs", which implies 8 FSBs. I believe that should read two DIBs, or really a Quad Independent Bus (QIB), since I'm pretty sure it only has 4 FSBs. (On a side note, Intel slides showed those 4 FSBs clocked at 1066MHz, which is really disappointing. Hopefully, now that Clovertown turns out to come in 1333MHz versions instead of only the 1066MHz versions that were first announced, Tigerton (and therefore Clarksboro), which is based on Clovertown, will also have 1333MHz versions.)