The HTC One Review
by Brian Klug on April 5, 2013 8:50 PM EST - Posted in
- Smartphones
- HTC
- Android
- Mobile
- HTC One
- Snapdragon 600
The One: Powered by Qualcomm's Snapdragon 600
At the beginning of 2012, the flagship smartphone platform was Qualcomm's MSM8960, featuring two Krait cores, an Adreno 225 GPU and integrated LTE on TSMC's 28nm LP process. By the end of the year, the target shifted to Qualcomm's Fusion 3 platform: APQ8064/Snapdragon S4 Pro, featuring four Krait cores, an Adreno 320 GPU and a discrete MDM9x15 LTE baseband, also on TSMC's 28nm LP process. Here we are, less than half a year later, and the bar has been raised once more.
The new platform for any flagship Android smartphone is Qualcomm's recently announced Snapdragon 600. At the heart of the Snapdragon 600 platform are four Krait 300 CPU cores and an Adreno 320 GPU. The move from Krait (also known as Krait 200) to Krait 300 CPU cores comes with a handful of microarchitectural level improvements, not all of which have been disclosed publicly at this point.
Krait 300 is still built on the same 28nm LP process at TSMC, Samsung and perhaps a third foundry at this point. The pipeline of Krait 300 hasn't been changed, but Qualcomm claims it is able to run Krait 300 at higher clocks than the original version of the core without relying simply on voltage scaling. Indeed we see this in HTC's One, which can run each core at up to 1.7GHz compared to the 1.5GHz max in the APQ8064 based Droid DNA. Samsung has already announced that some versions of its Galaxy S 4 will feature the Snapdragon 600 with its CPU cores running at up to 1.9GHz.
As with previous implementations of Krait, each Krait 300 core can operate at its own frequency and voltage independently of the other cores. Each core is also power gated, so idle cores aren't a power burden on the system.
Diving a bit deeper, Krait 300 introduces a hardware L2 data prefetcher, responsible for using available memory bandwidth to preemptively bring data into the L2 cache ahead of actual demand. Any sort of prefetching (or speculative execution) typically comes with a significant power penalty, which is why we don't see a ton of it in mobile - at least not without an associated move to a smaller manufacturing process.
Branch prediction accuracy improves with Krait 300, and the new architecture is capable of executing more instructions out of their original program order. Krait 300 still isn't as aggressively out-of-order as ARM's Cortex A15, but it reorders more than the original Krait core did. Finally, Qualcomm claims improvements to both FP and JavaScript performance with Krait 300, but we still don't have details as to how.
In general, Qualcomm claims we should expect a 15% increase in CPU performance at the same frequency for Krait 300 vs. the original Krait. Factor in clock speed improvements and you're looking at 25 - 30% at the high end. To find out just how much of an improvement exists in the real world, we once again turn to a combination of microbenchmarks and the usual set of web browser and native client tests.
I started out by running Geekbench 2.0 on the One and HTC's Butterfly, an APQ8064 platform. The One has its CPU cores clocked at up to 1.7GHz while the Butterfly tops out at 1.5GHz. As I've done in the past, I'm presenting the Geekbench results at native speeds as well as scaled up by 13% to simulate a theoretical 1.7GHz APQ8064.
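The simulation itself is nothing exotic: the 13% figure is just the clock ratio between the two parts, applied under the optimistic assumption of perfect frequency scaling (which only holds for workloads that aren't memory bound):

```python
# Simulating a 1.7GHz APQ8064 score from a measured 1.5GHz result.
# Assumes performance scales linearly with clock - optimistic for
# memory-bound tests, reasonable for cache-resident ones.
clock_ratio = 1.7 / 1.5  # ~1.133, i.e. the 13% scaling factor
measured_integer_overall = 973  # 1.5GHz Krait integer score
simulated_integer_overall = round(measured_integer_overall * clock_ratio)
```

The same multiplication produces every value in the "Simulated" columns below.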
We'll start out with a look at integer performance. These benchmarks exclusively use integer data and are generally small enough to fit in at least the processor's L2 cache. Each benchmark runs in both a single threaded (1 core) and multithreaded (many core) instance. Since both SoCs feature four CPU cores, we should get a good idea of how multithreaded scaling changed with the move to Krait 300/Snapdragon 600.
Geekbench 2 - Microbenchmark Performance Comparison - Integer Performance

| Benchmark | Krait - 1.5GHz | Krait - 1.7GHz (Simulated) | Krait 300 - 1.7GHz | Krait 300/Krait Perf Advantage |
|---|---|---|---|---|
| Integer (Overall) | 973 | 1103 | 1959 | 77.6% |
| Blowfish | 23.5 | 26.6 | 34.6 | 29.9% |
| Blowfish MT | 52.9 | 59.9 | 129 | 115.2% |
| Text Compression | 1.84 | 2.08 | 2.41 | 15.6% |
| Text Compression MT | 3.07 | 3.47 | 8.16 | 134.5% |
| Text Decompression | 1.78 | 2.01 | 2.34 | 16.5% |
| Text Decompression MT | 3.92 | 4.44 | 6.08 | 36.9% |
| Image Compression | 4.98 | 5.64 | 8.78 | 55.6% |
| Image Compression MT | 15.7 | 17.7 | 21.8 | 22.5% |
| Image Decompression | 7.41 | 8.39 | 13.6 | 61.9% |
| Image Decompression MT | 21.0 | 23.8 | 45.1 | 89.5% |
| LUA | 299.0 | 338.8 | 607.6 | 79.3% |
| LUA MT | 753.0 | 853.4 | 2090 | 144.9% |
| ST Average | - | - | - | 43.1% |
| MT Average | - | - | - | 90.6% |
The Blowfish test is an encryption/decryption test that implements the Blowfish algorithm. The algorithm itself is fairly cache intensive and features a good amount of integer math and bitwise logical operations. Here we see a 30% increase in single threaded performance, but a whopping 115% increase in the multithreaded Blowfish test. We see this same ST/MT disparity echoed in a number of other benchmarks here as well.
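Geekbench's actual Blowfish implementation isn't public, but the general shape of the workload - a Feistel network built from XORs, adds, shifts and multiplies on 32-bit words - can be sketched as follows (the F-function and key constants here are made up purely for illustration; real Blowfish uses key-dependent S-boxes and 16 rounds):

```python
MASK32 = 0xFFFFFFFF

def f(x):
    # Toy nonlinear mixing function standing in for Blowfish's key-dependent
    # S-box lookups: a multiply, a shift and an XOR on a 32-bit word.
    x &= MASK32
    return ((x * 2654435761) ^ (x >> 13)) & MASK32

def feistel_round(l, r, k):
    # One Feistel round: the right half is mixed with the round key and
    # folded into the left half, then the halves swap.
    return r, l ^ f(r ^ k)

def encrypt_block(l, r, keys):
    for k in keys:
        l, r = feistel_round(l, r, k)
    return l, r

def decrypt_block(l, r, keys):
    # A Feistel network inverts itself: run the same rounds with the key
    # schedule reversed and the halves swapped on the way in and out.
    l, r = r, l
    for k in reversed(keys):
        l, r = feistel_round(l, r, k)
    return r, l
```

Every round is integer math and bitwise logic over a tiny working set, which is why the test lives almost entirely in cache.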
The text compression/decompression tests use bzip2 to compress/decompress text files. As text files compress very well, these tests become great low level CPU benchmarks. The bzip2 front end does a lot of sorting, and is thus very branch heavy as well as heavy on logical operations (the integer ALUs get used here). We don't know much about the size of the data set, but given the short run times we're not talking about compressing/decompressing all of the text in Wikipedia. It's safe to assume that these tests run mostly out of cache. Here we see a 15.6% advantage over a perfectly scaled Krait 200, but once again the multithreaded gains are far larger (134.5%). Qualcomm may have improved cache coherency/cache sharing performance in Krait 300, which would result in a significant speedup when multiple threads are accessing the same data. All of this is speculation at this point, however, since I don't have good low level information on what Krait's architecture actually looks like.
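Python's standard bz2 module wraps the same underlying bzip2 library, which makes it easy to demonstrate why text is such a friendly input - repetitive prose shrinks dramatically, keeping the benchmark compute bound rather than memory or I/O bound (the sample input below is arbitrary; Geekbench's actual data set is unknown):

```python
import bz2

# Highly repetitive English-like text, roughly in the spirit of a
# text-compression benchmark input.
text = b"the quick brown fox jumps over the lazy dog. " * 200

compressed = bz2.compress(text)      # block-sorting (BWT) front end + Huffman
restored = bz2.decompress(compressed)
ratio = len(text) / len(compressed)  # far greater than 1 for input this redundant
```

The sorting inside the Burrows-Wheeler transform is where the branch-heavy, ALU-bound behavior described above comes from.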
The image compression/decompression tests are particularly useful as they just show JPEG compression/decompression performance, a very real world use case that's often seen in many applications (although hardware JPEG acceleration does limit its usefulness these days in mobile). The code here is once again very integer math heavy (adds, divs and muls), with some light branching. On the compression side we see a reversal of the earlier trends - single threaded performance is where we see the biggest improvement (55.6%) while MT scaling is roughly half that. On the decompression side, both ST and MT performance improves by a healthy amount. Improved branch prediction likely plays a role here. I'm still not totally clear on why we're seeing inconsistency in ST and MT scaling here between Krait 200 and Krait 300, but it's obvious that Qualcomm has done significant work under the hood of the Krait microarchitecture.
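JPEG's heavy lifting is the 8x8 DCT that converts pixel blocks into frequency coefficients. A naive 1-D DCT-II (the real 2-D transform applies it row- then column-wise, and production codecs use fast factored forms rather than this direct one) shows where the adds, muls and divs come from:

```python
import math

def dct_1d(block):
    # Naive O(n^2) DCT-II over one 8-sample row: each output coefficient is
    # a sum of input samples weighted by cosines - pure FP multiply/add.
    n = len(block)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, x in enumerate(block))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out
```

A flat block produces only a DC coefficient with near-zero AC terms, which is exactly why smooth image regions compress so well.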
The final set of integer tests are scripted LUA benchmarks that find all of the prime numbers below 200,000. As with most primality tests, the LUA benchmarks here are heavy on adds/muls with a fair amount of branching. Krait 300's gains here are nothing short of impressive both in single and multithreaded performance. The improved branch prediction accuracy likely helps here, but there's still some unknown impact that enables greater-than-expected MT scaling.
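The benchmark's Lua script isn't public, but the task itself is easy to sketch - shown here in Python with a sieve for brevity (a scripted trial-division loop, which is likely closer to what the benchmark runs, would be even heavier on branches and divides):

```python
def count_primes_below(limit):
    # Sieve of Eratosthenes: count all primes strictly below `limit`.
    # The benchmark's target of 200,000 keeps the working set modest.
    if limit < 2:
        return 0
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            # Cross off every multiple of p starting at p*p.
            sieve[p * p::p] = bytearray(len(sieve[p * p::p]))
    return sum(sieve)
```

Calling `count_primes_below(200_000)` mirrors the benchmark's task.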
On average there's a 43.1% increase in non-memory-bound single threaded integer performance and almost twice that on the multithreaded side. Remember we're talking about very targeted benchmarks here, so a real world single threaded perf increase closer to 15% wouldn't be surprising.
Geekbench 2 - Microbenchmark Performance Comparison - FP Performance

| Benchmark | Krait - 1.5GHz | Krait - 1.7GHz (Simulated) | Krait 300 - 1.7GHz | Krait 300/Krait Perf Advantage |
|---|---|---|---|---|
| FP (Overall) | 2199 | 2492 | 4244 | 70.3% |
| Mandelbrot | 201.0 | 227.8 | 416.5 | 82.8% |
| Mandelbrot MT | 582.0 | 659.6 | 1650.0 | 150.2% |
| Dot Product | 442.0 | 500.9 | 855.5 | 70.8% |
| Dot Product MT | 944.0 | 1069.8 | 3070.0 | 187.0% |
| LU Decomposition | 184.0 | 208.5 | 368.3 | 76.6% |
| LU Decomposition MT | 571.0 | 647.1 | 795.9 | 23.0% |
| Primality | 180.0 | 204.0 | 370.5 | 81.6% |
| Primality MT | 491.0 | 556.4 | 657.2 | 18.1% |
| Sharpen Image | 3.85 | 4.36 | 7.15 | 63.9% |
| Sharpen Image MT | 13.0 | 14.7 | 18.6 | 26.2% |
| Blur Image | 2.31 | 2.61 | 3.51 | 34.1% |
| Blur Image MT | 5.73 | 6.49 | 12.9 | 98.6% |
| ST Average | - | - | - | 68.3% |
| MT Average | - | - | - | 83.8% |
The Mandelbrot benchmark simply renders iterations of the Mandelbrot set. Here there's a lot of floating point math (adds/muls) combined with a fair amount of branching as the algorithm determines whether or not values are contained within the Mandelbrot set. Right off the bat we see significant gains in FP performance, and once again we see the same non-linear scaling when we compare the MT improvement to the ST improvement.
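The per-point membership test is only a few lines - a complex multiply/add per iteration plus an escape-check branch, which is exactly the mix described above (the iteration cap here is arbitrary):

```python
def in_mandelbrot(c, max_iter=100):
    # Iterate z <- z^2 + c; if |z| ever exceeds 2 the orbit escapes and c
    # lies outside the set. Each iteration is FP multiply/add plus a branch.
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return False
    return True
```

Rendering the full set just runs this test once per pixel, which is also why the workload parallelizes so cleanly across cores.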
The Dot Product test is simple enough: it computes the dot product of two FP vectors. Once again there are a lot of FP adds and muls here as the dot product is calculated. With a 70% increase here (ST), we're able to put numbers to Qualcomm's claims of increased FP performance. Multithreaded performance improves by an even larger margin.
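For completeness, the kernel in question is one FP multiply and one FP add per element pair:

```python
def dot(a, b):
    # Dot product of two equal-length FP vectors: n multiplies and n-1 adds,
    # with essentially no branching in the hot loop.
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))
```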
The LU Decomposition tests factorize a 128 x 128 matrix into a product of two matrices. The sheer size of the source matrix guarantees that this test has to hit the L2 cache. The math involved is once again FP adds/muls, but the big change here appears to be the size of the dataset. Interestingly enough, the multithreaded performance improvement isn't anywhere near as large in this physically bigger FP workload. Single threaded performance gains are still significant, however. I wonder if the better than expected MT scaling results might be limited to workloads where the datasets are small enough to be contained exclusively in the L1 cache.
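A minimal Doolittle-style factorization (no pivoting - fine for illustration, not for production numerics) shows the multiply/add-dominated inner loops, and the arithmetic on size: a 128 x 128 matrix of doubles is 128KB on its own, comfortably past typical L1 data caches:

```python
def lu_decompose(a):
    # Doolittle LU factorization without pivoting: A = L * U, where L has a
    # unit diagonal. Inner loops are dot-product-style FP multiply/adds.
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            U[i][j] = a[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        L[i][i] = 1.0
        for j in range(i + 1, n):
            L[j][i] = (a[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U
```

Multiplying L by U reconstructs the original matrix, which is a quick sanity check on any implementation.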
The Primality benchmarks perform the first few iterations of the Lucas-Lehmer test on a specific Mersenne number to determine whether or not it's prime. The math here is very heavy on FP adds, multiplies and sqrt functions. The data set shouldn't be large enough to require trips out to main memory. The performance results here are very similar to what we saw in the LU decomposition tests - significant gains in single threaded perf, and more subtle increases in MT performance.
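The Lucas-Lehmer test itself is compact enough to show in full. Geekbench evidently runs it in floating point (hence the FP add/mul/sqrt mix); the integer version below is the textbook form of the same recurrence:

```python
def mersenne_is_prime(p):
    # Lucas-Lehmer test: for an odd prime exponent p, M_p = 2^p - 1 is
    # prime iff s_(p-2) == 0, where s_0 = 4 and s_(i+1) = s_i^2 - 2 (mod M_p).
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0
```

Each iteration is one big multiply plus a reduction, so the working set is tiny - consistent with the benchmark staying out of main memory.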
Both the sharpen and blur tests apply a convolution filter to an image stored in memory. The application of the filter itself is a combination of matrix multiplies, adds, divides and branches. The size of the data set likely hits the data cache a good amount. The sharpen test shows great scaling in single threaded performance, but less scaling in the multithreaded workload - while the blur test does the exact opposite.
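Both filters reduce to the same 3x3 convolution loop, differing only in kernel weights (a uniform box kernel blurs; a center-heavy kernel with negative neighbors sharpens). A sketch over a plain 2-D list, with borders left untouched for brevity:

```python
def convolve3x3(image, kernel):
    # Apply a 3x3 convolution to the interior of a 2-D grid: nine FP
    # multiply/adds per output pixel, matching the mix described above.
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # borders pass through unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(image[y + dy][x + dx] * kernel[dy + 1][dx + 1]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return out
```

A uniform image convolved with a normalized box kernel comes back (numerically) unchanged, which makes for an easy correctness check.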
Overall FP performance improvements are solid. Single threaded performance improves a bit more than what we saw with the integer workloads, while overall multithreaded FP is somewhat similar. Bottom line, small size compute limited workloads seem to improve tremendously with Krait 300.
The next table of results looks at memory bound performance to see if things have improved on that front. We don't have a good idea of the DRAM used in the HTC One vs. the HTC Butterfly, and since memory performance doesn't necessarily improve with clock speed I'm not presenting the scaled Krait 200 results:
Geekbench 2 - Microbenchmark Performance Comparison - Memory Performance

| Benchmark | Krait - 1.5GHz | Krait 300 - 1.7GHz | Krait 300/Krait Perf Advantage |
|---|---|---|---|
| Memory Overall | 1244 | 2617 | 110.4% |
| Read Sequential | 347 | 1160 | 234% |
| Write Sequential | 950 | 2940 | 209% |
| Stdlib Alloc | 1.11 | 1.40 | 26.1% |
| Stdlib Write | 2.91 | 4.74 | 62.9% |
| Stdlib Copy | 2.98 | 5.33 | 78.9% |
| STREAM Overall | 571 | 657 | 15.1% |
| Copy | 804 | 916.5 | 14.0% |
| Scale | 852 | 849.7 | 0% |
| Add | 923 | 1005.6 | 8.9% |
| Triad | 672 | 970.8 | 44.5% |
The read and write sequential tests measure sequential reads from and writes to a buffer in memory. In both situations, performance on Krait 300 is significantly better - more than doubling its predecessor.
The C-library functions all improve by varying degrees - it's clear that peak memory performance improved with Krait 300.
STREAM gives us a good look at sustained memory bandwidth, and here we don't see a ton of improvement in most of the tests. The copy, scale and add tests just copy data around in memory, with the latter two tests including a single FP operation (mul then add, respectively). It's only the final test, Triad, where we see significant scaling. The only difference there is the use of two FP operations, with the add being dependent on the results of the previous mul. Here we could be seeing the results of the ability to forward data between pipeline stages in Krait 300.
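The four STREAM kernels are one line each, which makes the dependency structure easy to see - only triad feeds the result of a multiply straight into an add. The sketch below shows the kernels independently over the input arrays (the real benchmark chains them in sequence over large C arrays; array names and the scalar q follow STREAM's conventions):

```python
def stream_kernels(a, b, c, q):
    # STREAM's four kernels. Only triad chains a dependent multiply -> add,
    # the pattern where forwarding results between pipeline stages pays off.
    copy  = list(a)                            # c[i] = a[i]
    scale = [q * x for x in c]                 # b[i] = q * c[i]
    add   = [x + y for x, y in zip(a, b)]      # c[i] = a[i] + b[i]
    triad = [x + q * y for x, y in zip(b, c)]  # a[i] = b[i] + q * c[i]
    return copy, scale, add, triad
```

In the real benchmark each kernel streams through arrays far larger than any cache, which is what makes STREAM a sustained-bandwidth test rather than a compute test.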
To corroborate the Geekbench 2 results, we turn to AndEBench - effectively Coremark for Android. The Coremark source is composed of very tiny benchmarks that fit entirely within local caches, giving us a poor indication of system level performance but good insight into improvements at a microarchitectural level. The results below appear in both native code and Java runtime formats:
AndEBench - Microbenchmark Performance Comparison

| Benchmark | Krait - 1.5GHz | Krait - 1.7GHz (Simulated) | Krait 300 - 1.7GHz | Krait 300/Krait Perf Advantage |
|---|---|---|---|---|
| Native Score | 6627 | 7510 | 8958 | 19.3% |
| Java Score | 221 | 250 | 432 | 72.5% |
Native performance increased by only 19.3%, while Java performance improved by a healthy 72.5%. AndEBench doesn't actually give us any new data, but it does help to validate some of the gains we saw earlier.
Overall it's clear that Krait 300 includes substantial improvement to register file access and integer/FP compute. Sustained memory bandwidth alone doesn't improve by a huge amount, which makes sense given Krait's already decent memory subsystem (at least compared to ARM's Cortex A9). Features such as data forwarding and improved branch prediction can also have significant impacts on performance. Finally, there seems to be some enhancement in Krait 300/Snapdragon 600 for improving multithreaded performance when all cores are working on a similar program with shared data. All of these improvements make Krait 300 a significant upgrade, at least from a microarchitectural perspective, over its predecessor. While we don't have good low level power analysis for Krait 300 vs. the original Krait yet, it's entirely possible that the nature of these improvements won't come at a significant power penalty. In particular, improving branch prediction accuracy can lead to lower power consumption as well as higher performance as mispredicted branches result in wasted compute and wasted power.