Answered by the Experts: ARM's Cortex A53 Lead Architect, Peter Greenhalgh
by Anand Lal Shimpi on December 17, 2013 11:56 PM EST - Posted in Ask the Experts
Question from OreoCookie
Here's my question: Implementations of previous ARM cores by licensees, most notably the A15, feature much higher clocks than what ARM recommends. How has that influenced the design of the A53? Do you expect ARM's clock frequency design targets to be closer to the clocks in actual implementations?
Answer
Hi OreoCookie,
ARM processor pipelines allow the processor to be built to achieve certain frequencies, but we don't recommend or advise what they should be. After all, there are still ARM1136 processors being implemented today on 40nm, yet we designed the processor on 180nm!
We and our partners like the freedom to choose whether to push the frequency as far as it will go or to back off a little and save some area/power. This freedom allows for differentiation, optimisation around the rest of the platform, and better time-to-market (higher frequency = more effort = more time).
Naturally our pipelines have a range of sweet-spot frequencies on a given process node and there is a lot of discussion with lead partners about a new micro-architecture, but we aren't changing the pipelines based on the frequencies we're seeing in current mobile implementations.
Question from msm595
As someone starting their Computer Engineering degree and really interested in computer architecture, how can I give myself a head start?
Answer
Hi Msm,
Most good EE/CE degrees include a reasonable number of micro-architecture/architecture courses, but it doesn't hurt to understand what makes all the popular micro-architectures tick. For that matter, a lot of the designs in the '90s were impressive too - check out the DEC Alpha EV8, which never got to market but was a really interesting processor.
Question from tabascosauz
Hi all (and hopefully this gets a response from our guest Mr. Greenhalgh),
I'm not exactly well informed in the CPU department, so I won't pretend that I am. I'm just curious as to how the A53 will fare against the likes of Krait 450 and Cyclone in terms of DMIPS (as obsolete as some people may think it is, I'd just like to get a sense of its performance) and pipeline depth.
We're all assuming that Apple has gone ahead and used the ARMv8-A instruction set and, as per its usual routine, swelled the cores to many times the size of its competitors' and marketed it as a custom architecture. Since the A53 is also based on ARMv8, I'm wondering how this will translate into speed. I think someone's mentioned before that the A53 is the logical successor to the Cortex-A7, but my mind is telling me that there's more to the number in the name than just a random number a few integers below 57.
If this is essentially a quad-core part and succeeds the A7, then are we looking at placement in the Snapdragon 400 segment of the market? It would certainly satisfy the conditions of "mid-to-high end," but I'm a little disappointed in Cortex-A at the moment, considering that the A7 was introduced as a sort of energy-efficient, slightly lower-performing A9. I mean, the A12 is seen as the A9's successor, but it's still ARMv7-A and won't be hitting the market anytime soon, so would it be possible that we could see the A53, with its ARMv8 instruction set, on par with the Cortex-A12 in rough performance terms?
Can't wait until A57; it's bound to be a great performer!
Answer
Hi Tabascosauz,
Speaking broadly about Dhrystone: pipeline length is not relevant to the benchmark because perfect branch prediction is possible, which means that issue width to multiple execution units and fetch bandwidth largely dictate performance. This is the reason Dhrystone isn't great as a benchmark - it puts no pressure on the data- or instruction-side memory systems (beyond the L1 cache interfaces) or the TLBs, and little pressure on the branch predictors.
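To make this concrete, here is a minimal illustrative C++ sketch - not actual Dhrystone source - of the kind of loop such a benchmark reduces to: the branch pattern is fixed, so any modern predictor learns it perfectly, and the working set never leaves the registers and L1 cache, leaving fetch and issue width as the only limiter.

```cpp
// Illustrative sketch only (not Dhrystone code): a loop whose branches
// follow a strict repeating pattern and whose working set is a handful
// of registers. Once the predictor has learned the pattern, throughput
// is bounded purely by fetch/decode/issue width.
#include <cstdint>

uint32_t dhrystone_like(uint32_t n) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; ++i) {  // loop-back branch: ~always taken
        if ((i & 1) == 0)               // 0,1,0,1 pattern: trivially
            acc += i;                   // predicted after a few iterations
        else
            acc ^= i;
    }
    return acc;                         // no loads/stores beyond L1
}
```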
Cortex-A12 is a decent performance uplift from Cortex-A53, so we're not worried about overlap, and while the smartphone market is moving in the direction of 64-bit, there are still a lot of sockets for Cortex-A12. In addition, there are many other markets where Cortex-A9 has been successful (set-top box, digital TV, etc.) where 64-bit isn't a near-term requirement and Cortex-A12 will be a great follow-on.
Question from hlovatt
Can you explain what you mean by a 'weak' memory model, how this differs from other architectures, and how it translates into memory models in common languages like Java?
Answer
Hi hlovatt,
A weakly ordered memory model essentially allows reads (loads) and writes (stores) to overtake each other and to be observed by other CPUs/GPUs/etc. in the system at different times or in a different order.
A weakly ordered memory model allows the highest-performance system to be built, but requires the programmer to enforce order where necessary through barriers (sometimes termed fences). There are many types of barrier in the ARM architecture, from instruction-only (ISB) to full-system barriers (DSB) and memory barriers (DMB), with variants that, for example, only enforce ordering on writes rather than reads. The sketch below shows where such a barrier matters.
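As a concrete sketch of the classic message-passing pattern (illustrative code assuming a GCC/Clang toolchain targeting an ARMv7 or ARMv8 core; the variables and functions are hypothetical, only "dmb ish" is a real instruction):

```cpp
// Message-passing on a weakly ordered core. Without the barriers, another
// observer may see 'ready' become 1 before it sees the new 'payload'.
// "dmb ish" is the inner-shareable data memory barrier; this is an
// illustration of the principle, not production synchronisation code.
#include <cstdint>

volatile uint32_t payload = 0;
volatile uint32_t ready   = 0;

void producer() {
    payload = 42;                             // data store
    __asm__ volatile("dmb ish" ::: "memory"); // order data before flag
    ready = 1;                                // flag store
}

void consumer() {
    while (ready == 0) { }                    // spin on the flag
    __asm__ volatile("dmb ish" ::: "memory"); // order flag read before data read
    uint32_t v = payload;                     // now guaranteed to see 42
    (void)v;
}
```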
The Alpha architecture is the most weakly ordered of all the processor architectures I'm aware of, though ARM runs it close. x86 is an example of a strongly ordered memory model.
Recent programming standards such as C++11 assume a weakly ordered model, and ordering directives may be needed even on strongly ordered processors to prevent the compiler from reordering memory operations.
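For instance, a minimal sketch of the same message-passing idiom in portable C++11 - here the memory_order arguments constrain both the compiler and the hardware:

```cpp
// The release store 'publishes' payload; the acquire load synchronises
// with it, so the consumer is guaranteed to see payload == 42. On a
// strongly ordered machine the hardware part is nearly free, but the
// directive still stops the compiler from reordering the accesses.
#include <atomic>
#include <cstdint>

uint32_t payload = 0;
std::atomic<uint32_t> ready{0};

void producer() {
    payload = 42;
    ready.store(1, std::memory_order_release);
}

void consumer() {
    while (ready.load(std::memory_order_acquire) == 0) { }
    // payload is now guaranteed to read as 42
}
```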
Question from vvid
Hi Peter!
Can 32-bit performance degrade in future ARMv8 processor designs? ARMv7 requires some features omitted from the 64-bit ISA: arbitrary shifts, direct access to R15, conditional execution. I guess this extra hardware is not free, especially the latter.
Answer
Hi vvid,
Fortunately, while the ARM instruction set has evolved over the years, ARMv8 AArch32 (which is effectively ARMv7) isn't that far away from ARMv8 AArch64. A couple of big differences in AArch64 are fixed-length instructions (32-bit) rather than variable-length (16-bit or 32-bit) and essentially no conditional execution, but most of the main instructions in AArch32 have an AArch64 equivalent. As a micro-architect, one of the aspects I like most about the AArch64 instruction set is the regularity of the instruction encoding, as it makes decoding faster.
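A toy C++ sketch of why that regularity helps (an illustration of the principle, not ARM's actual decode logic): with every instruction exactly 32 bits, the positions of all the instructions in a fetch block are known before decoding begins, so a wide decoder can work on them in parallel; with a 16/32-bit variable-length encoding, locating instruction i+1 first requires decoding the length of instruction i.

```cpp
// Fixed-width decode: each slot in the fetch block sits at a known
// offset, so in hardware all kDecodeWidth instructions can be decoded
// in parallel rather than serially length-decoded.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDecodeWidth = 4;  // instructions per cycle (assumed)

void decode_block(const uint32_t* fetch_block) {
    for (std::size_t i = 0; i < kDecodeWidth; ++i) {
        uint32_t insn  = fetch_block[i];      // offset 4*i, known up front
        uint32_t group = (insn >> 25) & 0xF;  // example field extraction
        // ... dispatch on 'group' to the relevant decode table ...
        (void)group;
    }
}
```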
As such, the hardware cost of continuing to support AArch32 is not that significant, and it is more important to be able to support the thousands of software applications that have been compiled for ARMv7, which are fully compatible and run just fine on the generation of 64-bit ARM processors now arriving.
Question from elabdump
Very nice,
Here are some questions:
- Does ARM work on GCC development?
- Are there special instructions for crypto defined in the 64-bit ISA?
- If yes, are there patches for the upstream Linux kernel available?
- Are there instructions for SHA-3 available?
Would ARM change its mind about free Mali drivers?
Would ARM support device trees?
Answer
Hi Elabdump,
Yes, ARM works on GCC development, and yes, there are special crypto instructions defined in the v8 architecture (for AES and SHA).
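As a sketch of what those look like from C/C++ (the intrinsics below are from the ARM C Language Extensions in arm_neon.h and need a compiler targeting ARMv8-A with the crypto extension enabled, e.g. -march=armv8-a+crypto):

```cpp
// One AES encryption round using the ARMv8-A Cryptography Extension:
// AESE = AddRoundKey + SubBytes + ShiftRows, AESMC = MixColumns.
#include <arm_neon.h>

uint8x16_t aes_encrypt_round(uint8x16_t state, uint8x16_t round_key) {
    state = vaeseq_u8(state, round_key);  // AESE instruction
    state = vaesmcq_u8(state);            // AESMC instruction
    return state;
}
```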
As for patches, Mali drivers and device trees, these are handled by other teams in ARM. If you're interested in these wider questions about ARM technology, forums such as http://community.arm.com can help you.
Question from JDub8
My question is about processor architecture design in general - there can't be very many positions in the world for "lead processor/architecture designer," so how does one become one? Obviously promotion from within, but how do you get the opportunity to show your company you have what it takes to make the tough calls? There can't be very many textbooks on the subject, since you guys are constantly evolving the direction these things go.
How many people does it take to design a bleeding edge ARM processor? How are they split up? Can you give a brief overview of the duties assigned to the various teams that work on these projects?
Thanks.
Answer
Hi JDub8,
I'd imagine that ARM is not so different from any other processor company in that it is the strength of the engineering team that is key to producing great products.
Perhaps where ARM differs from more traditional companies is the level of discussion with the ARM partners. Even before an ARM product has been licensed by a partner, they get input into the product, and there will be discussions with partners at all levels, from junior engineers a few years out of college through multi-project veterans, lead architects, product marketing, segment marketing, sales, compiler teams, internal and external software developers, etc.
As a result, there are rarely 'tough calls' to be made as there's enough input from all perspectives to make a good decision.
In answer to your question about processor teams, these are typically made up of unit design engineers responsible for specific portions of the processor (e.g. the Load-Store Unit) working alongside unit verification engineers. In addition, there will be top-level verification teams responsible for making sure the processor works as a whole (using a variety of different verification techniques), implementation engineers building the design and providing feedback about speed-paths/power, and performance teams evaluating IPC on important benchmarks/micro-benchmarks.
And this is just the main design team! The wider team at ARM includes physical library teams creating advanced standard cells and RAMs (our POP technology), IT managing our compute cluster, marketing/sales working with partners, software teams understanding instruction performance, system teams understanding wider system performance/integration, and test-chip teams creating a test device.
All in all it takes a lot of people and a lot of expertise!
Question from twotwotwo
I. Core count inflation. Everyone but Apple lately has equated high-end with quad-core, which is unfortunate. I have a four-core phone, but would rather have a dual-core one that used those two cores' worth of die area for a higher-IPC dual-core design, or low-power cores for a big.LITTLE setup, or more L2, or most anything other than a couple of cores that are essentially always idle. Is there anything ARM could do (e.g., in its own branding and marketing or anything else) to try to push licensees away from this arms race that sort of reminds me of the megapixel/GHz wars and towards more balanced designs?
II. Secure containers. There has been a lot of effort put into lightweight software sandboxes lately: Linux containers are newly popular (see Docker, LXC, etc.); Google has released Native Client; process-level sandboxing is used heavily now. Some of those (notably NaCl) seem to be clever hacks implemented in spite of the processor architecture, not with its help. Virtualization progressed from being that sort of hack to getting ISA support in new chips. Do you see ARM having a role in helping software implementers build secure sandboxes, somewhat like its role in supporting virtualization?
III. Intel. How does it feel to work for the first company in a long while to make Intel nervously scramble to imitate your strategy? Not expecting an answer to that in a thousand years but had to ask.
Answer
Hi twotwotwo,
Core counts are certainly a popular subject at the moment!
From our perspective, we've consistently supported a quad-core capability on every one of our multi-core processors, all the way back to ARM11 MPCore, which was released in the mid-2000s. And while there's a lot of focus from the tech industry and websites like AnandTech on high-end mobile, our multi-core processors go everywhere from set-top boxes to TVs, in-car entertainment, home networking, etc., some of which can easily and consistently use four cores (and more, which is why we've built coherent interconnects to allow multiple clusters to be connected together).
The processors are designed to allow an ARM partner to choose between 1, 2, 3 or 4 cores, and the typical approach is to implement a single core and then instantiate it four times to make a quad-core, with the coherency+L2 cache layer connecting the cores together and power switching to turn unused cores off. The nice thing about this approach is that it is technically feasible to design a coherency+L2 cache solution that scales in frequency, energy-efficiency and IPC terms from 1 to 4 cores rather than compromising in any one area.
The result of this is that a dual-core implementation will be very similar in overall performance terms to a quad-core implementation. So while it may be that, for thermal reasons, running all four cores at maximum frequency for a sustained period is not possible, if two cores are powered off on a quad-core implementation it isn't any different from only having a dual-core implementation to start with. Indeed, for brief periods all four cores can be turned on as a turbo mode for responsiveness in applications that only want a burst of performance (e.g. web browsing). Overall there are few downsides to multi-core implementations outside of silicon area and therefore yield.
From a product perspective we've been consistent for almost a decade on the core counts provided by our processors and allow the ARM partners to choose how they want to configure their platforms with our technology.
Comments
Exophase - Wednesday, December 18, 2013 - link
Wonderful article. Thank you very much for your time and information.
syxbit - Wednesday, December 18, 2013 - link
Great answers! It's too bad none of the questions about the A15 losing so badly to Krait and Cyclone were brought up.
ciplogic - Wednesday, December 18, 2013 - link
It was about the Cortex A53. Also, the answers were politically neutral (as they should be), as politics and other companies' future development are not the engineer's to discuss. Maybe an engineer from Qualcomm could answer accurately.
lmcd - Wednesday, December 18, 2013 - link
Krait 200 is way worse than the A15. If A15 revisions come in, then the A15 could easily keep pace with Krait, but I don't know if ARM does those. Cyclone is an ARM Cortex-A57 competitor.
Wilco1 - Wednesday, December 18, 2013 - link
The A15 has much better IPC than Krait (which is why in the S4 Krait needs 2.3GHz to get similar performance to the A15 at just 1.6GHz). The only reason Krait can keep up at all is that it uses 28nm HPM, which allows much higher frequencies.
ddriver - Wednesday, December 18, 2013 - link
Really? The Note 3 with Krait is pretty much neck and neck with the Exynos Octa version at 1.9GHz, which was an A15 design last time I checked.
Wilco1 - Wednesday, December 18, 2013 - link
Sorry, it was 1.6 vs 1.9GHz in the S4 and 1.9 vs 2.3GHz in the Note 3. Both are pretty much matched on performance, so Krait needs ~20% higher clocks.
cmikeh2 - Wednesday, December 18, 2013 - link
I don't know if those are normalized for actual power consumption, and we have to deal with different process technologies as well. Good IPC is pretty much meaningless in this segment if it requires ridiculous voltages to hit the frequencies it needs.
Wilco1 - Thursday, December 19, 2013 - link
It does seem Samsung had some issues with its process. NVIDIA was able to reach higher frequencies easily at low power. I haven't seen detailed power consumption comparisons between the two variants of the S4 and Note 3 under load, but there is certainly a power cost to pushing a CPU to high frequencies (high voltages indeed!), so having better IPC helps.
twotwotwo - Wednesday, December 18, 2013 - link
I'm not even sure I'd count the A15 out yet. I have a vague impression that power draw is part of why it didn't get more wins; if so, the next process generation might help with that. Folks on AT will have moved on to thinking about 64-bit chips by then, but as Peter put it, there will be plenty of lower-end sockets left to fill. Also, there was an A15 in the most popular ARM laptop yet (the Exynos in the Chromebook), so at least it got one really neat win. :)