Ask the Experts: Krisztián Flautner VP of R&D, ARM
by Anand Lal Shimpi on June 3, 2014 12:00 PM EST - Posted in
- CPUs
- Arm
- Ask the Experts
- SoCs
Late last year we did an installment of Ask the Experts with ARM's Peter Greenhalgh, lead architect for the Cortex A53. The whole thing went so well, in no small part thanks to your awesome questions, that ARM is giving us direct access to a few more key folks over the coming months.
Krisztián Flautner is Vice President of Research and Development at ARM, and as you can guess, he's focused not on the near term but on what's coming down the road for ARM. ARM recently celebrated its 50 billionth CPU shipped via its partners; Krisztián is more focused on the technologies that will drive the next 100 billion shipments.
Krisztián holds PhD, MSE and BSE degrees in computer science and engineering from the University of Michigan. He leads a global team that researches everything from circuits to processor/system architectures and even devices. And he's here to answer your questions.
If there's anything you want to ask the VP of R&D at ARM, this is your chance. Leave a comment with your question and Krisztián will go through and answer as many as he's able to. If you've got questions about process tech, Moore's Law, ARM's technology roadmap planning or pretty much anything about where ARM is going, ask away!
62 Comments
AndreiLux - Tuesday, June 3, 2014 - link
Hello, two main (related) questions here. My first question is regarding ARM's case for synchronous clock and voltage domains as an actual advantage over asynchronous ones. ARM seems to be definitely convinced that going synchronous is the right choice, but aside from one whitepaper which Samsung released regarding big.LITTLE some time ago (http://mobile.arm.com/files/downloads/Benefits_of_... there has never been any in-depth technical justification for the matter.
How far has ARM actually researched the matter to come to this conclusion? What are the arguments against it, not only on an architectural basis but also on a platform basis (increased PMIC voltage-rail power-conversion overhead, for example)?
Current power management mechanisms seem extremely "dumb" in the sense that they are almost never able to control DVFS and idle states in a way that makes full, efficient use of the hardware in terms of power efficiency and performance. Current Linux kernel discussions are (finally) trying to merge these mechanisms into the scheduler to improve their functioning. My understanding is that the upcoming A53/A57 also have more advanced retention states than the current generation of cores to allow for better use of idle states, much like the A15 generation of SoCs got rid of legacy hot-plugging of cores in favor of power-gated C-states.
My second question is about the reasoning behind hardware PPMU controllers in charge of DVFS being basically nonexistent in the current mobile ARM space. ARM's reasoning until now is that a hardware controller is not able to have a full "overview" of system load to base its decisions on, but that doesn't explain why a software-policy-based hardware controller could not achieve this, as it would have the advantages of both worlds with none of the disadvantages. We would achieve much finer granularity, orders of magnitude finer than what current software-based mechanisms run at. Any comments on the matter?
KFlautner - Thursday, June 5, 2014 - link
Some of what you are asking about are implementation choices that our silicon partners would make, which I cannot really comment on. But roughly speaking, for cores in an MP cluster we would expect these to run at the same voltage and at synchronous frequencies, and communication with the L2 cache would be synchronous as well (the L2 typically runs at 1/2 the speed of the L1). The interface between the CPU and the bus fabric is often asynchronous. Different cores within the cluster can be power-gated, but the primary supplies (sometimes different supplies for RAM and logic) would be the same. The following document states some of these assumptions:
http://www.arm.com/files/pdf/big_LITTLE_technology...
"The CPU clusters are connected to the cache coherent interconnect through an asynchronous bridge to enable each CPU cluster to be scaled independently in frequency and voltage. The independent DVFS operation of the CPU clusters allows the system to more closely track the performance requirements in loads where there is a mix of high-performance and background activity."
As for controlling DVFS state, some of our system designs use a dedicated Cortex-M3 processor in the system to manage the various power states. The reason for the dedicated processor is that there is much more going on in an SoC than just activity on the application processors, and it's easier to observe this activity from different positions on the SoC. And you'd like to be able to make power-related decisions without having to wake up the big CPUs all the time - let them rest as much as they can. M-class processors have much lower latency and consume less power, so they are a natural place in the system for making power-related decisions.
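To make the idea of a policy running on a small power-management core concrete, here is a minimal sketch of a utilization-driven DVFS loop of the sort such a controller could run. All names, thresholds and operating points are hypothetical illustrations, not ARM's actual firmware interface:

```cpp
// Minimal sketch of a utilization-driven DVFS policy of the kind a small
// power-management core (e.g. a Cortex-M3) could run on behalf of the big CPUs.
// All names, thresholds and table values are made-up illustrations.
#include <array>
#include <cstdint>
#include <cstdio>

struct OppEntry { uint32_t freq_mhz; uint32_t voltage_mv; };

// Example operating performance points for one CPU cluster (invented values).
constexpr std::array<OppEntry, 4> kOppTable{{
    {500, 900}, {800, 950}, {1200, 1050}, {1700, 1150}
}};

// Stand-in for platform hooks: a real controller would read hardware activity
// counters and program the PLL/PMIC instead of printing.
static void apply_opp(const OppEntry& opp) {
    std::printf("set %u MHz @ %u mV\n", opp.freq_mhz, opp.voltage_mv);
}

// One policy tick: scale up quickly when the cluster is busy, step down when idle.
static void dvfs_policy_tick(uint32_t util_pct, std::size_t& idx) {
    if (util_pct > 85 && idx + 1 < kOppTable.size()) ++idx;
    else if (util_pct < 40 && idx > 0) --idx;
    apply_opp(kOppTable[idx]);
}

int main() {
    std::size_t idx = 0;  // start at the lowest operating point
    const uint32_t fake_utilization[] = {10, 30, 90, 95, 95, 60, 20, 5};
    for (uint32_t util : fake_utilization) dvfs_policy_tick(util, idx);
}
```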
maltanar - Tuesday, June 3, 2014 - link
The computer architecture research community seems to increasingly lean towards hardware accelerators and heterogeneity (not just big.LITTLE and GPGPU, but all sorts of mixes of architectural and microarchitectural techniques) for increasing energy efficiency/performance as Moore's Law and Dennard scaling don't seem to work that well anymore. My question is: does ARM R&D see some future for widely applicable accelerator-rich architectures, or do they deem it to be a passing trend? If they are here to stay, there will probably be quite a few software-related issues that follow, and I'm quite curious as to how ARM R&D thinks these may be addressed.
KFlautner - Wednesday, June 4, 2014 - link
It's not a passing fad: SoCs have been accelerating various functions for many decades now. Do I think that this will become more central to our strategy? No. There are a number of reasons for this. The functions initially accelerated by external hardware often end up being incorporated into the architecture over time. This usually makes their use easier and provides a more stable software target. Coordinating the activities of a large number of accelerators is as hard a problem as the general parallelization problem; you can have some good application-specific solutions, but it is very difficult to make them broadly applicable. And there is always Amdahl's law to worry about - it applies to accelerators as well. I do see a role for increasing heterogeneity in systems, but this is about how we best cater to supporting different styles of parallelism and activity levels in different applications, rather than speeding up the individual compute kernels.
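A quick back-of-the-envelope illustration of the Amdahl's law point (added here for clarity, not part of the answer): the overall speedup from an accelerator is bounded by the fraction of the workload it can actually offload, speedup = 1 / ((1 - f) + f / s), where f is the offloadable fraction and s is the accelerator's speedup on that fraction.

```cpp
// Amdahl's law applied to an accelerator: overall speedup is limited by the
// fraction of the workload that can actually be offloaded, no matter how fast
// the accelerator is on its portion of the work.
#include <cstdio>

static double amdahl(double offload_fraction, double accel_speedup) {
    return 1.0 / ((1.0 - offload_fraction) + offload_fraction / accel_speedup);
}

int main() {
    // Even a 100x accelerator covering 80% of the work yields under 5x overall.
    std::printf("f=0.80, s=10x  -> %.2fx overall\n", amdahl(0.80, 10.0));
    std::printf("f=0.80, s=100x -> %.2fx overall\n", amdahl(0.80, 100.0));
    std::printf("f=0.95, s=100x -> %.2fx overall\n", amdahl(0.95, 100.0));
}
```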
Ikefu - Tuesday, June 3, 2014 - link
What do you see coming down the road from ARM that is relevant to the DIY maker and the Internet of Things? I'm intrigued by the forthcoming Arduino Zero and am an avid user of Linux boards like the Raspberry Pi and BeagleBone Black for my robotics projects. How do you see ARM being able to improve this space in the future?
KFlautner - Wednesday, June 4, 2014 - link
What we can do is reduce the barrier to entry for the "long tail" of developers who are looking to build prototypes and then take them to product. Availability of hardware and modules is one important factor, but equally important is the software / toolchain story around them. We have some exciting plans for mbed (check out mbed.org) that we'll be talking about later this year.
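As an illustration (not part of the answer above), the classic mbed "blinky" gives a sense of how low that barrier already is. A minimal sketch of the standard mbed 2-era LED-toggle program; pin names such as LED1 depend on the target board:

```cpp
// Classic mbed "blinky" as written in the online toolchain circa 2014.
// LED1 is a board-dependent pin alias, so treat this as a sketch.
#include "mbed.h"

DigitalOut led(LED1);    // on-board LED

int main() {
    while (true) {
        led = !led;      // toggle the LED
        wait(0.5);       // delay in seconds (mbed 2 API)
    }
}
```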
"Availability of hardware and modules is one important factor "thats good if the IP people really want is there, mbed.org seem interesting if you can cater for the novice as well as the semi-pro.
For instance, none of the current vendors produce what many people want, and looking at the long tail of people who design/make/perhaps produce PCBs in the tens or hundreds, I'm optimistic that most interested people will want access to
USB3.1,
PCI-e3/4
dual/quad DRAM controllers with compatible, tested IP for the likes of the 3.2+ GBytes/second NVR Everspin ST-MRAM DDR3 modules
http://www.everspin.com/PDF/ST-MRAM_Presentation.p...
and even mbed.org slices http://www.cnx-software.com/2013/12/29/xmos-xcore-...
Khenglish - Tuesday, June 3, 2014 - link
The fabrication companies are having a lot of difficulty achieving high enough yields to mass produce 20nm processors. Yield problems will increase as companies attempt to shrink to even smaller processes and adopt FinFETs. How concerned are you that we are near the end of the road for continual CMOS process shrinks for increased transistor counts and performance improvements? How confident are you that for the next five years or so we will continue to see transistor-level device improvements that can be implemented in future processors?
KFlautner - Wednesday, June 4, 2014 - link
Don't forget that Moore's Law is an economic law, and the primary reason it will stop is economics (not physics). If a manufacturer doesn't get enough bang for their buck, they will get off the treadmill. My team has been working on predictive technology models to understand the issues with future generations of processes: there are conceivable combinations of technologies to create new process nodes for the next decade or more. The real question is whether it is actually worth doing so, as the improvements may not be commensurate with the expense.
PeteH - Tuesday, June 3, 2014 - link
Will 2014 be the year Michigan football finally returns to the top of the B1G?