Ask the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde
by Anand Lal Shimpi on May 14, 2012 3:46 PM EST - Posted in
- CPUs
- AMD
- Ask the Experts
- GPUs
AMD’s Manju Hegde is one of the rare folks I get to interact with who has an extensive background working at both AMD and NVIDIA. He was one of the co-founders and CEO of Ageia, a company that originally tried to bring higher quality physics simulation to desktop PCs in the mid-2000s. In 2008, NVIDIA acquired Ageia and Manju went along, becoming NVIDIA’s VP of CUDA Technical Marketing. The CUDA fit was a natural one for Manju as he spent the previous three years working on non-graphics workloads for highly parallel processors. Two years later, Manju made his way to AMD to continue his vision for heterogeneous compute work on GPUs. His current role is as the Corporate VP of Heterogeneous Applications and Developer Solutions at AMD.
Given what we know about the new AMD and its goal of building a Heterogeneous Systems Architecture (HSA), Manju’s position is quite important. For those of you who don’t remember back to AMD’s 2012 Financial Analyst Day, the formalized AMD strategy is to exploit its GPU advantages on the APU front in as many markets as possible. AMD has a significant GPU performance advantage compared to Intel, but in order to capitalize on that it needs developer support for heterogeneous compute. A major struggle everyone in the GPGPU space faced was enabling applications that took advantage of the incredible horsepower these processors offered. With AMD’s strategy closely married to doing more (but not all, hence the heterogeneous prefix) compute on the GPU, it needs to succeed where others have failed.
The hardware strategy is clear: don’t just build discrete CPUs and GPUs, but instead transition to APUs. This is nothing new, as both AMD and Intel have been headed in this direction for years. Where AMD sets itself apart is that it is willing to dedicate more transistors to the GPU than Intel. The CPU and GPU are treated almost as equal class citizens on AMD APUs, at least when it comes to die area.
The software strategy is what AMD is working on now. AMD’s Fusion12 Developer Summit (AFDS), in its second year, is where developers can go to learn more about AMD’s heterogeneous compute platform and strategy. Why would a developer attend? AMD argues that the speedups offered by heterogeneous compute can be substantial enough that they could enable new features, usage models or experiences that wouldn’t otherwise be possible. In other words, taking advantage of heterogeneous compute can enable differentiation for a developer.
That brings us to today. In advance of this year’s AFDS, Manju has agreed to directly answer your questions about heterogeneous compute, where the industry is headed and anything else AMD will be covering at AFDS. Manju has a BS in Electrical Engineering (IIT, Bombay) and a PhD in Computer Information and Control Engineering (UMich, Ann Arbor), so make the questions as tough as you can. He'll be answering them on May 21st, so keep the submissions coming.
101 Comments
B3an - Monday, May 14, 2012 - link
I have some questions for Manju...
1: Could an OS use GPU compute in the future to speed up everyday tasks, apart from the usual stuff like the UI? What possible tasks would this be? And is it possible we'll see this happen within the next few years?
2: Are you excited about Microsoft's C++ Accelerated Massive Parallelism (AMP)? Do you think we'll see a lot more software using GPU compute now that Visual Studio 11 will include C++ AMP support? (A sketch of what C++ AMP code looks like appears after these questions.)
3: Do you expect the next gen consoles to make far more use of GPU compute?
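For readers unfamiliar with C++ AMP, here is a minimal sketch of the programming model question 2 refers to, assuming the Visual Studio 11 toolchain and its <amp.h> header; the function name is hypothetical:

```cpp
#include <amp.h>
#include <vector>

// Double every element of a vector, offloading the loop to an
// accelerator (the GPU, if one is available) via C++ AMP.
void double_all(std::vector<int>& data) {
    concurrency::array_view<int, 1> av(static_cast<int>(data.size()), data);
    concurrency::parallel_for_each(av.extent,
        [=](concurrency::index<1> idx) restrict(amp) {
            av[idx] *= 2;  // each index runs as one GPU thread
        });
    av.synchronize();  // copy results back to host memory
}
```

The restrict(amp) qualifier marks the lambda as compilable for the accelerator; the runtime falls back to the CPU when no capable GPU is present.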
BenchPress - Monday, May 14, 2012 - link
1: Your best bet is AVX2, not GPGPU. Any 32-bit code loop with independent iterations can be sped up by a factor of up to eight using AVX2. And since it's part of the CPU's instruction set, there's no data or commands to send back and forth between the CPU and GPU. Also, you won't have to wait long for AVX2 to make an impact. Compilers are ready to support it today, and it takes very little if any developer effort. (A sketch of such a loop follows this comment.)
2: It's just OpenCL in disguise. Yes, it supports a few C++ constructs, but it still has many of the same limitations. AVX2 doesn't impose any limitations. In fact you can use it with any programming language you like.
3: I'd rather hope next gen consoles have AVX2 or similar technology (i.e. a vector equivalent of every scalar instruction, including gather).
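To make BenchPress's point 1 concrete, here is a minimal sketch (assuming a compiler with AVX2 support, e.g. gcc -O2 -mavx2, and hypothetical add_scalar/add_avx2 helper names) of the kind of independent-iteration loop in question, first in plain scalar form and then hand-written with AVX2 intrinsics:

```cpp
#include <immintrin.h>

// Scalar loop with independent iterations -- the form a compiler
// can auto-vectorize to AVX2 with little or no developer effort.
void add_scalar(const int* a, const int* b, int* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// The same loop written with AVX2 intrinsics: eight 32-bit adds per
// instruction, which is where the "factor of up to eight" comes from.
void add_avx2(const int* a, const int* b, int* out, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_add_epi32(va, vb));
    }
    for (; i < n; ++i)  // scalar tail for leftover elements
        out[i] = a[i] + b[i];
}
```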
Omoronovo - Monday, May 14, 2012 - link
That wasn't really what he had asked, though. Is there anything stopping AVX2 AND GPGPU being used in parallel to speed up *more* tasks than either one alone? This is the focus (and direction) of AMD's current work on heterogeneous compute, and it remains a question that has not been fully answered.
I would love to see the day when simple everyday business tasks like running Excel have the same kind of integrated GPU compute ability as, say, web browsers have gained over the last few years. I personally am of the opinion that we are still a long way from that, but AMD seems to be betting on the "watershed" moment happening a lot sooner.
Omoronovo - Monday, May 14, 2012 - link
I apologize for my typing and spelling errors, I hit reply before reading it over, and forgot AnandTech has no edit capability in the inline comments.
BenchPress - Monday, May 14, 2012 - link
AVX2 is more versatile than GPGPU, and just as powerful. So why would you want them both? We could just have a homogeneous CPU with more cores instead. Of course that makes TSX another piece of critical technology, and AVX-1024 will be required to lower the power consumption. But it's obvious that GPGPU has no future when the CPU can incorporate the same technology as the GPU.
AMD is betting on something that will never happen. Developers are very reluctant to invest time and money in technology from one small vendor. The ROI is very low and will decline over time. The CPU and GPU have been growing closer together, and the next step is to merge them. AVX2 is a major step toward that, making it a safe bet for developers to support.
SleepyFE - Tuesday, May 15, 2012 - link
I'm not sure you noticed, but more cores is the problem. Not everything is a server, and even in servers power consumption matters. For desktop processors you still only get 4 cores from Intel, and they don't seem too keen on making 6 or 8 core parts. GCN is good because SIMDs are paired into larger units, which might allow more flexibility. If you don't need as many, just split a unit in 2 and you can have 2 apps running on one physical unit. SIMD is already in CPUs, but AMD put them in a GPU, when they could just put more SIMDs on the CPU and try to make the system recognize them as a GPU.
Power gating, my friend. If the SIMD is on the CPU core it has to run with the core (I think). So here it is: power gating and flexibility. And they can probably move AVX to the GPU as well.
BenchPress - Tuesday, May 15, 2012 - link
Yes, too many cores is a problem currently, but that's precisely why Haswell adds TSX technology!
Sandy Bridge features power gating for the upper lane of AVX. So there's no waste when not using it.
And no, AVX cannot move to the GPU. It's an integral part of x86 and moving all of it over to the GPU would simply turn the GPU into more CPU cores.
The only remaining problem with AVX2 is high power consumption. Not from the ALUs, but from the rest of the pipeline. But this can be fixed with AVX-1024, by executing 1024-bit instructions over four cycles on 256-bit units. This allows clock gating large parts of the pipeline for 3/4 of the time and lowers switching activity elsewhere.
A5 - Monday, May 14, 2012 - link
AVX2 is nice, but it isn't the solution to all of these problems.
For one, it is Intel only, and will only be available on Haswell and later CPUs. Considering that all MMX, SSEn, etc. "required" was a compiler update and new hardware as well, you can look at those for realistic adoption timelines in normal applications (i.e., a couple of years at best).
GPGPU is good for now because it works on existing hardware (there are far more compute-capable GPUs than Haswell processors at the moment...).
BenchPress - Monday, May 14, 2012 - link
AVX2 is not Intel only. AMD has already implemented AVX support and will add AVX2 support as soon as possible.
Furthermore, you can't compare AVX2 to MMX and SSE. The latter two are 'horizontal' SIMD instruction set extensions; they're only suitable for explicit vector math. AVX2, on the other hand, is a 'vertical' SIMD instruction set extension. It is highly suitable for the SPMD programming model also used by GPUs: it allows you to write scalar code and have multiple loop iterations execute in parallel. It's a whole new paradigm for CPUs.
So it will be adopted extremely fast. It is instantly applicable to any GPGPU workload, but more flexible in every way. Meanwhile NVIDIA has crippled GPGPU performance in the GTX 6xx series so developers are not inclined to rely on the GPU for generic computing. One small manufacturer offering APUs isn't going to change that. Intel has the upper hand here.
AMD has to embrace homogeneous computing to stand a chance. With hardware quickly becoming bandwidth limited, ease of programmability will be a primary concern. GPGPU is horrendous in this regard. It's currently impossible to write code which runs well on GPUs from each manufacturer. AVX2 won't suffer from this because it has no heterogeneous bottlenecks, low instruction and memory latencies, great data locality, a large call stack, etc.
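The 'vertical' SIMD point above is easiest to see with AVX2's gather instruction, which lets the indexed loads of a scalar loop map onto a single vector instruction. A minimal sketch, assuming an AVX2-capable compiler and a hypothetical gather_avx2 helper name:

```cpp
#include <immintrin.h>

// SPMD-style vectorization of a scalar indexed load:
//   for (i) out[i] = table[idx[i]];
// One AVX2 gather serves eight "loop iterations" at once.
void gather_avx2(const int* table, const int* idx, int* out, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i vidx = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx + i));
        // Fetch eight 32-bit values from table at the eight indices;
        // the last argument is the element scale in bytes (4 for int).
        __m256i v = _mm256_i32gather_epi32(table, vidx, 4);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), v);
    }
    for (; i < n; ++i)  // scalar tail for leftover elements
        out[i] = table[idx[i]];
}
```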
Jaybus - Thursday, May 17, 2012 - link
There is little doubt that coding for AVX2 is less complex than coding for GPGPU. But what would happen if those bandwidth bottlenecks were drastically mitigated? Say Intel gets their on-chip silicon photonics stuff working and enables a chip-to-chip optical bus with a subsequent orders-of-magnitude increase in bandwidth. Would it still be better to have all linked chips use identical cores? Or would it be better to have a mix, where all cores on a particular chip are homogeneous, but each chip may have a different type of core? I can see advantages to both, but for programming and OS scheduling, a bunch of like cores is certainly simpler.