Intel’s "Knights Landing" Xeon Phi Coprocessor Detailed
by Ryan Smith on June 26, 2014 10:00 AM EST

Continuing our ISC 2014 news announcements for the week, next up is Intel, which has taken to the show to announce further details about the company’s forthcoming Knights Landing processor, the next generation of Intel’s Xeon Phi processors.
While Knights Landing in and of itself is not a new announcement – Intel initially announced it last year – Intel has offered very few details on the makeup of the processor until now. However, with Knights Landing scheduled to ship roughly a year from now, Intel is ready to offer up further details about the processor and its capabilities.
As previously announced, as the successor to Intel’s existing Knights Corner (1st generation Xeon Phi), Knights Landing makes the jump from Intel’s enhanced Pentium 1 (P54C) x86 cores to the company’s modern Silvermont x86 cores, which currently lie at the heart of Intel’s Atom processors. These Silvermont cores are far more capable than the older P54C cores and should significantly improve Intel’s single-threaded performance. All the while, these cores are further modified to incorporate 512-bit vector units supporting AVX-512F operations, which provide the bulk of Knights Landing’s computing power and are a similarly potent upgrade over Knights Corner’s more basic 512-bit SIMD units.
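To put those vector units in more concrete terms, below is a minimal sketch of an FP64 AXPY loop written against the AVX-512F (foundation) intrinsics that Knights Landing is slated to support. The function name, data layout, and use of unaligned loads are our own illustrative choices rather than anything Intel has published; the point is simply that each vector instruction now operates on eight doubles at once, with fused multiply-add doing the heavy lifting.

```c
// Minimal sketch: FP64 AXPY (y = alpha*x + y) with AVX-512F intrinsics.
// Illustrative only - function name and memory layout are assumptions.
#include <immintrin.h>
#include <stddef.h>

void daxpy_avx512(size_t n, double alpha, const double *x, double *y)
{
    __m512d va = _mm512_set1_pd(alpha);           // broadcast alpha to 8 lanes
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m512d vx = _mm512_loadu_pd(x + i);      // load 8 doubles from x
        __m512d vy = _mm512_loadu_pd(y + i);      // load 8 doubles from y
        vy = _mm512_fmadd_pd(va, vx, vy);         // fused multiply-add
        _mm512_storeu_pd(y + i, vy);              // store 8 results
    }
    for (; i < n; ++i)                            // scalar tail for leftovers
        y[i] = alpha * x[i] + y[i];
}
```

Compiled with AVX-512F support enabled (e.g. -mavx512f on GCC), each iteration of the vector loop retires 16 FP64 operations, which is where the bulk of the chip’s floating point throughput would come from.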
All told, Intel is planning on offering Knights Landing processors containing up to 72 of these cores, with double precision floating point (FP64) performance expected to exceed 3 TFLOPS. This will of course depend in part on Intel’s yields and clockspeeds – Knights Landing will be a 14nm part, a node whose first products won’t reach end users’ hands until late this year – so while Knights Landing’s precise performance is up in the air, Intel is making it extremely clear that they are aiming very high.
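For the curious, the 3 TFLOPS figure is easy to sanity-check with back-of-the-envelope math. Only the 72-core count and the >3 TFLOPS target come from Intel; the per-core vector unit count and clockspeed below are our own assumptions, plugged in purely to show how the numbers could plausibly add up.

```c
// Back-of-the-envelope peak FP64 estimate for a hypothetical 72-core part.
// The 2 vector units per core and the 1.3GHz clock are assumptions, not
// Intel-confirmed figures.
#include <stdio.h>

int main(void)
{
    const double cores = 72.0;
    const double vpus  = 2.0;   // assumed 512-bit vector units per core
    const double lanes = 8.0;   // FP64 lanes per 512-bit unit
    const double fma   = 2.0;   // 2 FLOPs per lane per fused multiply-add
    const double ghz   = 1.3;   // assumed clockspeed, illustration only

    double tflops = cores * vpus * lanes * fma * ghz / 1000.0;
    printf("Estimated peak: %.2f TFLOPS FP64\n", tflops);  // ~3.0 TFLOPS
    return 0;
}
```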
Which brings us to this week and Intel’s latest batch of details. With last year’s disclosures focusing on the heart of the beast, Intel is spending ISC 2014 explaining how they intend to feed it. A processor that can crunch that much data is not going to be easy to keep fed, so Intel is turning to some very cutting-edge technologies to do so.
First and foremost, when it comes to memory Intel has found themselves up against a wall. With Knights Corner already using a very wide (512-bit) GDDR5 memory bus, Intel is in need of an even faster memory technology to replace GDDR5 for Knights Landing. To accomplish this, Intel and Micron have teamed up to bring a variant of Hybrid Memory Cube (HMC) technology to Knights Landing.
Through the HMC Consortium, both Intel and Micron have been working on developing HMC as a next-generation memory technology. By stacking multiple DRAM dies on top of each other, connecting those dies to a controller at the bottom of the stack using Through Silicon Vias (TSVs), and then placing those stacks on-package with a processor, HMC is intended to greatly increase the amount of memory bandwidth that can be used to feed a processor. This is accomplished by putting said memory as close to the processor as possible, allowing for what’s essentially an extremely wide memory interface through which an enormous amount of memory bandwidth can be delivered.
Image Courtesy InsideHPC.com
For Knights Landing, Intel and Micron will be using a variant of HMC designed just for Intel’s processor. Called Multi-Channel DRAM (MCDRAM), this variant replaces HMC’s standard memory interface with a custom interface better suited for Knights Landing. The end result is a memory technology that can scale up to 16GB of RAM while offering up to 500GB/sec of memory bandwidth (nearly 50% more than Knights Corner’s GDDR5), with Micron providing the MCDRAM modules. Given all of Intel’s options for the 2015 time frame, the use of a stacked DRAM technology is among the most logical and certainly the most expected (we’ve already seen NVIDIA plan to follow the same route with Pascal); however the use of a custom technology instead of standard HMC for Knights Landing comes as a surprise.
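How all of that bandwidth gets exposed to software is still an open question. One plausible model – and this is our speculation, not something Intel has confirmed for Knights Landing – is a separately allocatable high-bandwidth pool along the lines of the open source memkind library’s hbwmalloc interface, sketched below with an illustrative buffer size.

```c
// Speculative sketch: placing a bandwidth-critical buffer in a fast
// on-package memory pool via memkind's hbwmalloc interface, with a
// fallback to ordinary DDR. Buffer size is a made-up example.
// Build with: gcc mcdram_sketch.c -lmemkind
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t bytes = 256UL << 20;   // 256MB working set (illustrative)
    int have_hbw = (hbw_check_available() == 0);

    double *buf = have_hbw ? hbw_malloc(bytes) : malloc(bytes);
    if (!buf)
        return 1;

    printf("Buffer placed in %s memory\n",
           have_hbw ? "high-bandwidth" : "ordinary system");

    // ... a bandwidth-bound kernel would run against buf here ...

    if (have_hbw)
        hbw_free(buf);
    else
        free(buf);
    return 0;
}
```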
Moving on, while Micron’s MCDRAM solves the immediate problem of feeding Knights Landing, RAM is only half of the challenge Intel faces. The other half of the challenge for Intel is in HPC environments where multiple Knights Landing processors will be working together on a single task, in which case the bottleneck shifts to getting work to these systems. Intel already has a number of fabrics at hand to connect Xeon Phi systems, including their own True Scale Fabric technology, but like the memory situation Intel needs a much better solution than what they are using today.
For Knights Landing Intel will be using a two-part solution. First and foremost, Intel will be integrating their fabric controller onto the Knights Landing processor itself, doing away with the external fabric controller, the space it occupies, and the potential bottlenecks that come from using a discrete fabric controller. The second part of Intel’s solution comes from developing a successor to True Scale Fabric – dubbed Omni Scale Fabric – to offer even better performance than Intel’s existing fabric solution. At this point Intel is being very tight-lipped about the Omni Scale Fabric specifications and just how much of an improvement in inter-system communications Intel is aiming for, but we do know that it is part of a longer term plan. Eventually Intel intends to integrate Omni Scale Fabric controllers not just into Knights Landing processors but into traditional Xeon CPUs as well, further coupling the two processors by allowing them to communicate directly through the fabric.
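For a sense of the traffic such a fabric actually carries, below is a bare-bones MPI reduction of the sort that dominates inter-node communication in HPC codes. Nothing in it is specific to True Scale or Omni Scale – MPI simply rides on whatever interconnect the cluster exposes – but it is exactly this kind of small, latency-sensitive exchange that an integrated fabric controller is meant to speed up.

```c
// Minimal MPI sketch: each node contributes a partial result and rank 0
// collects the sum across the fabric. Generic MPI, not fabric-specific.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;   // stand-in for a per-node partial result
    double total = 0.0;

    // The reduction's messages travel over the cluster interconnect.
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of partial results from %d ranks: %f\n", size, total);

    MPI_Finalize();
    return 0;
}
```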
Last but not least, thanks in large part to the consolidation offered by MCDRAM, Intel is also going to be offering Knights Landing in a new form factor. Along with the traditional PCIe card form factor that Knights Corner is available in today, Knights Landing will also be available in a socketed form factor, allowing it to be installed alongside Xeon processors in appropriate motherboards. Again looking to remove potential bottlenecks, socketing Knights Landing lets Intel connect it directly to other processors via QuickPath Interconnect rather than the slower PCI Express interface. Furthermore, a socketed Knights Landing would inherit the Xeon processors’ current NUMA capabilities, sharing memory and memory spaces with Xeon processors and allowing the two to work on a workload heterogeneously, as opposed to Knights Landing operating as a semi-isolated device at the other end of a PCIe connection. Ultimately Intel envisions programs being written once and run across both types of processors, and with Knights Landing being binary compatible with Haswell, socketing Knights Landing is the other part of the equation needed to make that goal come to fruition.
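To illustrate what those NUMA capabilities mean in practice, the sketch below uses the standard Linux libnuma interface to pin an allocation to a particular node, which is how multi-socket Xeon systems already manage memory locality today. The node number and buffer size are illustrative assumptions; how a socketed Knights Landing’s memory will actually be enumerated remains to be seen.

```c
// Sketch: NUMA-aware allocation with libnuma. Node 0 and the buffer size
// are illustrative; query the real topology at runtime in actual code.
// Build with: gcc numa_sketch.c -lnuma
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    printf("NUMA nodes visible: %d\n", numa_max_node() + 1);

    // Allocate a 1MB buffer on node 0 so code running on that socket
    // touches local rather than remote memory.
    size_t bytes = 1UL << 20;
    double *buf = numa_alloc_onnode(bytes, 0);
    if (!buf)
        return 1;

    for (size_t i = 0; i < bytes / sizeof(double); ++i)
        buf[i] = 0.0;   // first touch lands on the chosen node

    numa_free(buf, bytes);
    return 0;
}
```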
Wrapping things up, with this week’s announcements Intel is also announcing a launch window for Knights Landing. Intel expects to ship Knights Landing in H2’15 – roughly 12 to 18 months from now. In the meantime the company has already lined up its first Knights Landing supercomputer deal with the National Energy Research Scientific Computing Center (NERSC), which will power its forthcoming Cori supercomputer with 9,300 Knights Landing nodes. Intel currently enjoys being the CPU supplier for the bulk of the Top500-ranked supercomputers, and with co-processors becoming increasingly critical to these systems, Intel is shooting to become the co-processor vendor of choice as well.
Source: Intel
41 Comments
Morawka - Thursday, June 26, 2014 - link
holy batman that's a lot of x86 cores. 9300 nodes * 72 Cores per node = 669,600 baytrail cores with AES. kind of crazy when you think about it.
i wonder if cooling a socketed version with this Hyper cubed memory will be a real challenge. heat doesn't dissipate through memory very well.
Friendly0Fire - Thursday, June 26, 2014 - link
Yeah, I'm really curious about the form factor of a socketed variant. It sounds like it could be an extremely interesting combo with a dual-socket Xeon board for massively parallel workloads, perhaps even eating straight into NVIDIA and AMD's GPGPU lunch.

ddriver - Saturday, June 28, 2014 - link
Probably as much as it did eat from their GPU lunches. AMD already has a part delivering 2.6 DP tflops at 28 nm. And knowing Intel, this chip will probably cost an arm and a leg, and a set of good kidneys.

quixver - Wednesday, July 9, 2014 - link
Shut up and take my money ... and my body parts!

bassportout - Monday, July 21, 2014 - link
I'd give my left nut for this. Jk

madmilk - Thursday, June 26, 2014 - link
Traditional BGA packaged memory doesn't dissipate heat well because of the plastic case. On-package memory shouldn't be harder to cool than any other MCM.

FaaR - Friday, June 27, 2014 - link
DRAM memory doesn't dissipate very much power in the first place so cooling it should not be a big challenge. Traditionally, most of the power burnt by high-bandwidth memory ICs is because of the large, high-power transceivers needed. 7GHz GDDR5 is >50% transceiver IIRC. Stacked memory on the CPU substrate itself should require much less power... The Crystalwell eDRAM used in Haswell GT4 CPUs burns what, a watt or two at most? The tech used here is probably more advanced*.

*Blind guess, but seems logical. :)
tipoo - Thursday, June 26, 2014 - link
Very interesting architecture that I'll be interested in seeing the impact of. However I'm wondering, with these massively parallel chips with up to 72 cores, if the x86 ISA isn't going to be a huge hindrance. It's still a fairly big part of the chip; if you look at say a Jaguar floorplan it's about as big as an FPU, one of the biggest functional units on a CPU core. Times that by 72, and you're using a lot of die size and power just for ISA compatibility. I'd wonder if this is an area where starting from scratch wouldn't benefit Intel tremendously, lest ARM swoop in with its smaller ISA and do something better in less die area.

p3ngwin1 - Thursday, June 26, 2014 - link
"I'd wonder if this is an area where starting from scratch wouldn't benefit Intel tremendously"Well, you know what happened last time.....Itanium.....
Cogman - Thursday, June 26, 2014 - link
Which is a vastly superior uarch to x86. One of the main reasons it didn't catch on was that its x86 emulation wasn't fast enough. Itanium can kick the crap out of x86 for a large number of workloads.