New NPU: Intel NPU 4, Up to 48 Peak TOPS

Perhaps Intel's main focal point, from a marketing point of view, is the latest generational change to its Neural Processing Unit, or NPU. Intel has made significant strides with its latest NPU, aptly called NPU 4, claiming up to 48 TOPS of peak AI performance, although AMD disclosed a faster NPU during its Computex keynote. Compared with the previous model, NPU 3, NPU 4 is a giant leap in neural processing power and efficiency. The gains come from higher clock frequencies, a better power architecture, and a larger number of compute engines.

NPU 4 builds on these improvements with an enhanced vector performance architecture, more compute tiles, and better-optimized matrix computation. This delivers a great deal more neural processing bandwidth, which is critical for applications that demand high-speed data processing and real-time inference. The architecture supports INT8 and FP16 precision, with a maximum of 2048 MAC (multiply-accumulate) operations per cycle for INT8 and 1024 MAC operations per cycle for FP16, a significant increase in computational throughput.
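As a back-of-envelope sanity check, the quoted 48 peak TOPS figure can be reconstructed from the per-cycle MAC count. The engine count and clock speed below are illustrative assumptions, not Intel-confirmed specifications:

```python
# Back-of-envelope check of the 48 peak TOPS figure for INT8.
# Assumptions (not confirmed by the article): 6 neural compute
# engines, each sustaining 2048 MACs/cycle, at roughly 1.95 GHz.
MACS_PER_CYCLE_INT8 = 2048      # per engine, from the article
OPS_PER_MAC = 2                 # one multiply + one accumulate
NUM_ENGINES = 6                 # assumed engine count
CLOCK_HZ = 1.95e9               # assumed NPU clock

peak_ops = MACS_PER_CYCLE_INT8 * OPS_PER_MAC * NUM_ENGINES * CLOCK_HZ
print(f"peak INT8 throughput: {peak_ops / 1e12:.1f} TOPS")  # ~47.9 TOPS
```

With those assumed figures, the arithmetic lands almost exactly on Intel's headline number, which is why peak TOPS claims are best read as a product of MAC width, engine count, and clock rather than a measured workload result.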

A more in-depth look at the architecture reveals increased layering in NPU 4. Each neural compute engine in this fourth generation embeds an improved inference pipeline comprising MAC arrays and several dedicated DSPs for different types of computation. The pipeline is built for numerous parallel operations, enhancing both performance and efficiency. The new SHAVE DSP delivers four times the vector compute power of the previous generation, enabling more complex neural networks to be processed.

A significant improvement in NPU 4 is the increase in clock speed and the move to a new process node, which doubles performance at the same power level as NPU 3; combined with the wider compute, peak performance quadruples, making NPU 4 a powerhouse for demanding AI applications. The new MAC array features on-chip data conversion, allowing datatype conversion on the fly, fused operations, and output data layout that keeps the data flow optimal with minimal latency.
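To illustrate what fused on-chip data conversion replaces in software, the sketch below performs an INT8 matrix multiply with a wide accumulator and rescales the result to FP16 in the same pass, rather than running a separate conversion kernel. The function name and scale values are illustrative, not an Intel API:

```python
import numpy as np

# Sketch of the fused dataflow described above: an INT8 matrix multiply
# accumulated at higher precision, with the output rescaled ("converted
# on the fly") to FP16 in the same pass. Scales are illustrative.
def int8_matmul_fp16_out(a_q, b_q, a_scale, b_scale):
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)      # wide accumulator
    return (acc * (a_scale * b_scale)).astype(np.float16)  # fused convert

a_q = np.array([[10, -20], [30, 40]], dtype=np.int8)
b_q = np.array([[1, 2], [3, 4]], dtype=np.int8)
y = int8_matmul_fp16_out(a_q, b_q, a_scale=0.1, b_scale=0.5)
print(y.dtype, y.shape)  # float16 (2, 2)
```

Doing the conversion as part of the MAC pipeline avoids a second trip through memory for the intermediate INT32 tensor, which is exactly the latency the article says the new MAC array is designed to eliminate.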

The bandwidth improvements in NPU 4 are essential for handling bigger models and datasets, especially in transformer-based language model applications. The architecture supports higher data flow, reducing bottlenecks and keeping the compute engines fed even under heavy load. The DMA (Direct Memory Access) engine of NPU 4 doubles DMA bandwidth, an essential addition for improving network performance and handling heavy neural network models. Additional functions, including embedding tokenization, are also supported, expanding what NPU 4 can do.
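The reason DMA bandwidth matters so much for LLMs is that single-batch token generation is typically bandwidth-bound: every weight must be streamed once per generated token. A rough ceiling, using purely illustrative numbers (not Lunar Lake specifications):

```python
# Bandwidth-bound upper limit on LLM token generation.
# All figures are illustrative assumptions, not Lunar Lake specs.
model_params = 3e9          # a hypothetical 3B-parameter model
bytes_per_param = 1         # INT8 weights
bandwidth_bytes_s = 60e9    # assumed effective memory bandwidth

tokens_per_sec = bandwidth_bytes_s / (model_params * bytes_per_param)
print(f"upper bound: ~{tokens_per_sec:.0f} tokens/sec")  # ~20 tokens/sec
```

Under this simple model, doubling the DMA bandwidth roughly doubles the token-rate ceiling, which is why the 2x DMA improvement is singled out for transformer workloads.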

The biggest improvement in NPU 4 is in matrix multiplication and convolution, where the MAC array can process up to 2048 MAC operations in a single cycle for INT8 and 1024 for FP16. This lets the NPU handle much more complex neural network calculations at higher speed and lower power. The dimension of the vector register file also makes a difference: NPU 4's registers are 512 bits wide, so more vector operations can be completed per clock cycle, further improving computational efficiency.
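The practical meaning of a 512-bit register file is the number of elements each register holds at a given precision, which is straightforward to derive:

```python
# Lane counts implied by a 512-bit vector register file.
REG_BITS = 512
lanes_int8 = REG_BITS // 8    # INT8 elements per register
lanes_fp16 = REG_BITS // 16   # FP16 elements per register
print(lanes_int8, lanes_fp16)  # 64 32
```

So each vector instruction can touch 64 INT8 or 32 FP16 elements at once, which is where the per-clock efficiency gain comes from.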

NPU 4 supports a wider variety of activation functions, allowing it to handle virtually any neural network, with a choice of precision for floating-point calculations that makes computations more precise and reliable. The improved activation functions and optimized inference pipeline let it run more complicated and nuanced neural network models with better speed and accuracy.

The upgraded SHAVE DSP in NPU 4, with four times the vector compute power of NPU 3's, contributes to a 12x overall increase in vector performance. This is most useful for transformer and large language model (LLM) workloads, making them faster and more energy efficient. The larger vector register file enables more vector operations per clock cycle, which significantly boosts NPU 4's computational capability.
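One plausible way the 12x vector figure decomposes is the stated 4x per-DSP gain multiplied by a 3x increase in DSP/engine count; the 3x factor is an assumption for illustration, not an Intel-confirmed breakdown:

```python
# Hypothetical decomposition of the quoted 12x vector uplift.
per_dsp_gain = 4       # 4x SHAVE DSP vector compute (from the article)
engine_count_gain = 3  # assumed: e.g. tripling the DSP/engine count
overall = per_dsp_gain * engine_count_gain
print(f"{overall}x overall vector performance")  # 12x
```

Whatever the exact split, the point is that the headline multiplier is a product of per-unit width and unit count, not a single microarchitectural change.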

Overall, NPU 4 presents a big performance jump over NPU 3: 12 times the vector performance, four times the TOPS, and twice the IP bandwidth. These improvements make NPU 4 a high-performing, efficient fit for modern AI and machine learning applications where performance and latency are critical. Together with the advances in data conversion and bandwidth, these architectural improvements make NPU 4 a top-of-the-line solution for managing very demanding AI workloads.

  • mode_13h - Tuesday, June 4, 2024 - link

    Yeah, it definitely comes across as two-faced for Intel to be pitching its foundry business to others, while it's not even using it for its own cutting-edge CPUs!
  • kn00tcn - Tuesday, June 4, 2024 - link

    1) bob (or brian?) made a deal with tsmc and they need fill the required capacity

    2) tsmc chiplet packaging requires all tiles to come from tsmc (but mixing tile foundries is fine as long as someone else packages)

    3) lunar lake isnt high power high core desktop/server, there's plenty else to make themselves, and obviously they've been ramping cutting edge future nodes

    4) these things take years, why would a recent subsidy relate to old deals
  • mode_13h - Thursday, June 6, 2024 - link

    > 1) bob (or brian?) made a deal with tsmc and they need fill the required capacity

    This is probably the dumbest claim I've seen in a while. There's guaranteed to be an escape clause in that contract, although Intel would be stuck with some fee.

    Given the current demand for cutting-edge nodes, I'm sure Intel could probably work out an agreement with another fab customer to buy their excess wafer capacity and probably even turn a profit by it.

    > 2) tsmc chiplet packaging requires all tiles to come from tsmc

    Second dumbest claim in the thread. Lunar Lake uses Foveros, not TSMC's technology, and Intel is making the base layer on their own 22 nm node.

    > 3) lunar lake isnt high power high core desktop/server

    What does that have to do with anything? It still needs to compete on performance and efficiency!

    > why would a recent subsidy relate to old deals

    Who said anything about that?
  • kwohlt - Tuesday, June 4, 2024 - link

    Intel's foundry service doesn't have a full suite of nodes to choose from and is currently building out fabs. In the meantime, client will be using some of the TSMC N3B allocation that Intel carved out years ago. Expect 2024-2025 to be peak TSMC usage.

    What other options were realistically available? Intel 3 is just hitting the market and fully allocated to Xeon 6 initially. Intel 4 isn't library complete and wouldn't work for a tile that also contains NPU and GPU. Intel 7 is heavy DTCO'd for ADL/RPL and has poor low wattage performance. 18A isn't ready yet.

    By the time 14A releases, Intel will have a selection of 18A and the Intel 3 family of nodes to pick from for their other CPU tiles.
  • mode_13h - Thursday, June 6, 2024 - link

    > Intel 3 is just hitting the market and fully allocated to Xeon 6 initially

    The Lunar Lake CPU tiles can't be very big. They should've been a good "pipe cleaner" product for Intel to ramp up their "3" node, before making the huge Xeon dies.

    I hadn't noticed the GPU was on the same tile. If true, I think they could've kept it on its own tile, as Meteor Lake did.
  • lmcd - Wednesday, June 12, 2024 - link

    Intel has not shipped an Xe product on an Intel process since DG1. We don't know that it ports.

    Adding a separate die might have increased the package size, and part of the point of this product was to be a small package that could supplant Qualcomm designs easily (and the PMIC callout was specifically targeted at vendors that got burned by Qualcomm's power shenanigans, if you believe Charlie).
  • andrewaggb - Thursday, June 6, 2024 - link

    yeah, it's not a great look on the fab side, but honestly I hope it's an amazing chip and worth upgrading. I hope Qualcomm's chip is great as well and get some actual innovation/competition going on.
  • eonsim - Tuesday, June 4, 2024 - link

    Is Intel comparing their new E-cores to the LP-E cores here (the ones on the SoC with no L3), rather than the main E-cores for Meteor Lake?
  • mode_13h - Tuesday, June 4, 2024 - link

    +1
  • name99 - Wednesday, June 5, 2024 - link

    Exactly.
    And judging from what I've seen on the internet, plenty of people were fooled. And don't like to be told that they were fooled...
