Arm Cortex-X925: Leading The Way in Single-Threaded IPC

The Arm Cortex-X925, codenamed "Black Hawk," stands at the forefront of single-threaded instructions per clock (IPC) performance, at least according to Arm's own claims, and is pitched as a sizable step forward in both performance and efficiency. This core is a pivotal part of Arm's move to the 3 nm process node and integrates into the second-generation Armv9.2 architecture. If Arm's claims were taken as gospel, the Cortex-X925 would be positioned as a leader in high-performance mobile computing, and it is an example of how Arm's focus on highly efficient PPA (power, performance, and area) drives its 2024 CPU Core Cluster.

The Cortex-X925 is built on architectural improvements designed to maximize IPC. One of the standout features is its 10-wide decode and dispatch width, significantly increasing the number of instructions processed per cycle. This enhancement allows the core to execute more instructions simultaneously, leading to better utilization of execution units and higher overall throughput.

Arm has doubled the instruction window size to support this wide instruction path, allowing more instructions to be held in flight at any given time. This reduces stalls and improves the efficiency of the execution pipeline. Additionally, the core boasts a 2X increase in L1 instruction cache (I$) bandwidth and a similar increase in L1 instruction translation lookaside buffer (TLB) size. These enhancements ensure that the core can quickly fetch and decode instructions, minimizing delays and maximizing performance.
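One rough way to see why a wider front end needs a larger instruction window is Little's law: sustained throughput cannot exceed the number of instructions in flight divided by the average time each one occupies the window. The sketch below is a toy model rather than anything published by Arm; the window sizes and the 40-cycle average latency are hypothetical figures chosen only for illustration.

```python
def sustained_ipc_bound(dispatch_width: int, window_size: int,
                        avg_latency_cycles: float) -> float:
    """Upper bound on sustained IPC for a simple out-of-order model.

    Little's law: instructions in flight = throughput * latency, so
    throughput can exceed neither window_size / avg_latency_cycles
    nor the dispatch width itself.
    """
    if avg_latency_cycles <= 0:
        raise ValueError("avg_latency_cycles must be positive")
    return min(float(dispatch_width), window_size / avg_latency_cycles)


# Hypothetical numbers, NOT Cortex-X925 specifications.
baseline_window = sustained_ipc_bound(dispatch_width=10, window_size=200,
                                      avg_latency_cycles=40.0)
doubled_window = sustained_ipc_bound(dispatch_width=10, window_size=400,
                                     avg_latency_cycles=40.0)

print(f"IPC bound with baseline window: {baseline_window}")
print(f"IPC bound with doubled window:  {doubled_window}")
```

In this toy setup the 10-wide dispatch is starved by the smaller window, and only the doubled window lets the bound reach the full dispatch width. Real cores are limited by many other factors, so this only illustrates the direction of the trade-off.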

The Cortex-X925 also features a highly advanced branch prediction unit, which reduces the number of mispredicted branches. By incorporating techniques such as folded-out unconditional direct branches, Arm has removed several architectural roadblocks, enabling a more streamlined and efficient execution path. This leads to fewer pipeline flushes and higher sustained IPC.

The front end of the Arm Cortex-X925 carries many of these improvements, which aim to boost instruction throughput and reduce latency. Central to them is the 10-wide decode and dispatch width, which allows the core to handle more instructions per cycle than previous architectures. This wide instruction path increases parallelism in instruction processing, enabling the core to keep more instructions moving through the pipeline at once.

Additionally, the Cortex-X925 features a doubled instruction window size, accommodating more instructions in flight and minimizing pipeline stalls. The L1 instruction cache (I$) bandwidth has also been increased by 2x, along with a similar expansion in the L1 instruction translation lookaside buffer (iTLB) size. These enhancements ensure that the core can quickly fetch and decode instructions, significantly reducing fetch bottlenecks and improving overall performance.

The backend of the Cortex-X925 has seen a 25-40% increase in out-of-order (OoO) execution capabilities. This growth allows the core to execute instructions more flexibly and efficiently, reducing idle times and improving overall performance. Furthermore, the core's register file structure has been enhanced, and the reorder buffer and instruction issue queues have been enlarged, contributing to smoother and faster instruction execution.

Despite its high performance, the Cortex-X925 is designed to be power efficient. The 3 nm process technology is crucial, enabling better power efficiency than previous generations. The core's design includes features such as dynamic voltage and frequency scaling (DVFS), which allows it to adjust power and performance levels based on the workload. This ensures energy is used efficiently, extending battery life and reducing thermal output.
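To illustrate why DVFS saves power, the standard first-order model for dynamic CMOS power is P = C * V^2 * f, so lowering voltage together with frequency cuts power faster than it cuts clock speed. The sketch below applies that formula to two hypothetical operating points; the capacitance, voltages, and frequencies are assumptions made up for illustration and are not Cortex-X925 figures.

```python
def dynamic_power_watts(c_eff_farads: float, voltage_v: float,
                        freq_hz: float) -> float:
    """First-order dynamic power model: P = C_eff * V^2 * f.

    Leakage (static) power is ignored in this simplified sketch.
    """
    return c_eff_farads * voltage_v ** 2 * freq_hz


# Hypothetical operating points, chosen only to show the scaling.
C_EFF = 1e-9  # assumed effective switched capacitance, in farads

high = dynamic_power_watts(C_EFF, voltage_v=1.0, freq_hz=3.0e9)
low = dynamic_power_watts(C_EFF, voltage_v=0.8, freq_hz=2.0e9)

power_ratio = low / high          # fraction of power at the lower point
freq_ratio = 2.0e9 / 3.0e9        # fraction of clock speed retained

print(f"high point: {high:.2f} W, low point: {low:.2f} W")
print(f"power ratio: {power_ratio:.2f}, frequency ratio: {freq_ratio:.2f}")
```

Under these assumed numbers the lower operating point keeps about two thirds of the clock speed while drawing well under half the dynamic power, which is the basic reason scaling voltage with frequency pays off on lighter workloads.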

The Cortex-X925 also incorporates advanced power management features, such as per-core DVFS and improved voltage regulation. These features help manage power consumption more effectively, ensuring the core delivers high performance without compromising energy efficiency. This balance is particularly beneficial for mobile devices requiring sustained performance and long battery life.

The Cortex-X925 is also optimized for AI-based workloads, with dedicated AI accelerators and software optimizations that enhance AI processing efficiency. With up to 80 TOPS (trillion operations per second), the core can handle complex AI tasks, from natural language processing to computer vision. These capabilities are further supported by Arm's Kleidi AI and Kleidi CV libraries, which provide developers with the tools needed to build advanced AI applications.

Interestingly, Arm hasn't moved into the realm of NPU or AI accelerators itself. Instead, it allows its partners, such as MediaTek, to incorporate their own, ensuring that the Core Cluster can provide the necessary support and integration capabilities. With its reference software stack and optimized libraries, the CSS platform provides a robust foundation for developers. The included Arm Performance Studio offers advanced tooling environments that help developers optimize their applications for the new architecture.

The CSS platform's integration with operating systems such as Android, Linux variants, and Windows (through the reinvigorated Windows on Arm) ensures broad compatibility and ease of development. This cross-operating system support enables developers to quickly and efficiently build applications that leverage the capabilities of the Cortex-X925, along with the entirety of the updated Armv9.2 Core Cluster, which not only accelerates time-to-market but also ensures compatibility across multiple devices.


55 Comments

  • ET - Thursday, May 30, 2024 - link

    I'm not sure why you're attributing this to insecurity and desperation when it's all about money. I can understand why end users would prefer companies to invest in things they feel are more relevant, but jumping on bandwagons (and driving them forward) is exactly the thing that companies wanting to keep their market healthy should do.
  • GeoffreyA - Thursday, May 30, 2024 - link

    Agreed; it is all about money. Generally, it is not to the benefit of the consumer or the world. An AI PC might be good for Jensen, Pat, Satya, Tim, Lisa, and co. but does not help most people.
  • mode_13h - Thursday, May 30, 2024 - link

    Ooh, you just got "named!"

    Seriously, your comment does indeed sound snarky and your reply sounds defensive and even a bit insecure. I don't think name99 was suggesting that you should want to be a genius, but rather pointing out that it pays to think beyond a single track.

    > when one see Microsoft and Intel making an "AI PC," or AMD calling their
    > CPU "Ryzen AI," and so on, it is little about true AI and more about money,
    > checklists, and the bandwagon.

    I'm reminded of when 3D-capable GPUs went so mainstream you could scarcely buy a PC without it. Yet, the killer app for the average PC user had yet to be invented. To some extent, the hardware needs to lead the way before mainstream apps can fully exploit the technology, because software companies aren't going to invest the time & effort in making features & functionality that only a tiny number of users can take advantage of.

    Also, you say you want AI models to use little power, but progress happens incrementally and having hardware assist indeed improves the efficiency of inferencing on models that aren't all as big or demanding as LLMs.
  • GeoffreyA - Thursday, May 30, 2024 - link

    Fair enough. I apologise to everyone for negative connotations in my comment and replies, but the companies are free game and we ought to poke fun at them. I'm fed up with the lies, marketing, double standards, doublespeak, and nonsense. These companies are only after money, and we are the fools at the end of the day. The last few years it was cloud; now, it's AI. What's next?
  • GeoffreyA - Thursday, May 30, 2024 - link

    As I've said, both here and in several comments elsewhere, AI and LLMs are of immense interest to me. I believe they're the Stone Age version of the stuff in our brains. What I'm trying to criticise is not LLMs or the technology, but the marketing ripoff that is bombarding us everywhere, this so-called AI PC, Copilot PC, or whatever Apple calls theirs. It's laughable the way they're plastering the term AI all over products.
  • SydneyBlue120d - Thursday, May 30, 2024 - link

    Can we expect Samsung S25 3nm Exynos 2500 SOC to be based on these cores?
  • eastcoast_pete - Sunday, June 2, 2024 - link

    After their rather poor showing with their Mongoose custom cores, I'd be very surprised if Samsung doesn't stick with ARM's designs for the CPU side of the Exynos 2500. What's (IMHO) really interesting right now is what Samsung will use for their GPU for the 2500. Rumors abound, many saying that they'll walk away from RDNA and use an in-house designed GPU, or come back to the ARM Mali mothership. The latter would put them in an awkward position, as MediaTek is likely the first out of the gate with their new 9400 featuring both the newest ARM cores and whatever the new version of Immortalis will be called. And MediaTek's Dimensity 9400 is (will be?) fabbed on TSMC's newest 3 nm node, so Samsung will want to have maximum differentiation here.
  • James5mith - Thursday, May 30, 2024 - link

    "The enhanced AI capabilities ensure these applications run efficiently and effectively, delivering faster and more accurate results."

    ARM hardware will magically fix AI algorithms to be better than they otherwise would be? Really?!?
  • mode_13h - Thursday, May 30, 2024 - link

    They're probably referring to the fact that it can deliver good inferencing performance without having to resort to the sorts of extreme quantization behind some companies' TOPS claims. Quantization often comes at the expense of accuracy, especially if it's done after training, rather than the model being designed and trained to utilize some amount of quantized weights.
  • James5mith - Thursday, May 30, 2024 - link

    Also, amazing increases in performance per watt don't mean less power draw. If it draws 3x the power to do 4x the work, then it's increased efficiency 1.33x. But it's still drawing 3x the power. That means a battery will be drained 3x faster.

    Saying the 30 W SoC does work more efficiently than the 10 W SoC doesn't make it draw less power.
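The arithmetic in the comment above can be checked with a short sketch. It uses only the ratios the commenter gives (3x the power, 4x the work); the function names are hypothetical, and the sustained-load assumption (both chips running flat out the whole time) is stated explicitly because it is what makes the "drained 3x faster" figure hold.

```python
def efficiency_gain(power_ratio: float, work_ratio: float) -> float:
    """Performance-per-watt improvement relative to the baseline chip."""
    return work_ratio / power_ratio


def sustained_runtime_ratio(power_ratio: float) -> float:
    """Battery runtime relative to baseline, assuming both chips run
    at full load continuously on the same battery."""
    return 1.0 / power_ratio


def energy_per_task_ratio(power_ratio: float, work_ratio: float) -> float:
    """Energy spent per unit of work relative to baseline
    (the inverse of the efficiency gain)."""
    return power_ratio / work_ratio


# Ratios taken from the comment: 3x the power for 4x the work.
gain = efficiency_gain(power_ratio=3.0, work_ratio=4.0)
runtime = sustained_runtime_ratio(power_ratio=3.0)
energy = energy_per_task_ratio(power_ratio=3.0, work_ratio=4.0)

print(f"efficiency gain: {gain:.2f}x")
print(f"runtime at sustained full load: {runtime:.2f}x of baseline")
print(f"energy per unit of work: {energy:.2f}x of baseline")
```

The three numbers describe the same trade-off from different angles: efficiency improves, yet at sustained full load the battery lasts a third as long, while a fixed amount of work costs less energy only if the chip can finish early and drop back to idle.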
