Marvell Announces ThunderX3: 96 Cores & 384 Threads in a 3rd Gen Arm Server Processor
by Andrei Frumusanu on March 16, 2020 8:30 AM EST - Posted in
- Servers
- CPUs
- Marvell
- Arm
- Enterprise
- Enterprise CPUs
- Cavium
- ThunderX3
The Arm server ecosystem is alive and thriving, finally getting into serious motion after several years of false-start attempts. Among the original pioneers in this space was Cavium, which was acquired by Marvell in 2018. Among the company’s server CPU products is the ThunderX line; while the first-generation ThunderX left quite a lot to be desired, the ThunderX2 was the first Arm server silicon that we deemed viable and competitive against Intel and AMD products. Since then, the ecosystem has accelerated quite a lot, and just last week we saw how impressive the new Amazon Graviton2 with its Neoverse N1 cores ended up being. Marvell didn’t stop at the ThunderX2; it has big ambitions for its newly acquired CPU division, and today it is announcing the new ThunderX3.
The ThunderX3 is a continuation of and successor to then-Cavium’s custom microarchitecture found in the TX2, retaining many of its key characteristics, most notably 4-way SMT. Alongside a new microarchitecture with higher IPC, the new TX3 also raises clock frequencies and now hosts up to a whopping 96 CPU cores, allowing the chip to scale to 384 threads in a single socket.
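As a quick sanity check on those headline numbers, here is a minimal back-of-envelope sketch using only the figures from the announcement:

```python
# Thread-count arithmetic from the announced ThunderX3 specifications.
cores_per_socket = 96
smt_ways = 4                    # 4-way simultaneous multithreading (SMT4)
sockets = 2                     # dual-socket configurations are supported

threads_per_socket = cores_per_socket * smt_ways
print(threads_per_socket)               # 384 threads in a single socket
print(threads_per_socket * sockets)     # 768 threads in a 2-socket system
```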
Marvell sees workloads shifting as more and more applications move to the cloud and change in nature, with more customers deploying their own custom software stacks and scaling these applications out. This means workloads aren’t necessarily focused on single-threaded performance, but rather on the total throughput available in the system, at which point power efficiency also comes into play.
Like many other Arm vendors, Marvell sees a window of opportunity in the lack of execution from the x86 incumbents, very much calling out Intel’s stumbles in process leadership over the past few years, and x86 designs in general being higher power. Marvell argues that part of the problem is that the x86 players’ current designs target a wide range of deployments, from consumer client devices all the way to actual server machines, and never achieve the best results at either end. In contrast, the ThunderX line-up is reportedly designed specifically with server workloads in mind, achieving higher power efficiency and thus higher total throughput in a system.
We’ve known that the ThunderX3 has been coming for quite a while now, having admittedly expected it towards the latter half of 2019. We don’t know the behind-the-scenes timeline, but Marvell is now finally ready to talk about the new chip. Marvell’s CPU roadmap is on a 2-year cadence, and the company explains that this is a practical timeline, giving customers time to actually adopt a generation and get a good return on investment on the platform before possibly switching over to the next one. Of course, this also gives the design team more time to bring larger performance jumps to market once the new generations are ready.
The ThunderX3 - 96 Cores and 384 Threads in Arm v8.3+
So, what is the new ThunderX3? It’s an ambitious design hosting up to 96 custom Arm v8.3+ cores running at frequencies of up to 3 GHz all-core, at TDPs ranging from 100 to 240 W depending on the SKU.
Marvell isn’t quite ready to go into much detail on the new CPU microarchitecture just yet, saying that a deeper disclosure of the TX3 cores will follow later in the year (they’re aiming for Hot Chips), but the company does say that one key characteristic is that it now features four 128-bit SIMD execution units, matching the vector execution throughput of AMD’s and Intel’s cores. When these units are fully loaded, all-core clock frequencies drop to between 2.2 and 2.6 GHz, limited by the thermal and power headroom available to the chip.
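For a rough sense of what four 128-bit units buy, here is a hedged peak-throughput estimate; it assumes each unit can retire one FP64 fused multiply-add per cycle, which Marvell has not confirmed, and uses the quoted 2.2–2.6 GHz all-core vector clocks:

```python
# Peak FP64 estimate per core and per socket - an assumption-laden sketch,
# since Marvell hasn't detailed the SIMD pipeline mix yet.
simd_units = 4
fp64_lanes = 128 // 64          # 2 doubles per 128-bit vector
flops_per_fma = 2               # one FMA counts as a multiply and an add
cores = 96

for ghz in (2.2, 2.6):
    per_core = simd_units * fp64_lanes * flops_per_fma * ghz       # GFLOPS
    per_socket = per_core * cores / 1000                           # TFLOPS
    print(f"{ghz} GHz: {per_core:.1f} GFLOPS/core, {per_socket:.2f} TFLOPS/socket")
```

Under those assumptions the chip would land at roughly 3.4 to 4 TFLOPS of FP64 per socket, which is in the same ballpark as current 64-core x86 parts.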
With SMT4, the 96-core SKU scales up to 384 threads per socket, by far the highest thread count of any current or announced server CPU on the market, and a big differentiating factor for the ThunderX3.
Marvell doesn’t go into the details of the chip’s topology or its packaging technology, only alluding to it having monolithic-like latencies between the CPU cores. The design comes in 1- and 2-socket configurations, with inter-socket communication handled by the 3rd generation of CCPI (Cavium Cache Coherent Interconnect), using 24 lanes at 28 Gbit/s each between the two sockets.
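Taking those link figures at face value, the raw socket-to-socket bandwidth works out as follows (a naive sketch that ignores encoding and coherency-protocol overhead, and assumes the figure is per direction):

```python
# Raw CCPI gen 3 inter-socket bandwidth, before any protocol overhead.
lanes = 24
gbit_per_lane = 28
raw_gbit = lanes * gbit_per_lane               # 672 Gbit/s between sockets
print(raw_gbit, "Gbit/s raw =", raw_gbit / 8, "GB/s")
```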
External connectivity is handled by 64 lanes of PCIe 4.0 across 16 controllers per socket, meaning up to 16 x4 devices, with the option of multiplexing lanes into higher-bandwidth x8 or x16 links.
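In aggregate that is a healthy amount of I/O bandwidth; a quick per-socket, per-direction estimate using the standard PCIe 4.0 signalling rate and line coding:

```python
# Aggregate PCIe 4.0 bandwidth per socket, per direction.
lanes = 64
gts_per_lane = 16                    # PCIe 4.0 signalling rate in GT/s
encoding = 128 / 130                 # 128b/130b line coding
gb_per_lane = gts_per_lane * encoding / 8       # ~1.97 GB/s per lane
print(f"{gb_per_lane:.2f} GB/s per lane, {lanes * gb_per_lane:.1f} GB/s per socket")
```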
The chip’s memory capabilities are in line with current-generation standards, featuring 8 DDR4-3200 memory controllers.
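Eight channels of DDR4-3200 put the theoretical peak memory bandwidth at the same level as EPYC Rome and other current 8-channel platforms:

```python
# Theoretical peak DRAM bandwidth per socket.
channels = 8
transfers_per_s = 3200e6             # DDR4-3200
bytes_per_transfer = 8               # 64-bit wide channel
per_channel = transfers_per_s * bytes_per_transfer / 1e9       # GB/s
print(per_channel, "GB/s per channel,", per_channel * channels, "GB/s per socket")
```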
Marvell plans several SKUs, scaling the core count and number of memory controllers across TDP targets ranging from 100 W to 240 W. These will all be based on the same silicon design, with the chips binned accordingly.
Large Generational Performance Improvements
Compared to the previous-generation ThunderX2, the TX3 lists some impressive performance increases. IPC is said to have increased by a minimum of 25% across workloads, with total single-threaded performance going up by at least 60% once combined with the clock frequency increases. If we use the TX2 figures we have at hand, this would mean the new chip lands slightly ahead of Neoverse N1 systems such as the Graviton2, and matches more aggressively clocked designs such as the Ampere Altra.
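Those two quoted figures also let us back out the implied clock uplift, assuming the gains simply multiply (a simplification, since the exact baseline SKU and workload mix aren't specified):

```python
# Implied frequency uplift from Marvell's quoted generational gains.
ipc_gain = 1.25                 # ">= 25% IPC"
single_thread_gain = 1.60       # ">= 60% single-threaded performance"
implied_clock_gain = single_thread_gain / ipc_gain
print(f"implied frequency uplift: ~{(implied_clock_gain - 1) * 100:.0f}%")   # ~28%
```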
Socket-level integer performance has increased by at least 3x, thanks both to the more capable cores and to the vastly increased core count of up to 96. Because the new CPU now has more SIMD execution units, the floating-point uplift is even higher, at up to 5x.
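Here is a rough decomposition of where those multiples could come from, assuming the 32-core ThunderX2 top SKU with two 128-bit SIMD units per core as the baseline; both are assumptions about the comparison point, not figures Marvell provided:

```python
# Hedged decomposition of the claimed socket-level gains over ThunderX2.
core_ratio = 96 / 32            # TX3 vs. an assumed 32-core TX2 baseline
simd_ratio = 4 / 2              # 128-bit SIMD units per core: 4 vs. an assumed 2
print("integer scaling from core count alone:", core_ratio)               # 3.0
print("raw FP scaling from cores x SIMD units:", core_ratio * simd_ratio)  # 6.0
# The quoted "up to 5x" FP figure sits below the raw 6x product, which would
# be consistent with the lower 2.2-2.6 GHz all-core clocks under heavy SIMD load.
```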
Because the chip comes with SMT4 and has been designed with cloud workloads in mind, it is able to extract more throughput out of the silicon than non-SMT or SMT2 designs. Cloud workloads here essentially means data-plane-bound workloads in which the CPU has to wait on data from a more distant source; SMT helps in such designs because the idle execution cycles between data accesses are simply filled by a different thread, itself performing long-latency accesses.
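A toy latency-hiding model illustrates the point; it is purely illustrative and not a model of the actual TX3 pipeline:

```python
def utilisation(threads, compute_cycles, stall_cycles):
    """Fraction of peak throughput when `threads` share one core, each
    alternating `compute_cycles` of work with `stall_cycles` waiting on data."""
    return min(1.0, threads * compute_cycles / (compute_cycles + stall_cycles))

# A stall-heavy, data-plane-like mix: 50 cycles of compute per 150 cycles of waiting.
for t in (1, 2, 4):
    print(f"SMT{t}: ~{utilisation(t, 50, 150):.0%} of peak")
```

In this contrived example a single thread keeps the core only a quarter busy, SMT2 gets to half, and SMT4 approaches full utilisation, which is exactly the regime Marvell is targeting.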
ThunderX3 Performance Claims Against the Competition
Using this advantage, the ThunderX3 is said to have significant throughput advantages over the incumbent x86 players, vastly exceeding the performance of anything Intel currently has to offer, and also beating AMD’s Rome systems in extremely data-plane-bound workloads thanks to SMT4 and the higher core counts.
More execution- and compute-bound workloads will see the smallest gains here, as the SMT4 advantage greatly diminishes.
Yet for HPC, and in particular floating-point workloads, the ThunderX3 is also said to be able to showcase its strengths thanks to the additional SIMD units as well as the overall power efficiency of the system, allowing for significantly higher performance in such calculations. Effective memory bandwidth is also claimed to be higher than on a comparable AMD Rome based system because of the lower latencies the TX3 is able to achieve. It’s to be noted that the ThunderX3 will be coming to market later in the year, by which time it will have to compete with AMD’s newer Milan server CPUs.
Marvell says that Arm in the cloud is gaining a lot of traction, and the company claims it is already the market leader in terms of ThunderX2 deployments among companies and hyperscalers (Microsoft Azure currently being the only one publicly disclosed, though there are said to be more). I don’t really know whether hosting an extremely high number of virtual machines on a single chip is actually an advantage (because of SMT4, per-VM performance might be quite poor), but Marvell does state that it would lead this metric with the ThunderX3, thanks to being able to host up to 384 threads.
Finally, the company claims a 30% perf/W advantage over AMD’s Rome platform on average across different workloads, thanks to the more targeted microarchitecture design. The more interesting comparison would have been a showcase or estimate of how the ThunderX3 fares against Neoverse N1 systems such as the Graviton2 or the Altra, as the latter would undoubtedly be the closest competitor to the new Marvell offering. Given that the Altra isn’t available yet, we don’t know for sure how the systems will compete against each other, but I do suspect the ThunderX3 will do better in at least FP workloads, and of course it has an indisputable advantage in data-plane workloads thanks to its SMT4 capability.
More Information at Hot Chips 2020
Marvell hasn’t yet disclosed much about the cache configuration or other specifics of the system, for example what kind of interconnect the cores will use or what CPU topology they will be arranged in. The ThunderX3’s success will seemingly depend on how well it can scale performance across all of its 96 cores and 384 threads, but at least as an initial impression, it seems that it might do quite well.
Today is just the initial announcement of the TX3, and Marvell will be revealing more details about the new CPU and the product line-up over the coming months until its eventual availability later in the year.
44 Comments
RallJ - Monday, March 16, 2020 - link
Just marketing fluff, no real info. Far behind the N1. This is DOA.
eek2121 - Monday, March 16, 2020 - link
My favorite part is they claim x86 has “low memory bandwidth” yet they have the exact same bandwidth as EPYC Rome. I will continue to be skeptical until they offer something concrete and testable.
ProDigit - Sunday, April 26, 2020 - link
The AMD Ryzen series does have quite some latency, due to its auto frequency settings with XMP timings, Infinity Fabric, and core auto-boost; much more than Intel, which is why AMD absolutely NEEDS fast memory to perform. This latency can take up to several seconds before the auto configuration has stabilized and the system runs optimally and lag-free.
It doesn't affect compute by much, save for the beginning load times, but it does affect VMs, and adds to the latency of cloud interactions.
I guess they're saying that this will not be the case with the ThunderX3.
Dug - Monday, March 16, 2020 - link
Interesting read by Linus Torvalds: https://www.realworldtech.com/forum/?threadid=1834... Basically, why it hasn't taken off is that nobody is developing on those systems. Any benefit that ARM provides is negated by the fact that you need to change everything that is already developed, so there is no cost benefit. People will pay more for an x86 box simply because it's what they developed their load on. He points to the example of why the RISC vendors died off.
Same on the software side. There was no cross-development. It's too costly and relatively painful. And developers go to where the hardware and software already exist and are easy to develop on.
The_Assimilator - Monday, March 16, 2020 - link
Yup. People keep going on and on about how Arm hardware is so much cheaper, when hardware is only a portion - often small - of the TCO. x86 is entrenched and the inertia to overcome that entrenchment is massive. Hence why the only companies that are actually interested in Arm servers are the companies that don't have to pay that massive software debt - i.e. hosting providers like Amazon.
webdoctors - Monday, March 16, 2020 - link
This has been known for more than 10 yrs. I think the idea was that the platform provider would port the entire toolchain, back when AMD bought SuperMICRO. The platform providers need to port the entire platform to ARM: the OS, the database software, the entire software ecosystem, so that when folks are selling SaaS, it doesn't matter what the CPU type is, because the customer doesn't care.
Look at Android SW development: you don't know what the base CPU type is. The linker deals with it.
rahvin - Monday, March 16, 2020 - link
AMD never purchased Supermicro. AMD did buy a company called SeaMicro that developed large mainframe-style computers.
questionlp - Monday, March 16, 2020 - link
SeaMicro had built ultra-dense micro servers, not mainframe-style computers. They were shut down by AMD a couple of years ago, I think.
rahvin - Tuesday, March 17, 2020 - link
Ultra-dense with high IO is mainframe IMO. The mainframe space has blurred a lot over the last decade, with many new companies entering the market with these ultra-dense x86 servers that behave very similarly to the mainframes of the past. Maybe the IO isn't quite high enough to qualify as mainframe, but they are close enough that I wouldn't disqualify them.