Enterprise NVMe Round-Up 2: SK Hynix, Samsung, DapuStor and DERA
by Billy Tallis on February 14, 2020 11:15 AM EST
Test Setup
For this year's enterprise SSD reviews, we've overhauled our test suite. The overall structure of our tests is the same, but a lot has changed under the hood. We're using newer versions of our benchmarking tools and the latest long-term support kernel branch. The tests have been reconfigured to drastically reduce CPU overhead, which has minimal impact on SATA drives but lets us properly push the limits of the many enterprise NVMe drives for the first time.
The general philosophy underlying the test configuration was to keep everything at its default or most reasonable everyday settings, and to change as little as possible while still allowing us to measure the full performance of the SSDs. Esoteric kernel and driver options that could marginally improve performance were ignored. The biggest change from last year's configuration, and the biggest departure from normal everyday usage, is in the IO APIs used by the fio benchmarking tool to interact with the operating system.
In the past, we configured fio to use ordinary synchronous IO APIs: read() and write() style system calls. The way these work is that the application makes a system call to perform a read or write operation, and control transfers to the kernel to handle the IO. The application thread is suspended until that IO is complete. This means we can only have one outstanding IO request per thread, and hitting a drive with a queue depth of 32 requires 32 threads. That's no problem on a 36-core test system, but when it takes a queue depth of 200 or more to saturate a high-end NVMe SSD, we run out of CPU power. Running more threads than cores can get us a bit more throughput than just QD36, but that causes latency to suffer not just from the overhead of each system call, but from threads fighting over the limited number of CPU cores. In practice, this testbed is limited to about 560k IOPS when performing IO this way, and that leaves no CPU time for doing anything useful with the data that's moving around. Spectre, Meltdown and other vulnerability mitigations tend to keep increasing system call and context switch overhead, so this situation isn't getting any better.
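To make that model concrete, here is a minimal sketch of synchronous IO from the application's point of view. This is not our actual test code: the device path is hypothetical and error handling is omitted.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDONLY);   /* hypothetical test device */
    char *buf = malloc(4096);

    for (off_t off = 0; off < 4096L * 1024; off += 4096) {
        /* Control passes to the kernel and this thread sleeps until the
         * read completes: one outstanding IO per thread, so reaching a
         * queue depth of 32 takes 32 of these threads. */
        if (pread(fd, buf, 4096, off) <= 0)
            break;
    }

    free(buf);
    close(fd);
    return 0;
}
```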
The alternative is to use asynchronous storage APIs that allow an application thread to submit an IO request to the operating system but then continue executing while the IO is performed. For benchmarking purposes, that continued execution means the application can keep submitting more IO requests before the first one is complete, and a single thread can load down an SSD with a reasonably high queue depth.
Asynchronous IO presents challenges, especially on Linux. On any platform, asynchronous IO is a bit more complicated for the application programmer to deal with, because submitting a request and getting the result become separate steps, and operations may complete out of order. On Linux specifically, the original async IO APIs were fraught with limitations. The most significant is that Linux native AIO is only actually asynchronous when IO is set to bypass the operating system's caches, which is the opposite of what most real-world software should want. (Our benchmarking tools have to bypass the caches to ensure we're measuring the SSD and not the testbed's 192GB of RAM.) Other AIO limitations include support for only one filesystem, and myriad scenarios in which IO silently falls back to being synchronous, unexpectedly halting the application thread that submitted the request. The end result of all those issues is that true asynchronous IO on Linux is quite rare and only usable by some applications with dedicated programmers and competent sysadmins. Benchmarking with Linux AIO makes it possible to stress even the fastest SSD, but such a benchmark can never be representative of how mainstream software does IO.
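As a rough sketch of what using the old interface involves, here is the submit-and-reap pattern with the libaio wrapper library (hypothetical device path, error handling omitted). The O_DIRECT flag is what keeps io_submit() from quietly falling back to synchronous behavior:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    void *buf;
    posix_memalign(&buf, 4096, 4096);       /* O_DIRECT needs aligned buffers */

    io_context_t ctx = 0;
    io_setup(32, &ctx);                     /* room for 32 IOs in flight */

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);   /* describe the read request */
    io_submit(ctx, 1, cbs);                 /* hand it to the kernel... */

    /* ...and this thread is free to do other work before reaping. */
    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);     /* wait for the completion */

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;                               /* build with: gcc ... -laio */
}
```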
The best way to set storage benchmark records is to get the operating system kernel out of the way entirely using a userspace IO framework like SPDK. This eliminates virtually all system call overhead and makes truly asynchronous IO possible and fast. It also eliminates the filesystem and the operating system's caching infrastructure and makes those the application's responsibility. Sharing an SSD between applications becomes almost impossible, and at the very least requires rewriting both applications to use SPDK and overtly cooperate in how they use the drive. SPDK works well for use cases where a heavily customized application stack and system configuration is possible, but it is no more capable of becoming a mainstream solution than Linux AIO.
A New Hope
What's changed recently is that Linux kernel developer (and fio author) Jens Axboe introduced a new asynchronous IO API that's easy to use and very fast. Axboe has documented the rationale behind the new API and how to use it. In summary: the core principle is that communication between the kernel and userspace software takes place through a pair of ring buffers, so the API is called io_uring. One ring buffer is the IO submission queue: the application writes requests into this buffer, and the kernel reads them to act on. The other is the completion queue, where the kernel writes notifications of completed IOs, which the application watches for. This dual-queue structure is basically the same as how the operating system communicates with NVMe devices. For io_uring, both queues are mapped into the memory address spaces of both the application and the kernel, so there's no copying of data required.
The application doesn't need to make any system calls to check for completed IO; it just needs to inspect the contents of the completion ring. Submitting IO requests involves putting the request in the submission queue, then making a system call to notify the kernel that the queue isn't empty. There's an option to tell the kernel to keep checking the submission queue as long as it doesn't stay idle for long. When that mode is used, a large number of IOs can be handled with an average of approximately zero system calls per IO. Even without it, io_uring allows IO to be done with one system call per IO, compared to two per IO with the old Linux AIO API.
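As a concrete illustration, here is a minimal sketch of that flow using the liburing helper library (hypothetical device path, error handling omitted). Submitting takes one system call for a whole batch of requests, and completions are read straight off the shared ring with no system call at all:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    /* Passing IORING_SETUP_SQPOLL instead of 0 here asks the kernel to keep
     * polling the submission ring itself, the mode that averages roughly
     * zero system calls per IO. */
    io_uring_queue_init(32, &ring, 0);

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    void *buf;
    posix_memalign(&buf, 4096, 4096);
    struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

    /* Write a request into the shared submission ring... */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);

    /* ...then one system call tells the kernel the ring isn't empty. More
     * requests could have been queued up before this point. */
    io_uring_submit(&ring);

    /* Completions show up in the shared completion ring; an application can
     * simply peek at it while doing other work instead of blocking here. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;                               /* build with: gcc ... -luring */
}
```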
Using synchronous IO, our enterprise SSD testbed cannot reach 600k IOPS. With io_uring, we can do more than 400k IOPS on a single CPU core without any extra performance tuning effort. Hitting 1M IOPS on a real SSD takes at most 4 CPU cores, so even the Micron X100 and upcoming Intel Alder Stream 3D XPoint SSDs should pose no challenge to our new benchmarks.
The first stable kernel to include the io_uring API was version 5.1 released in May 2019. The first long term support (LTS) branch with io_uring is 5.4, released in November 2019 and used in this review. The io_uring API is still very new and not used by much real-world software. But unlike the situation with the old Linux AIO APIs or SPDK, this seems likely to change. It can do more than previous asynchronous IO solutions, including being used for both high-performance storage and network IO. New features are arriving with every new kernel release; lots of developers are trying it out, and I've seen feature requests fulfilled in a matter of days. Many high-level languages and frameworks that currently simulate asynchronous IO using thread pools will be able to implement new io_uring backends.
For storage benchmarking on Linux, io_uring currently strikes the best balance between the competing desires to simulate workloads in a realistic manner, and to accurately gauge what kind of performance a solid state drive is capable of providing. All of the fio-based tests in our enterprise SSD test suite now use io_uring and never run more than 16 threads even when testing queue depths up to 512. With the CPU bottlenecks eliminated, we have also disabled HyperThreading.
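To show why thread count and queue depth are now decoupled, here is a simplified single-threaded sketch of the pattern, not our actual fio configuration: one thread keeps topping up the submission ring and reaps completions in batches, so a deep queue no longer requires a pile of threads. All reads land in one buffer whose contents are discarded, which is fine for a throughput sketch.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define QD 64                               /* target queue depth */

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(QD, &ring, 0);

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    void *buf;
    posix_memalign(&buf, 4096, 4096);
    struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

    int inflight = 0;
    for (long done = 0; done < 1000000; ) {
        /* Refill the submission ring up to the target queue depth. */
        while (inflight < QD) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_readv(sqe, fd, &iov, 1, (rand() % 32768) * 4096L);
            inflight++;
        }
        io_uring_submit(&ring);             /* one system call per batch */

        /* Wait for at least one completion, then drain whatever is ready. */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        do {
            io_uring_cqe_seen(&ring, cqe);
            inflight--;
            done++;
        } while (io_uring_peek_cqe(&ring, &cqe) == 0);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}
```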
Enterprise SSD Test System
System Model | Intel Server R2208WFTZS
CPU | 2x Intel Xeon Gold 6154 (18C, 3.0 GHz)
Motherboard | Intel S2600WFT (firmware 2.01.0009, CPU microcode 0x2000065)
Chipset | Intel C624
Memory | 192GB total, Micron DDR4-2666 16GB modules
Software | Linux kernel 5.4.0, fio version 3.16
Thanks to StarTech for providing a RK2236BKF 22U rack cabinet.
Comments
PaulHoule - Friday, February 14, 2020 - link
"The Samsung PM1725a is strictly speaking outdated, having been succeeded by a PM1725b with newer 3D NAND and a PM1735 with PCIe 4.0. But it's still a flagship model from the top SSD manufacturer, and we don't get to test those very often."Why? If you've got so much ink for DRAMless and other attempts to produce a drive with HDD costs and SSD performance (hopefully warning people away?) why can't you find some for flagship products from major manufacturers?
Billy Tallis - Friday, February 14, 2020 - link
The division of Samsung that manages the PM17xx products doesn't really do PR. We only got this drive to play with because MyDigitalDiscount wanted an independent review of the drive they're selling a few thousand of.
The Samsung 983 DCT is managed by a different division than the PM983, and that's why we got to review the 983 DCT, 983 ZET, 883 DCT, and so on. But that division hasn't done a channel/retail version of Samsung's top of the line enterprise drive.
romrunning - Friday, February 14, 2020 - link
Too bad you don't get more samples of the enterprise stuff. I mean, you have influencers, recommenders, and straight-up buyers of enterprise storage who read Anandtech.
Billy Tallis - Friday, February 14, 2020 - link
Some of it is just that I haven't tried very hard to get more enterprise stuff. It worked okay for my schedule to spend 5 weeks straight testing enterprise drives because we didn't have many consumer drives launch over the winter. But during other times of the year, it's tough to justify the time investment of updating a test suite and re-testing a lot of drives. That's part of why this is a 4-vendor roundup instead of 4 separate reviews.
Since this new test suite seems to be working out okay so far, I'll probably do a few more enterprise drives over the next few months. Kingston already sent me a server boot drive after CES, without even asking me. Kioxia has expressed interest in sampling me some stuff. A few vendors have said they expect to have XL-NAND drives real soon, so I need to hit up Samsung for some Z-NAND drives to retest and hopefully keep this time.
And I'll probably run some of these drives through the consumer test suite for kicks, and upload the results to Bench like I did for one of the PBlaze5s and some of the Samsung DCTs.
PandaBear - Friday, February 14, 2020 - link
ESSD firmware engineer here (and yes, I have worked at one of the companies above). Enterprise drives are mostly sold to large system builders, so Anandtech doesn't really "influence" or "recommend" for the enterprise business. There are way more requirements than just 99.99 latency and throughput, and buyers tend to focus on the worst case scenarios rather than the peak best cases. Oh, and pricing matters a lot. You need to be cheap enough to make it into the top 3-4 or else you lose a lot of business, even if you are qualified.
RobJoy - Tuesday, February 18, 2020 - link
Well, these are Intel owners here. Anything PCIe 4.0 has not even crossed their minds, and they are patiently waiting for Intel to move their ass.
No chance in hell they dare go the AMD Rome way, even if it performs better and costs less.
romrunning - Friday, February 14, 2020 - link
This article makes my love of the P4800X even stronger! :) If only they could get the capacity higher and the pricing lower - true of all storage, though especially desired for Optane-based drives.curufinwewins - Friday, February 14, 2020 - link
100% agreed, it's such a paradigm shifter by comparison.eek2121 - Friday, February 14, 2020 - link
Next gen Optane is supposed to significantly raise both capacity and performance. Hopefully Intel is smart and prices their SSD-based Optane solutions at a competitive price point.
Ok, great stuff Billy! I know it wasn't really the focus of this review, but dang, I actually came out ludicrously impressed with how very small quantities of first gen Optane on relatively low channel installments have such radically different (and almost always in a good way) behavior from flash. Definitely looking forward to the next generation of this product.