Answering Your Questions: Powering the Data Center with NVM Express™ Webcast

By Mark Carlson and John Kim

We held our Q2 webcast, “Powering the Data Center with NVM Express™,” on June 11, 2019. Attendees learned how the NVMe™ specifications are evolving to support data center deployment of SSDs. During the webcast, we discussed how hyperscalers have brought their requirements to the NVM Express organization, including isolation, predictable latency and write amplification. As a result, new features such as I/O Determinism and predictable latency have been added to the upcoming NVMe 1.4 specification to address some of these concerns. We also addressed the demand for TCP to be added to the family of NVMe-oF™ transports.

We received many thought-provoking questions from the audience during the webcast and were not able to answer all of them live. In this blog, we answer the remaining questions.

Customer Requirements

Why is tail latency such a big issue? Why isn’t having a low average latency enough?
Requests end up pulling in other requests. For example, a single web page can result in hundreds of requests, some of them serialized behind others. A long tail will visibly slow the loading of that page and affect customers.
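
To make the fan-out effect concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each request independently has a fixed chance of landing in the slow tail (for example 1% for a 99th-percentile latency). The function name and the numbers are hypothetical and do not come from the webcast.

```python
def prob_page_hits_tail(num_requests: int, tail_fraction: float) -> float:
    """Chance that at least one of num_requests independent requests is a tail request.

    tail_fraction is the share of requests that fall in the slow tail,
    e.g. 0.01 when looking at the 99th percentile (an assumed value).
    """
    return 1.0 - (1.0 - tail_fraction) ** num_requests


if __name__ == "__main__":
    # Illustrative fan-out sizes; real pages and services will differ.
    for n in (1, 10, 100, 500):
        p = prob_page_hits_tail(n, 0.01)
        print(f"{n:>4} requests: {p:.1%} chance of at least one tail request")
```

Under these assumptions, a page that issues 100 requests has roughly a 63% chance of waiting on at least one tail request, even though each individual request is slow only 1% of the time. This is why a low average latency alone is not enough.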

Are there enterprise (non-hyperscaler) customers who also want these new features in NVMe architecture?
Yes, especially if they are still building out data centers and want to use the same approach.

Why are hyperscaler or data center requirements different from those of regular customers?
These requirements differ because of the scale at which hyperscalers operate and because they have a large staff of skilled programmers available.

NVM Sets

If NVM sets are covered in the NVMe 1.4 specification, but NVM set creation/deletion is not, does each vendor need to create their own method of creating NVM sets?
Yes, or they are pre-configured at the factory.

Can an NVM namespace be moved from one set to another?
No.

Can an NVM set contain multiple namespaces from the same SSD?
Yes.

Can one set contain namespaces that span multiple SSDs?
No.

I/O Determinism

Why can’t vendors offer an NVMe SSD that achieves deterministic latency 100% of the time?
This is difficult because of the background work the drive must do to prevent data loss. This is the nature of using NAND media.

Does I/O Determinism apply to writes also, or is it only for reads?
It’s only for reads. The drive may be able to handle a few writes and still provide predictable reads.

Do any SATA or SAS SSDs offer deterministic latency or endurance groups?
SATA and SAS do have features to address some of these issues, but they handle it differently.

Endurance Group

Can I use Media Unit and Capacity Endurance Group Management together, or are they mutually exclusive?
While a drive or system could support both, it is unlikely.

Who is managing the endurance groups if it’s not done by the host? Is it done by the array controller?
Endurance Groups can be thought of as an agreement between host and controller on who is doing the wear leveling for which pieces of media. The typical drive sold today (non-IOD) has a single Endurance Group, which tells the host that the drive takes care of all wear leveling.

Is NVMe over Fabrics the only storage networking protocol that can run over Ethernet with RDMA or TCP?
NFS, iSCSI, and SMB can also run over both RDMA and TCP. NFS over RDMA is used by some customers (over RoCE transport) who need faster file storage performance, but it is not supported by many commercial storage arrays. iSCSI Extensions for RDMA (iSER) are supported by several commercial storage arrays and can use the RoCE or iWARP transports for RDMA. SMB Direct is supported by all Windows Server 2012, 2016, and 2019 systems and can use RoCE or iWARP for RDMA transport.

Why is failure instead of recovery sometimes an optimal situation?
We don’t like to refer to this as a failure but as an option to eliminate heroic error recovery.
The host may be able to access a different copy faster from somewhere else. Many hyperscalers keep three or more copies of each piece of data and would rather read the data quickly from the second or third location than spend a lot of time recovering from a read error at the first location.
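
The host-side pattern described here can be sketched as follows. This is a hypothetical illustration in Python, not an NVMe API: `ReadError`, `read_with_fallback`, and the two example replica functions are invented names, and each replica is modeled as a simple callable that either returns data or fails fast.

```python
from typing import Callable, Optional, Sequence


class ReadError(Exception):
    """Hypothetical error raised when a replica fails fast instead of retrying."""


def read_with_fallback(readers: Sequence[Callable[[], bytes]]) -> bytes:
    """Try each replica in order and return the first successful read."""
    last_error: Optional[Exception] = None
    for read in readers:
        try:
            return read()
        except ReadError as err:
            # Do not wait for lengthy recovery; move on to the next copy.
            last_error = err
    raise ReadError("all replicas failed") from last_error


def failing_replica() -> bytes:
    # Stands in for a drive that reports a read error quickly.
    raise ReadError("fast-fail read error at first location")


def healthy_replica() -> bytes:
    # Stands in for a second copy stored elsewhere.
    return b"payload"


if __name__ == "__main__":
    print(read_with_fallback([failing_replica, healthy_replica]))
```

The design choice is that a quick error from the first location costs less than a long recovery attempt, as long as another copy can be read promptly.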

What use case might be found in regard to changing a configuration after the media has been used? You mentioned it might be the basis for a future TP if found.
One use case would be repurposing a drive for a new situation that has different requirements.

If you have more questions about powering the data center or are a member interested in becoming more involved in the specification development process, please contact us to learn more. If you missed the live webcast or want to re-watch sections, a full recording is available here.