Answering Your Questions: NVMe™/TCP: What You Need to Know About the Specification


By Sagi Grimberg and Peter Onufryk             

NVM Express, Inc. recently announced the addition of NVMe™ over TCP (NVMe™/TCP) to the family of NVMe transports. The addition of NVMe/TCP is such an important development that our first webcast of 2019 was an in-depth exploration of the crucial benefits and features of the new specification. If you missed the live webcast, we invite you to watch the recording.

We received so many engaging questions from the audience during the webcast that we were not able to answer them all live. In this blog, we answer all of your burning unanswered NVMe/TCP questions.

Should we expect official documentation of NVMe/TCP when the NVMe 1.4 spec is finalized?
NVMe/TCP is an NVMe-oF transport binding, so we expect the ratified technical proposal TP 8000 to be incorporated into the NVMe-oF 1.1 specification when it is released. While the NVMe Board has not announced an official schedule, we expect the specification to be released later this year.

What is needed on the Host side (hardware, firmware, software, drivers, etc.) to support NVMe over TCP?
NVMe/TCP software does not require any special hardware or firmware to operate, although different classes of CPUs and network adapters can deliver better performance. NVMe/TCP host and NVM subsystem software needs to be installed in order to run NVMe/TCP. The software is available in the Linux kernel (v5.0) and SPDK (v19.01), as well as in commercial NVMe/TCP target devices.
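To make the host-side requirements concrete, here is a minimal, illustrative sketch of bringing up NVMe/TCP on a Linux host with nvme-cli installed. The target address, port, and subsystem NQN below are placeholders, and the exact steps will vary by distribution and product.

```python
# Illustrative sketch: load the NVMe/TCP host driver and connect to a target
# with nvme-cli. The address, port, and NQN below are placeholders.
import subprocess

TARGET_ADDR = "192.0.2.10"                       # placeholder target IP
TARGET_PORT = "4420"                             # commonly used NVMe/TCP port
SUBSYS_NQN = "nqn.2019-01.org.example:subsys1"   # placeholder subsystem NQN

# Load the NVMe/TCP host driver (available since Linux kernel v5.0).
subprocess.run(["modprobe", "nvme-tcp"], check=True)

# Query the target's discovery controller for available subsystems.
subprocess.run(["nvme", "discover", "-t", "tcp",
                "-a", TARGET_ADDR, "-s", TARGET_PORT], check=True)

# Connect to the subsystem; on success a new /dev/nvmeXnY block device appears.
subprocess.run(["nvme", "connect", "-t", "tcp",
                "-a", TARGET_ADDR, "-s", TARGET_PORT,
                "-n", SUBSYS_NQN], check=True)
```

No special hardware is involved: the same steps work over any Ethernet NIC, which is the point of the transport.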

Are there any constraints to the number of namespaces a host can have at runtime? What is needed in terms of resources on the host side (cores, memory, ports)?
NVMe/TCP does not impose any limitations on the basic features of the NVMe architecture because it is an NVMe-oF transport binding. As such, there is no limit on the number of namespaces that can be supported with NVMe/TCP. From the transport point of view, a namespace is a purely logical concept without any host resources assigned to it.

Can you elaborate on latency compared to direct-attached NVMe SSDs? Will latency improve with NVMe/TCP?
Latency will likely not improve over direct-attached NVMe simply by using NVMe/TCP. A specific controller implementation may, however, include its own optimizations that improve latency.

Which upstream kernel has support for NVMe/TCP?
Linux kernel versions 5.0 and later support NVMe/TCP.

Would you expect significant performance differences when running NVMe/TCP on top of a data plane based networking stack such as DPDK?
If the platform on which the controller runs has enough horsepower to run NVMe/TCP on top of the general-purpose Linux networking stack, then radical performance improvements are not expected. However, if the controller cannot dedicate enough CPU to the Linux networking stack (for example, because other operations consume CPU cycles), then a DPDK-based solution may achieve better performance due to its greater efficiency.

Do you recommend having Data Center TCP for running an NVMe/TCP workload?
DCTCP has the potential to estimate congestion better than other TCP congestion control algorithms, and it can generally be useful regardless of NVMe/TCP. The questions to ask are whether that congestion pattern actually occurs in the data center network, and whether modern TCP/IP stacks already have other mechanisms that handle congestion efficiently.

Can you have multiple R2Ts? How do these compare to buffers in FCP?
In theory, a controller can have multiple R2T PDUs in flight to a host for a specific command; however, the host limits the maximum number of outstanding R2Ts (MAXR2T). R2T PDUs provide a credit mechanism comparable to buffer-to-buffer credits (BB_Credit) in FC, but they operate at the NVMe command level rather than at the FC port level.

How is flow control managed? Is it only done with R2T and standard TCP congestion windows?
Correct. End-to-end stream flow control is handled by TCP/IP, and NVMe transport-level flow control is managed via the R2T credit mechanism.
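To make the R2T accounting concrete, the small model below sketches the per-command MAXR2T limit described above. It is purely illustrative; the class and method names are invented and do not correspond to real driver code.

```python
# Illustrative model of per-command R2T credit accounting (not real driver code).

class CommandR2TState:
    """Tracks outstanding R2T PDUs for a single NVMe write command."""

    def __init__(self, maxr2t: int):
        self.maxr2t = maxr2t      # limit advertised by the host at connection setup
        self.outstanding = 0      # R2Ts the controller has sent that the host
                                  # has not yet fully satisfied with data

    def controller_may_send_r2t(self) -> bool:
        # The controller must not exceed the host's limit for this command.
        return self.outstanding < self.maxr2t

    def on_r2t_sent(self) -> None:
        assert self.controller_may_send_r2t()
        self.outstanding += 1

    def on_data_transfer_done(self) -> None:
        # Completing the requested data transfer returns the credit.
        self.outstanding -= 1


# With a limit of 2, at most two R2Ts can be outstanding for this command.
cmd = CommandR2TState(maxr2t=2)
cmd.on_r2t_sent()
cmd.on_r2t_sent()
print(cmd.controller_may_send_r2t())   # False
cmd.on_data_transfer_done()
print(cmd.controller_may_send_r2t())   # True
```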

Is there an ordering constraint for PDUs among multiple outstanding requests in the SQ?
No. PDUs that correspond to different NVMe commands have no defined ordering rules.

How are patching and upgrades managed with NVMe/TCP? Is it all non-disruptive? What is the rollout process for utilizing NVMe/TCP in large environments?
We recommend that you consult with a vendor about a specific solution. Nothing in the NVMe/TCP protocol either enforces or prevents non-disruptive patching and upgrades.

Are there any open source target implementations for NVMe/TCP available?
Yes. Both the Linux kernel and SPDK include an NVMe/TCP target implementation.
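As one concrete example, the Linux kernel target (nvmet) is configured through configfs. The sketch below outlines how a block device could be exported over NVMe/TCP under that layout; the NQN, backing device, and listen address are placeholders, and the sequence assumes root privileges on a kernel with the nvmet-tcp module available.

```python
# Illustrative sketch: export a block device via the Linux kernel NVMe target
# (nvmet) over TCP using its configfs interface. NQN, device, and address are
# placeholders; run as root.
import subprocess
from pathlib import Path

NQN = "nqn.2019-01.org.example:subsys1"   # placeholder subsystem NQN
BACKING_DEV = "/dev/nvme0n1"              # placeholder backing block device
ADDR, PORT = "192.0.2.10", "4420"         # placeholder listen address and port

subprocess.run(["modprobe", "nvmet-tcp"], check=True)
cfg = Path("/sys/kernel/config/nvmet")

# Create the subsystem and, for a simple lab setup, allow any host to connect.
subsys = cfg / "subsystems" / NQN
subsys.mkdir(parents=True, exist_ok=True)
(subsys / "attr_allow_any_host").write_text("1")

# Add namespace 1, back it with the block device, and enable it.
ns = subsys / "namespaces" / "1"
ns.mkdir(parents=True, exist_ok=True)
(ns / "device_path").write_text(BACKING_DEV)
(ns / "enable").write_text("1")

# Create a TCP port listening on the chosen address and expose the subsystem on it.
port = cfg / "ports" / "1"
port.mkdir(parents=True, exist_ok=True)
(port / "addr_trtype").write_text("tcp")
(port / "addr_adrfam").write_text("ipv4")
(port / "addr_traddr").write_text(ADDR)
(port / "addr_trsvcid").write_text(PORT)
(port / "subsystems" / NQN).symlink_to(subsys)
```

A host can then discover and connect to this subsystem with the nvme-cli steps shown earlier.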

Is there an equivalent of NVMe/TCP in iSCSI?
There is no NVMe/TCP equivalent in iSCSI, but there are many equivalent concepts. NVMe/TCP and iSCSI are equivalent in the sense that iSCSI is a SCSI transport that runs over TCP/IP and NVMe/TCP is an NVMe transport that runs over TCP/IP.

How does NVMe/TCP compare in performance (BW, IOPS, latency etc.) with NVMe/FC?
We have not tested any NVMe/FC products or open-source implementations, nor have we seen any comparable NVMe/FC performance benchmarks. We expect that both would show a relatively small degradation compared to direct-attached NVMe.

Do you have any data on CPU utilization for NVMe/RoCE vs. NVMe/TCP?
Not officially. However, a software NVMe/TCP implementation will require more CPU resources than NVMe/RDMA, which offloads the transport protocol to hardware. The difference depends on the workload and on how effective the stateless offloads implemented by the network adapter turn out to be.

What are the pros and cons of using NVMe/TCP compared to NVMe over RDMA? Are there performance differences?
NVMe/TCP is a transport binding that offers the benefits of commodity hardware and excellent scalability and reach, without requiring changes to the network infrastructure to support RDMA (e.g., making Ethernet lossless). NVMe/RDMA can have lower absolute latency, depending on the workload, and typically has lower CPU utilization, depending on the implementation and the effectiveness of its stateless offloads. When deciding on an investment, one should weigh the differences in performance for the workloads of interest, as well as other factors such as cost and scale.

How does NVMe/TCP differ from NVMe over RDMA? Can we combine these two types of traffic on the same 100 Gb/s Ethernet cable?
NVMe/TCP differs from NVMe/RDMA in that it carries NVMe-oF capsules and data on top of TCP/IP, whereas NVMe/RDMA carries them over either RoCE (InfiniBand transport over UDP) or iWARP (TCP with DDP and MPA). Both NVMe/TCP and NVMe/RDMA run over an Ethernet fabric, so they can share the same 100 Gb/s Ethernet cable.

What is the maximum tolerable latency for an NVMe PDU across an Ethernet switch fabric within a data center/private cloud, and across geographically distributed data centers/private clouds?
There is no specified maximum latency for NVMe/TCP. In practice, network latency is not an issue. The NVMe Keep Alive Timeout default is two minutes.

If you have more questions about NVMe/TCP or are a member interested in becoming more involved in the specification development process, please contact us to learn more. We know you’re still on the edge of your seat, so grab the popcorn and watch the full webcast here.