How Microsoft Uses NVMe™ Solutions to Power Azure and its Data Centers


By Lee Prewitt, Microsoft

Microsoft runs its data centers on Azure, our set of cloud computing services for building, testing, deploying, and managing applications and services. For nearly a decade, this open, flexible, enterprise-grade technology (now available in more than 54 regions around the world) has been Microsoft’s answer to the rapidly expanding storage industry.[1]

Let’s talk NVMe™ technology and Azure. Microsoft’s Azure data center capacity is enormous: most servers in a Microsoft data center have at least six NVMe drives, which adds up to millions of drives in use. As you can imagine, a system of this scale presents its challenges, including form factors, rot in place, debugging, telemetry and security.

Challenge One: Form Factors

From Microsoft’s perspective, power and thermal constraints, along with the lack of hot-swap support, have caused the M.2 form factor to run its course in the enterprise. Enter EDSFF, the Enterprise and Datacenter SSD Form Factor Working Group, and its collection of form factors. The E1 family of form factors is replacing M.2 because it was built from the ground up for data center use cases while also supporting NVMe specifications.

Source: Intel

Source: OCP Storage

Source: https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2018/20180808_SSDS-201-1_Shaw.pdf

Challenge Two: Rot in Place – Graceful Degradation of SSDs

SSDs can be used inefficiently in the data center, resulting in poor performance and shortened life. We needed a way to make better use of NVMe SSDs and extend the life of each drive, and we found it in NVMe’s Zoned Namespaces (ZNS) technology. ZNS is an excellent solution for Azure’s data centers because it brings aspects of shingled magnetic recording (SMR) to SSDs, allowing engineers to write data sequentially in well-defined chunks called zones. The data may then be read randomly, and the zone can later be reset. A reset returns the zone’s write pointer to the start of the zone, and the process starts all over again. We also use Zoned Namespaces to minimize garbage collection, which is extremely useful when large writes occur. You can find out more about ZNS at NVM Express and Zonedstorage.io.
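The write, read and reset cycle described above can be sketched as a toy model. This is only an illustration, not how Azure or any drive implements ZNS: the `Zone` class, its method names and the `ZoneError` exception are hypothetical, the zone is an in-memory list, and reads past the write pointer are simply rejected to keep the model small.

```python
class ZoneError(Exception):
    """Raised when an operation breaks the rules of this toy zone model."""


class Zone:
    """Hypothetical in-memory stand-in for one zone of a zoned namespace."""

    def __init__(self, capacity):
        self.capacity = capacity          # number of blocks the zone can hold
        self.blocks = [None] * capacity   # block contents
        self.write_pointer = 0            # next block that may be written

    def write(self, offset, data):
        # Writes must land exactly at the write pointer, so they are sequential.
        if self.write_pointer >= self.capacity:
            raise ZoneError("zone is full; reset it before writing again")
        if offset != self.write_pointer:
            raise ZoneError("writes must be sequential at the write pointer")
        self.blocks[offset] = data
        self.write_pointer += 1

    def read(self, offset):
        # Any block that has already been written can be read, in any order.
        if not 0 <= offset < self.write_pointer:
            raise ZoneError("block has not been written")
        return self.blocks[offset]

    def reset(self):
        # Discard the whole zone at once and rewind the write pointer.
        self.blocks = [None] * self.capacity
        self.write_pointer = 0


zone = Zone(capacity=4)
for i, chunk in enumerate([b"a", b"b", b"c", b"d"]):
    zone.write(i, chunk)      # sequential writes fill the zone
first_read = zone.read(2)     # random read of an already written block
zone.reset()                  # the zone is empty again and the cycle restarts
```

Because the host discards data a whole zone at a time, the drive in this model never has to relocate live blocks, which is the intuition behind the reduced garbage collection.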

Source: https://zonedstorage.io/introduction/zns/

Challenge Three: Debugging

In terms of debugging solutions, Microsoft couldn’t ask for a better option than NVMe technology. NVMe architecture allows Microsoft to timestamp drive events precisely and correlate them with individual system events. One example of this is through the power of a drive, which will provide the actual wall clock time during the debugging process. If issues arise in the drive, our engineers can recover the precise timestamp at which they arose and then go to that specific entry to resolve the problem.
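As an illustration of how a drive timestamp maps to wall clock time, here is a minimal sketch. It assumes the mechanism involved is the NVMe Timestamp feature (Feature Identifier 0Eh), whose 8-byte data structure carries a 48-bit count of milliseconds since midnight, 1 January 1970 UTC, followed by an attributes byte. The post does not say which mechanism Microsoft uses, and the function name `decode_nvme_timestamp` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_nvme_timestamp(raw):
    """Decode an 8-byte Timestamp feature payload (layout assumed as described above)."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    millis = int.from_bytes(raw[0:6], "little")   # bytes 0-5: milliseconds since epoch
    attributes = raw[6]                           # byte 6: Synch and Timestamp Origin
    return {
        "wall_clock": EPOCH + timedelta(milliseconds=millis),
        # Synch = 1 means the controller may have stopped counting at some point.
        "synch": bool(attributes & 0x01),
        # Origin 0: cleared by a controller reset; 1: set by the host.
        "origin": (attributes >> 1) & 0x07,
    }


# Example payload: 1,600,000,000,000 ms after the epoch, timestamp set by the host.
payload = (1_600_000_000_000).to_bytes(6, "little") + bytes([0b00000010, 0])
decoded = decode_nvme_timestamp(payload)
```

With drive events expressed in UTC like this, they can be lined up directly against host logs that use the same clock.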

Challenge Four: Telemetry

Telemetry refers to an umbrella of tools, utilities and protocols used to remotely extract and decode information for debugging potential issues. NVMe telemetry helps Azure data centers standardize how issues are handled when they arise, putting a process in place for data to be collected, stored and reported back to an In-House Programmer (IHP). Another great feature of NVMe telemetry is that it works over industry-standard protocols and eliminates or minimizes the need to remove SSDs from vendor systems to retrieve things like debug logs.

NVMe telemetry is also helpful for Azure’s bug checks and supports Microsoft’s need for transparency during certain processes. Our IHPs receive data directly, which they use to figure out what happened at any given moment. This applies to both host-initiated I/O failures and drive-initiated firmware panic situations.
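To make the "extract and decode" step concrete, here is a minimal sketch of reading the first 512-byte block of a telemetry log. It assumes the header layout of the NVMe Telemetry Host-Initiated (log identifier 07h) and Controller-Initiated (08h) log pages as I understand them from the NVMe base specification; the function name `parse_telemetry_header` and the dictionary keys are hypothetical, and the data areas themselves are vendor specific, so this only works out how much data there is to collect.

```python
import struct

BLOCK_SIZE = 512  # telemetry data is described in 512-byte blocks


def parse_telemetry_header(header):
    """Decode selected fields from the first block of a telemetry log (layout assumed)."""
    if len(header) < BLOCK_SIZE:
        raise ValueError("need the full 512-byte header block")
    # Bytes 8-13: last block number of data areas 1, 2 and 3 (little-endian).
    last1, last2, last3 = struct.unpack_from("<HHH", header, 8)
    return {
        "log_id": header[0],                        # 0x07 host, 0x08 controller
        "ieee_oui": bytes(header[5:8]),             # vendor identifier, raw bytes
        "data_area_last_blocks": (last1, last2, last3),
        # Block 0 is the header; data blocks run from 1 to the last block of area 3.
        "total_bytes": (last3 + 1) * BLOCK_SIZE,
        "controller_data_available": bool(header[382]),
        "reason_identifier": bytes(header[384:512]),  # vendor-specific reason
    }


# Synthetic header for illustration only; real data comes from the drive.
sample = bytearray(BLOCK_SIZE)
sample[0] = 0x07
struct.pack_into("<HHH", sample, 8, 2, 5, 9)
sample[382] = 1
info = parse_telemetry_header(bytes(sample))
```

A collection tool built along these lines would read `total_bytes` from the drive and hand the opaque data areas to whoever is able to decode them.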

Challenge Five: Security

Microsoft spends approximately one billion dollars per year on cybersecurity, and much of that investment goes toward making Azure a trusted cloud platform. Our primary security tool for Azure is Project Cerberus, a security co-processor that establishes a root of trust within itself for the hardware devices on a computing platform, allowing it to defend platform firmware. A security chip is included on every data center motherboard and interacts with the firmware, which will not boot unless the chip is connected and has communicated with it. NVMe secure boot technology supports Project Cerberus through the enablement of signed firmware.

OCPSummit19 – The State of Hardware Security Cerberus Present and Future – Presented by Microsoft

In conclusion, NVMe technology enables Microsoft’s Azure data center ecosystem to perform at higher, more efficient levels. It also helps engineers communicate, maintain control and understand internal processes such as debugging and security issues.

Learn More About NVMe Technology and Microsoft

To gain additional insight into how hyperscalers like Facebook and Microsoft chose NVMe flash technology for their storage, watch my FMS 2019 presentation on how NVMe and Microsoft work together.

[1] https://azure.microsoft.com/en-us/global-infrastructure