How SSDs Fail – NVMe™ SSD Management, Error Reporting, and Logging Capabilities

By Jonmichael Hands, NVMe MWG Co-Chair, Intel       

NVMe™ technology was built from the ground up for SSDs, and the original NVMe specification included a standard SMART (Self-Monitoring, Analysis and Reporting Technology) log that monitored errors, device health, and endurance. At the time, SAS/SATA drives had SMART capability, but it was vendor specific (tools had to parse data by vendor) and the data wasn’t widely trusted. I can’t overstate how important this was in the NVMe architecture: it created an industry-standard SMART log page containing the most common information needed to monitor an SSD. Ultimately, it was a tool to help hold vendors accountable for accurate and correct data reporting.

Many capabilities have been built into NVMe technology since, including enhanced error reporting, logging, management, debug, and telemetry. These capabilities can be built into tools ranging from open source management tools to OEM management consoles to help monitor the status and health of the SSD (for example, notifying users when an SSD failure occurs). More importantly, customers want to ensure smooth and normal operation of their SSDs and, when failures do happen, to understand where and why things are failing.

Management tools, log pages, endurance monitoring and more can help pinpoint when a device fails, along with the number and type of errors. These errors could include hardware failures, integrity errors, media errors, temperature issues and more. Before we dive into specifics on the capabilities of NVMe technology, it is important to understand how SSDs fail; then we can use the tools to help predict and prevent failures. SSD failures generally fall into these categories:

  • System incompatibility – In this situation, there is nothing wrong with the SSD, but compatibility bugs are preventing normal operation. An example would be a system hang or no enumeration of the SSD. A customer would generally return an SSD to the manufacturer if this happened.
  • SSD Endurance – SSD endurance is finite and writing data will eventually wear out an SSD. The good news is that this can be accurately predicted and modeled by understanding the workload and the SSD, and NVMe technology can report the statistics to monitor this in real time.
  • Firmware errors – SSD firmware is complex and must handle many corner cases of workloads and states for transferring data. SSD vendors try to eliminate as many firmware issues as possible prior to going to production, but even thorough validation and verification can’t catch every firmware issue. Firmware failures account for the majority of SSD failures!
  • Media Errors – There are many different classes of SSDs, some with end-to-end data protection, power loss protection, and redundancy within the SSD media through RAID, XOR, or other technologies. But NAND flash and other storage and memory classes do have failures, and too many will cause the SSD to stop functioning.
  • Hardware errors – capacitors, resistors, and power management circuits can fail. These are rarer but are more catastrophic when they do happen.
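To make the endurance point concrete, here is a minimal sketch of how a host tool might project remaining life from two SMART values, Percentage Used and Power On Hours. It assumes wear continues at the drive's historical average rate, which is a simplification; the function names are hypothetical and not part of any NVMe tool.

```python
def estimate_remaining_hours(percentage_used: int, power_on_hours: int) -> float:
    """Linear extrapolation of remaining life.

    Assumption: the future workload matches the average workload so far.
    """
    if percentage_used <= 0:
        raise ValueError("no measurable wear yet; cannot extrapolate")
    if percentage_used >= 100:
        return 0.0  # rated endurance already consumed
    hours_per_percent = power_on_hours / percentage_used
    return hours_per_percent * (100 - percentage_used)


def data_units_to_bytes(data_units: int) -> int:
    """SMART data unit counters are reported in thousands of 512-byte units."""
    return data_units * 1000 * 512


if __name__ == "__main__":
    # A drive that used 20% of its rated life in one year (8760 hours).
    print(estimate_remaining_hours(20, 8760))
    print(data_units_to_bytes(1))
```

A real model would also account for workload changes and write amplification, but even this simple projection shows why a standard, trustworthy wear indicator is valuable.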

Log Pages

Log Pages are maintained in the SSD and can be read by host software at any time. Below are the various log pages NVMe technology utilizes:

  • Error Log Page

The Error Log Page records errors so that none go unreported or missing. Each entry carries important information regarding the number of errors, which queue they came from, and which data and namespaces were affected. This is critical to identifying problematic drives and finding the root cause of errors in the system.
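As an illustration, the sketch below decodes the leading fields of one 64-byte error log entry. The offsets reflect my reading of the Error Information log entry layout in the NVMe 1.4 specification and should be checked against the spec before use; the class and function names are hypothetical.

```python
import struct
from typing import NamedTuple


class ErrorLogEntry(NamedTuple):
    error_count: int       # running count of errors; 0 means the entry is unused
    sq_id: int             # submission queue the failing command came from
    cmd_id: int            # command identifier of the failing command
    status_code_type: int  # SCT, e.g. 2 = media and data integrity errors
    status_code: int       # SC within that type
    lba: int               # first LBA that experienced the error
    nsid: int              # namespace the error is associated with


def parse_error_entry(entry: bytes) -> ErrorLogEntry:
    if len(entry) < 64:
        raise ValueError("an error log entry is 64 bytes")
    # Little-endian: u64 count, u16 SQ ID, u16 CID, u16 status field,
    # u16 parameter error location, u64 LBA, u32 NSID.
    count, sq_id, cmd_id, status, _param_loc, lba, nsid = struct.unpack_from(
        "<QHHHHQI", entry, 0
    )
    # Bit 0 of the status field is the phase tag; SC and SCT follow it.
    status_code = (status >> 1) & 0xFF
    status_code_type = (status >> 9) & 0x7
    return ErrorLogEntry(count, sq_id, cmd_id, status_code_type, status_code, lba, nsid)
```

Grouping parsed entries by queue or namespace is one way a tool could spot the "problematic drive" patterns described above.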

  • SMART Log Page

The SMART log page is used to report general health information about the drive. Its main health indicator is called the critical warning, which warns of a problem in the drive and tells the host the type of issue: the drive may be in a degraded or read-only mode due to media errors, it may be exceeding its temperature threshold, or there may be a hardware failure. The SMART log page also summarizes the error log page for media or data integrity errors and lists the number of unsafe shutdowns caused by power loss events. Lastly, the SMART page is useful to monitor endurance. By checking the SMART Percentage Used field, a system integrator can view the SSD life left as an easy-to-read percentage of total life used/available. To best utilize this feature, vendors can set an available spare threshold so that the host is notified when spares fall below it.
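The fields described above can be pulled out of the raw 512-byte SMART / Health log page with a few lines of code. This is a minimal sketch, assuming the page has already been read from the drive (for example by a management tool); the byte offsets follow my reading of the NVMe 1.4 layout, and the names `SmartSummary` and `parse_smart_log` are hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import List

# Meaning of the Critical Warning bits (byte 0 of the log page).
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


@dataclass
class SmartSummary:
    warnings: List[str]
    temperature_c: int
    available_spare_pct: int
    spare_threshold_pct: int
    percentage_used: int
    unsafe_shutdowns: int
    media_errors: int


def _u128(page: bytes, offset: int) -> int:
    """Read a 16-byte little-endian counter."""
    return int.from_bytes(page[offset:offset + 16], "little")


def parse_smart_log(page: bytes) -> SmartSummary:
    if len(page) < 512:
        raise ValueError("the SMART / Health log page is 512 bytes")
    critical = page[0]
    (temp_kelvin,) = struct.unpack_from("<H", page, 1)  # composite temperature
    return SmartSummary(
        warnings=[
            name for bit, name in CRITICAL_WARNING_BITS.items()
            if critical & (1 << bit)
        ],
        temperature_c=temp_kelvin - 273,  # reported in Kelvin
        available_spare_pct=page[3],
        spare_threshold_pct=page[4],
        percentage_used=page[5],
        unsafe_shutdowns=_u128(page, 144),
        media_errors=_u128(page, 160),
    )
```

Because the layout is standardized, the same parser works across vendors, which is exactly what the vendor-specific SAS/SATA SMART data could not offer.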

  • Persistent Event Log

Added in the NVMe 1.4 specification, the Persistent Event Log can be compared to a black box recorder for the SSD. It logs events occurring on the SSD, such as errors, firmware updates, formatting, and more, in a human-readable and timestamped form. This is extremely useful to an OEM or OS vendor looking to identify and manage their device and pinpoint when a specific event or failure happened. Those interested in learning more about the Persistent Event Log can visit the “Changes in NVMe Revision 1.4” webpage.

  • Telemetry – adding debug capability to NVMe technology

Telemetry enables SSD vendors and manufacturers to collect internal logs upon device failure. Standard human-readable logs are encouraged here due to customer sensitivity around IP and internal data collection. Log collection can be either host or controller initiated, but it generally makes sense for the host (the customer in this case) to read out the telemetry log when a device fails and send it to the SSD vendor or OEM they purchased the drive from for further analysis. As we saw in the introduction, firmware issues are a major cause of SSD failures, and a telemetry log allows vendors to get to the root cause when failures occur in the field.

Event and Error Reporting

Along with log pages, many NVMe specification features work to report errors and operation failures. These reports help identify each specific type of error and how to recover the controller, drive and operating system.

  • Asynchronous Event Request

Asynchronous events are used to notify host software of status, error and health information. NVMe controllers or drives report an event to the host software when an error occurs, attributes on the drive change, SMART health status changes, or a management event is completed. The most important capability here is that the NVMe controller (the drive in most cases) can notify the host asynchronously when a critical warning happens, and the operating system or system console can immediately report this to the user. To find out more details about the types of events defined, visit page 96 of the NVMe 1.4 specification.
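When an asynchronous event completes, the controller packs the event details into Dword 0 of the completion entry. The sketch below shows how host software might unpack it; the bit positions reflect my understanding of the NVMe 1.4 specification, and `decode_aer_completion` is a hypothetical helper, not a driver API.

```python
# Asynchronous Event Type values (bits 2:0 of completion Dword 0).
AER_EVENT_TYPES = {
    0: "error status",
    1: "SMART / health status",
    2: "notice",
    6: "I/O command set specific status",
    7: "vendor specific",
}


def decode_aer_completion(dw0: int) -> dict:
    """Split completion Dword 0 of an Asynchronous Event Request."""
    event_type = dw0 & 0x7
    return {
        "event_type": AER_EVENT_TYPES.get(event_type, "reserved"),
        "event_info": (dw0 >> 8) & 0xFF,    # meaning depends on the event type
        "log_page_id": (dw0 >> 16) & 0xFF,  # log page to read for details
    }


if __name__ == "__main__":
    # A SMART / health event pointing the host at log page 02h.
    print(decode_aer_completion(0x020101))
```

The log page identifier tells the host which page to read next, so a SMART event naturally leads to a read of the SMART log page for the details.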

  • Operation failures

The NVMe specification includes a section dedicated to error reporting and recovery for the controller/drive, driver and operating system. It is mostly used by device drivers and host software to identify critical failures of NVM subsystems and NVMe controllers. This section can be found on page 400 of the NVMe 1.4 specification.

  • Rebuild Assist

Rebuild Assist was added as an option in the NVMe 1.4 specification. It defines a new Get LBA Status capability that identifies potentially unrecoverable LBAs to the host. This status is used to determine which LBAs on a device need to be recovered by the host from another location and rewritten. One of the top use cases for Rebuild Assist is replacing background data scrubs for SSDs, since the SSD firmware is generally already doing this analysis internally and now has a way to report it to the host. The host generally has redundant copies of data and now has an opportunity to recover the data from a valid copy.
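The host-side recovery flow can be sketched as follows. This is only an illustration of the idea: it assumes the host has already obtained the list of potentially unrecoverable LBA ranges (the output of Get LBA Status), and it models the replica and the primary drive as simple callables. All names here are hypothetical.

```python
from typing import Callable, Iterable, Tuple

LbaRange = Tuple[int, int]  # (starting LBA, number of blocks)


def rebuild_flagged_ranges(
    flagged: Iterable[LbaRange],
    read_replica: Callable[[int], bytes],
    write_primary: Callable[[int, bytes], None],
) -> int:
    """Rewrite every flagged LBA on the primary from a valid redundant copy.

    Returns the number of blocks rewritten.
    """
    repaired = 0
    for start, count in flagged:
        for lba in range(start, start + count):
            write_primary(lba, read_replica(lba))
            repaired += 1
    return repaired


if __name__ == "__main__":
    # In-memory stand-ins for a replica and a primary with two bad blocks.
    replica = {lba: bytes([lba]) * 4 for lba in range(8)}
    primary = dict(replica)
    primary[2] = primary[3] = b"????"
    n = rebuild_flagged_ranges([(2, 2)], replica.__getitem__, primary.__setitem__)
    print(n, primary == replica)
```

The key design point is that only the flagged ranges are touched, instead of scrubbing the whole device to find out which blocks are at risk.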

 

Management

The management capability of NVMe Management Interface™ (NVMe-MI™) technology is critical for enterprise, cloud, and data center deployments. It is especially useful for OEMs that support multiple operating systems and benefit from one management console, which is a value add to end customers.

  • NVMe-MI Specification

The NVMe-MI specification manages NVMe SSDs outside of the operating system through the SMBus/MCTP and PCIe VDM interfaces. NVMe-MI architecture uses baseboard management controllers to check inventory, monitor for errors, track the SMART log and endurance, and report these through a management console. To learn more about the NVMe-MI specification, we invite you to read our NVMe-MI technology blog for a more in-depth explanation of its features and benefits. NVMe-MI really sets the NVMe standard apart from other storage interfaces by providing an entire specification dedicated to management of the storage devices.

Testing 

Testing features are useful to conduct diagnostics and ensure NVMe technology has been properly implemented.

  • Device Self-Test Command

The Device Self-Test command, defined on page 107 of the NVMe 1.4 specification, allows the host to start either a short or an extended (long) self-test for offline diagnostics. OEMs, ODMs and system integrators often use this command when integrating a new NVMe SSD into a larger system. For example, a system integrator or factory obtains SSDs from an SSD vendor, puts them into a larger server, and then runs the self-test command to ensure the drives are all functioning correctly. The NVMe specification includes an informative figure containing an example device self-test, pictured below.
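Once a self-test has run, the host reads the result from the Device Self-test log page. The sketch below decodes the current-operation header and the newest result entry; the offsets are my reading of the NVMe 1.4 log layout and should be verified against the specification, and `parse_self_test_log` is a hypothetical name.

```python
SELF_TEST_CODES = {0x1: "short", 0x2: "extended", 0xE: "vendor specific"}
RESULT_NO_ERROR = 0x0  # test completed without error


def parse_self_test_log(log: bytes) -> dict:
    """Decode the header and the newest result of the Device Self-test log."""
    if len(log) < 32:
        raise ValueError("need the 4-byte header plus one 28-byte result entry")
    current_op = log[0] & 0x0F    # 0 means no self-test is in progress
    percent_done = log[1] & 0x7F  # only meaningful while a test is running
    status = log[4]               # first byte of the newest result entry
    return {
        "in_progress": current_op != 0,
        "percent_complete": percent_done,
        "newest_test": SELF_TEST_CODES.get(status >> 4, "unknown"),
        "newest_result": status & 0x0F,
        "newest_passed": (status & 0x0F) == RESULT_NO_ERROR,
    }
```

In the factory scenario above, an integration script could start the test on every drive, poll until `in_progress` is false, and flag any drive where `newest_passed` is not true.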

How Can You Get Involved?

As we have seen, NVMe technology has a robust suite of features and capabilities to help monitor, manage and deploy NVMe SSDs at scale. As an open standards organization, NVM Express is constantly improving and getting real feedback from SSD vendors, OEMs, ODMs, and hyperscale cloud service providers on what matters in real world deployments. To keep up with this feedback and the evolving storage landscape, NVM Express updates its specifications by adding needed features such as Persistent Event Log, the NVMe Management Interface and more. This year, we look forward to continuing to augment NVM Express technology and making the end-user experience both simple and seamless.

Members interested in contributing their expertise to the NVM Express specifications are encouraged to join one of our NVM Express Working Groups.