Your SSDs May Brick After 40,000 Hours

On 2020-03-20, HPE issued customer bulletin a00097382en_us (located here) which states:

HPE SAS Solid State Drives – Critical Firmware Upgrade Required for Certain HPE SAS Solid State Drive Models to Prevent Drive Failure at 40,000 Hours of Operation.

The HPE bulletin further states:

After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously. Restoration of data from backup will be required in non-fault tolerance modes (e.g., RAID 0) and in fault tolerance RAID mode if more drives fail than what is supported by the fault tolerance RAID mode logical drive.
(Editor’s Note: boldface is mine).

This issue does not appear to be limited to HPE drives. Dell issued a notice (located here) that deals with the same issue. In its notice, Dell says that the affected disks were made by SanDisk. There is a good chance that smaller vendors, or vendors that are no longer around, also used SanDisk drives with this same issue. We recommend checking all SSDs to make sure that they are running the latest firmware. You may also want to check whether they are approaching 40,000 hours of usage.
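To put the 40,000-hour figure in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the power-on hours value is something you would read from your own drive-reporting tool, and the projection assumes the drive stays powered on continuously from now on.

```python
from datetime import datetime, timedelta

# Threshold stated in the HPE bulletin.
FAILURE_HOURS = 40_000


def hours_remaining(power_on_hours: int) -> int:
    """Hours left before the drive reaches the threshold (never negative)."""
    return max(FAILURE_HOURS - power_on_hours, 0)


def projected_failure_time(power_on_hours: int, now: datetime) -> datetime:
    """Projected time the threshold is reached.

    Assumption: the drive runs 24 hours a day from `now` onward.
    """
    return now + timedelta(hours=hours_remaining(power_on_hours))


if __name__ == "__main__":
    # Example value only; substitute the hours reported by your drive.
    reported_hours = 38_500
    print("Hours remaining:", hours_remaining(reported_hours))
    print("Projected date:", projected_failure_time(reported_hours, datetime.now()))
```

At 24 hours a day, 40,000 hours works out to roughly four and a half years of continuous service.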

The Bad News

Organizations often encounter drive failures. Usually a failure is merely an inconvenience, since replacing a drive is rather simple; in this particular case, however, a drive failure could be a catastrophic event. Why? When these drives reach 40,000 hours of operation, they fail, and per the HPE bulletin neither the drive nor its data can be recovered. If the drives are used in a RAID configuration, and all the drives in a particular array were placed into service at the same time, the data protection that RAID offers will be useless, because all the drives will hit this issue nearly simultaneously.
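The RAID reasoning above can be expressed as a rough check. The sketch below is a simplification under stated assumptions: `array_at_risk`, `tolerated_failures`, and `rebuild_window_hours` are hypothetical names, the power-on hours are values you supply yourself, and the only thing it counts is how many drives reach the threshold within one rebuild window of the first failure.

```python
# Threshold stated in the HPE bulletin.
FAILURE_HOURS = 40_000


def array_at_risk(power_on_hours, tolerated_failures, rebuild_window_hours):
    """Return True if more drives reach the threshold close together
    than the array can tolerate.

    power_on_hours: power-on hours for each drive in the array.
    tolerated_failures: e.g. 0 for RAID 0, 1 for RAID 5, 2 for RAID 6.
    rebuild_window_hours: assumed time needed to replace and rebuild a drive.
    """
    if not power_on_hours:
        return False
    # Hours left for each drive, earliest failure first.
    remaining = sorted(FAILURE_HOURS - h for h in power_on_hours)
    first = remaining[0]
    # Drives that fail within one rebuild window of the first failure.
    clustered = sum(1 for r in remaining if r - first <= rebuild_window_hours)
    return clustered > tolerated_failures


if __name__ == "__main__":
    # Four drives installed together in a single-parity array: at risk.
    print(array_at_risk([30_000, 30_000, 30_000, 30_000], 1, 48))
    # Drives with staggered service dates: not flagged by this check.
    print(array_at_risk([30_000, 20_000, 10_000], 1, 48))
```

A result of False here does not mean the drives are safe; each affected drive still needs the firmware update described in the bulletins.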

The drives in question are certain 800GB and 1.6TB SAS models. The exact models are listed in the service bulletins.

The Good News

The good news is that the 40,000-hour limit will not hit the user community until October 2020. Better news is that HPE and Dell have developed tools to check for and correct this issue on vSphere, Linux, and Windows systems that have affected drives. For HPE systems, please refer to customer bulletin a00097382en_us (located here) for complete information regarding this issue. For Dell systems, please refer to the customer service bulletin located here.

Note that this is a separate issue from the one addressed by HPE in Customer Bulletin a00092491en_us (located here), which covers a critical issue in which certain HPE SAS SSDs fail after 32,768 hours of operation.