Re: Critical Information: DELL/Toshiba SSDs dying after 70,000 hours of operation

Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> · Fri, 1 Sep 2023 11:16:09 +0200 (CEST)

Hello, 

This message is to inform you that DELL has released new firmware for these SSD drives to fix the 70,000 power-on-hours (POH) issue:

- Toshiba A3B4 for model number(s) PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160:
  https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=69j5f&oscode=w12r2&productcode=poweredge-r730xd
- Toshiba A4B4 for model number(s) PX02SSF010, PX02SSF020, PX02SSF040 and PX02SSB080:
  https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=31jmh&lwp=rt
- Toshiba A5B4 for model number(s) PX03SNF020, PX03SNF080 and PX03SNB160:
  https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=tc8kc&lwp=rt
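
For anyone scripting an inventory check, the model-to-firmware mapping above can be written as a small lookup. This is only a sketch: the table is transcribed from the list above, and the names `FIXED_FIRMWARE` and `fixed_firmware_for` are made up for illustration. It does not query the drive or DELL's site.

```python
from typing import Optional

# Fixed firmware revision per drive model, transcribed from the list above.
FIXED_FIRMWARE = {
    "A3B4": ("PX02SMF020", "PX02SMF040", "PX02SMF080", "PX02SMB160"),
    "A4B4": ("PX02SSF010", "PX02SSF020", "PX02SSF040", "PX02SSB080"),
    "A5B4": ("PX03SNF020", "PX03SNF080", "PX03SNB160"),
}


def fixed_firmware_for(model: str) -> Optional[str]:
    """Return the firmware revision listed for `model`, or None if not listed."""
    wanted = model.strip().upper()
    for firmware, models in FIXED_FIRMWARE.items():
        if wanted in models:
            return firmware
    return None


if __name__ == "__main__":
    # Model strings as they might appear in an inventory export.
    for model in ("PX02SMF080", "PX03SNB160", "SOMEOTHERDRIVE"):
        print(model, "->", fixed_firmware_for(model))
```

A model that returns `None` is simply not in the list above; that says nothing about whether it is affected.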

Based on our recent experience, this firmware brings dead SSD drives back to life with their data intact. After the upgrade, you may need to import the foreign configuration by pressing the 'F' key at the next boot.

Many thanks to DELL French TAMs and DELL engineering for providing this firmware in a short time. 

Best regards, 
Frédéric. 

----- On 19 June 2023, at 10:46, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:

> Hello,

> This message does not concern Ceph itself but a hardware vulnerability which can
> lead to permanent loss of data on a Ceph cluster equipped with the same
> hardware in separate fault domains.

> The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives
> of the 13G generation of DELL servers are subject to a vulnerability which
> renders them unusable after 70,000 hours of operation, i.e. approximately 7
> years and 11 months of activity.
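
The conversion from hours to years can be checked with a few lines of Python. This is a rough sketch that assumes the drive has been powered on continuously (24x7) since it entered service; the function names are made up for illustration.

```python
from datetime import datetime, timedelta

LIMIT_HOURS = 70_000          # failure threshold reported in this thread
HOURS_PER_YEAR = 24 * 365.25  # average year length, leap years included


def hours_to_years_months(hours: float) -> tuple:
    """Convert a number of hours to whole (years, months), rounded down."""
    years_float = hours / HOURS_PER_YEAR
    years = int(years_float)
    months = int((years_float - years) * 12)
    return years, months


def projected_limit_date(in_service: datetime,
                         limit_hours: int = LIMIT_HOURS) -> datetime:
    """Date at which a drive powered on continuously since `in_service`
    would reach `limit_hours` (assumption: no downtime at all)."""
    return in_service + timedelta(hours=limit_hours)


if __name__ == "__main__":
    print(hours_to_years_months(LIMIT_HOURS))  # (7, 11)
```

Any downtime pushes the real date later than `projected_limit_date` suggests, so treat the result as the earliest possible date rather than a prediction.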

> This topic has been discussed here:
> https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-communication-on-the-same-date/td-p/8353438

> The risk is all the greater since these disks may die at the same time in the
> same server, leading to the loss of all data in that server.

> To date, DELL has not provided any firmware fixing this vulnerability, the
> latest firmware version being "A3B3" released on Sept. 12, 2016:
> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k

> If you have servers running these drives, check their uptime (power-on
> hours). If they are close to the 70,000 hour limit, replace them immediately.

> The smartctl tool does not report the uptime for these SSDs, but if you have
> HDDs in the server, you can query their SMART status and get their uptime,
> which should be about the same as the SSDs.
> The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the
> device number of the disk on the MegaRAID controller).
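
The HDD check described above could be scripted roughly as follows. This is a sketch, not a tested tool: the two regular expressions assume the power-on time is printed either as "Accumulated power on time, hours:minutes" (SCSI/SAS drives) or as SMART attribute 9 "Power_On_Hours" (ATA drives), and the wording may differ with the drive and the smartctl version. The function names are made up for illustration, and `megaraid_id` is the XX value from the command above.

```python
import re
import subprocess
from typing import Optional

LIMIT_HOURS = 70_000  # failure threshold reported in this thread

# Assumed output formats; adjust if your smartctl prints something else.
_SCSI_RE = re.compile(r"Accumulated power on time, hours:minutes\s+(\d+):\d+")
_ATA_RE = re.compile(r"^\s*9\s+Power_On_Hours\b.*\s(\d+)\s*$", re.MULTILINE)


def parse_power_on_hours(smartctl_output: str) -> Optional[int]:
    """Extract power-on hours from `smartctl -a` text, or None if not found."""
    for pattern in (_SCSI_RE, _ATA_RE):
        match = pattern.search(smartctl_output)
        if match:
            return int(match.group(1))
    return None


def hdd_power_on_hours(device: str, megaraid_id: int) -> Optional[int]:
    """Run smartctl against one disk behind a MegaRAID controller."""
    result = subprocess.run(
        ["smartctl", "-a", "-d", f"megaraid,{megaraid_id}", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses its exit status as a bitmask of findings
    )
    return parse_power_on_hours(result.stdout)


def hours_remaining(power_on_hours: int, limit: int = LIMIT_HOURS) -> int:
    """Hours left before the limit; zero or negative means it has passed."""
    return limit - power_on_hours
```

Calling `hdd_power_on_hours("/dev/sdc", XX)` on an HDD in the same server gives an estimate for the SSDs only if both were installed and powered on together.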

> We have informed DELL about this but have no information yet on the arrival of a
> fix.

> We have lost 6 disks, in 3 different servers, in the last few weeks. Our
> observation shows that the drives don't survive full shutdown and restart of
> the machine (power off then power on in iDrac), but they may also die during a
> single reboot (init 6) or even while the machine is running.

> Fujitsu released a corrective firmware in June 2021 but this firmware is most
> certainly not applicable to DELL drives:
> https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf

> Regards,
> Frederic

> Sous-direction Infrastructure and Services
> Direction du Numérique
> Université de Lorraine
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx