I'm running some S4610s (SSDPE2KE064T8) with firmware VDV10140 and haven't had any problems with them for six months. But I remember that around September 2017, Supermicro warned me about a firmware bug on the S4600 (I don't know which firmware version).

----- Original Message -----
From: "David Turner" <drakonstein@xxxxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Monday, 18 February 2019 16:44:18
Subject: Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

We have 2 clusters of these disks [1], with 2 Bluestore OSDs per disk (partitioned), 3 disks per node, and 5 nodes per cluster. The clusters are on 12.2.4 running CephFS and RBDs, so in total we have 15 NVMes per cluster and 30 NVMes overall. They were all built at the same time and were running firmware version QDV10130.

On that firmware version we had 2 disk failures early on, 1 more a few months later, and then a month after that (just a few weeks ago) 7 disk failures in 1 week. The failures are such that the disk is no longer visible to the OS. This holds true across server reboots as well as placing the failed disks into a new server. A firmware upgrade tool gave us an error that essentially said there is no way to get the data back and to RMA the disk.

We upgraded all of our remaining disks' firmware to QDV101D1 and haven't had any problems since. Most of our failures happened while rebalancing the cluster after replacing dead disks, and we tested rigorously around that use case after upgrading the firmware. This firmware version seems to have resolved whatever the problem was.

We have about 100 more of these disks scattered among database servers and other servers that have never had this problem while running the QDV10130 firmware, as well as firmwares between that one and the one we upgraded to. Bluestore on Ceph is the only use case where we've seen this sort of failure. Has anyone else come across this issue before?
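(When chasing a firmware-specific failure like this, it helps to inventory firmware revisions across a fleet quickly. A minimal sketch, assuming the `nvme` CLI from nvme-cli is installed: the sample output embedded below is hypothetical, and on a real node you would pipe the live `nvme list` output into the awk filter instead of the canned text.)

```shell
# Hedged sketch: pull the device node and "FW Rev" columns out of
# `nvme list` output so mixed firmware revisions stand out at a glance.
# The sample text below is hypothetical; on a real node you would run:
#   nvme list | awk 'NR > 2 { print $1, $NF }'
sample='Node             SN            Model                Namespace Usage              Format       FW Rev
---------------- ------------- -------------------- --------- ------------------ ------------ --------
/dev/nvme0n1     PHLE000000001 INTEL SSDPE2KE032T7  1         3.20 TB / 3.20 TB  512 B + 0 B  QDV10130
/dev/nvme1n1     PHLE000000002 INTEL SSDPE2KE032T7  1         3.20 TB / 3.20 TB  512 B + 0 B  QDV101D1'

# Skip the two header lines; first field is the device node, last is firmware.
printf '%s\n' "$sample" | awk 'NR > 2 { print $1, $NF }'
```

Running the same filter on every node (e.g. via ssh in a loop) gives a one-line-per-disk report that makes stragglers on the old firmware easy to spot.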
Our current theory is that Bluestore is accessing the disk in a way that triggers a bug in the older firmware which more traditional filesystems don't trigger. We have a scheduled call with Intel to discuss this, but their preliminary searches through the bugfixes and known problems between those firmware versions didn't turn up the bug we hit. It would be good to have more information about what those differences in disk access patterns might be, to hopefully get a better answer from them as to what the problem is.

[1] https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com