We had several PostgreSQL servers running these disks from Dell. Numerous failures, including one server that had 3 die at once. Dell claims it is a firmware issue and instructed us to upgrade to QDV1DP15 from QDV1DP12 (I am not sure how these line up with the Intel firmwares). We lost several more drives during the upgrade process. We are using ZFS with these drives, so I can confirm it is not a Ceph Bluestore-only issue.
On Mon, Feb 18, 2019 at 8:44 AM David Turner <drakonstein@xxxxxxxxx> wrote:
We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk (partitioned), 3 disks per node, 5 nodes per cluster. The clusters are 12.2.4 running CephFS and RBDs. So in total we have 15 NVMes per cluster and 30 NVMes in total. They were all built at the same time and were running firmware version QDV10130. On this firmware version we had 2 disk failures early on, a few months later we had 1 more, and then a month after that (just a few weeks ago) we had 7 disk failures in 1 week.

The failures are such that the disk is no longer visible to the OS. This holds true beyond server reboots, as well as placing the failed disks into a new server. With a firmware upgrade tool we got an error that pretty much said there's no way to get the data back and to RMA the disk. We upgraded all of our remaining disks' firmware to QDV101D1 and haven't had any problems since then. Most of our failures happened while rebalancing the cluster after replacing dead disks, and we tested rigorously around that use case after upgrading the firmware. This firmware version seems to have resolved whatever the problem was.

We have about 100 more of these scattered among database servers and other servers that have never had this problem while running the QDV10130 firmware, as well as firmwares between this one and the one we upgraded to. Bluestore on Ceph is the only use case we've had so far with this sort of failure.

Has anyone else come across this issue before? Our current theory is that Bluestore is accessing the disk in a way that triggers a bug in the older firmware version that isn't triggered by more traditional filesystems. We have a scheduled call with Intel to discuss this, but their preliminary searches into the bugfixes and known problems between firmware versions didn't indicate the bug that we triggered.
It would be good to have some more information about what those differences in disk access patterns might be, to hopefully get a better answer from them as to what the problem is.
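For anyone wanting to audit a fleet for the affected firmware revision, it can be read with smartmontools (`smartctl -i` prints a "Firmware Version" line for NVMe devices). A minimal sketch; the `fw_rev` helper is illustrative, not a standard tool, and the version strings are the ones reported in this thread:

```shell
#!/bin/sh
# Print the firmware revision field from `smartctl -i` output on stdin.
# (fw_rev is an illustrative helper, not part of smartmontools.)
fw_rev() {
  awk -F': *' '/^Firmware Version/ {print $2}'
}

# Example sweep, assuming smartmontools is installed and the drives
# appear as /dev/nvme0 .. /dev/nvmeN (adjust the glob to your layout):
#   for d in /dev/nvme[0-9]*; do
#     echo "$d: $(smartctl -i "$d" | fw_rev)"
#   done
# Any drive still reporting QDV10130 would be a candidate for the upgrade.
```

The vendor's own upgrade tool should still be the source of truth for which revision a given model needs; this only helps spot drives that haven't been touched yet.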
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com