Hi Mark,

Thanks again for looking into this problem.

I ran the cluster overnight with a script that checked for dead OSDs every second and restarted them. 40 OSD failures occurred in 12 hours; some OSDs failed multiple times (there are 50 OSDs in the EC tier).

Unfortunately, the collectl output doesn't appear to show any increase in disk queue depth or service time before the OSDs die. I've put a couple of examples of collectl output for the disks associated with the failed OSDs here:

https://hastebin.com/icuvotemot.scala

Please let me know if you need more info.

best regards,

Jake
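
For reference, below is a minimal sketch of the kind of watchdog loop described above. It assumes systemd-managed OSDs (ceph-osd@<id> units) and OSD ids 0-49 on the host; the actual script used for the overnight run is not shown, so the ids, poll interval, and unit names here are placeholders to adjust for your cluster.

#!/usr/bin/env python3
# Sketch of a watchdog that polls OSD systemd units every second and
# restarts any that are no longer active, logging each restart.
# Assumption: OSDs are run as systemd units named ceph-osd@<id>.
import subprocess
import time
from datetime import datetime

OSD_IDS = range(0, 50)   # placeholder: the 50 OSDs in the EC tier
POLL_INTERVAL = 1        # seconds between checks, as in the overnight run

def osd_active(osd_id):
    # systemctl is-active exits 0 when the unit is active
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", f"ceph-osd@{osd_id}"]
    )
    return result.returncode == 0

def restart_osd(osd_id):
    # Log the failure time so it can be correlated with collectl output
    print(f"{datetime.now().isoformat()} restarting ceph-osd@{osd_id}", flush=True)
    subprocess.run(["systemctl", "restart", f"ceph-osd@{osd_id}"])

if __name__ == "__main__":
    while True:
        for osd_id in OSD_IDS:
            if not osd_active(osd_id):
                restart_osd(osd_id)
        time.sleep(POLL_INTERVAL)

The timestamps it prints can be lined up against the collectl samples for the corresponding disks when checking for queue-depth or service-time changes before each failure.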