Yeah, we have also observed problems with HP RAID controllers misbehaving when a single disk starts to fail. We would not recommend building a Ceph cluster on HP RAID controllers until they fix that issue.

There are several mechanisms in Ceph that detect dead disks: OSDs heartbeat each other with a timeout, and there is a timeout for OSDs checking in with the mons. But that's usually not enough in this scenario, because the OSD process keeps answering heartbeats even though its disk has become very slow.

The good news is that recent Ceph versions show which OSDs are implicated in slow requests (check "ceph health detail"), which at least gives you some way to figure out which OSDs are becoming slow.

We have found it useful to monitor the op_*_latency values of all OSDs (especially the subop latencies) from the admin daemon to detect such failures earlier.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Nov 21, 2018 at 16:22, Arvydas Opulskis <zebediejus@xxxxxxxxx> wrote:
>
> Hi all,
>
> it's not the first time we have had this kind of problem, usually with HP RAID controllers:
>
> 1. One disk is failing, bringing the whole controller into a slow state where its performance degrades dramatically.
> 2. Some OSDs are reported as down by other OSDs and marked down.
> 3. At the same time, other OSDs on the same node are not detected as failed and keep participating in the cluster. I think this is because the OSD is not aware of its backend disk problems and still answers health checks.
> 4. Because of this, requests to PGs on the problematic node become "slow" and later "stuck".
> 5. The cluster struggles and client operations are not performed, so the cluster ends up in a kind of "locked" state.
> 6. We need to mark those OSDs down manually (or stop the problematic daemons) so the cluster starts to recover and process requests again.
>
> Is there any mechanism in Ceph which monitors the OSDs involved in slow requests and marks them down after some kind of threshold?
>
> Thanks,
> Arvydas
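
To illustrate the latency-monitoring idea: below is a minimal sketch of such a check, run on each OSD host. It assumes the admin sockets live in the default /var/run/ceph/ location and that "ceph daemon osd.N perf dump" exposes the latency counters as avgcount/sum pairs under the "osd" section; the 1-second threshold is an arbitrary example, not a recommendation.

#!/usr/bin/env python3
"""Rough sketch: report average op/subop latencies of all local OSDs.

Assumptions: admin sockets in the default /var/run/ceph/ directory, and
latency counters in `ceph daemon osd.N perf dump` laid out as
{"avgcount": ..., "sum": ...} under the "osd" section.
"""

import glob
import json
import re
import subprocess

# Counters worth watching; subop latencies often show a failing disk
# before client-visible slow requests appear.
COUNTERS = ["op_r_latency", "op_w_latency", "subop_latency", "subop_w_latency"]
WARN_SECONDS = 1.0  # arbitrary example threshold


def local_osd_ids():
    """Find OSD ids from admin sockets like /var/run/ceph/ceph-osd.3.asok."""
    ids = []
    for path in glob.glob("/var/run/ceph/*osd.*.asok"):
        m = re.search(r"osd\.(\d+)\.asok$", path)
        if m:
            ids.append(int(m.group(1)))
    return sorted(ids)


def perf_dump(osd_id):
    """Return the parsed `perf dump` output of one local OSD."""
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    return json.loads(out)


def avg_latency(counter):
    """Average latency in seconds from one {avgcount, sum} counter."""
    count = counter.get("avgcount", 0)
    return counter["sum"] / count if count else 0.0


def main():
    for osd_id in local_osd_ids():
        try:
            osd_perf = perf_dump(osd_id).get("osd", {})
        except (subprocess.CalledProcessError, ValueError):
            print("osd.%d: admin socket not responding" % osd_id)
            continue
        for name in COUNTERS:
            if name not in osd_perf:
                continue
            avg = avg_latency(osd_perf[name])
            flag = "  <-- slow?" if avg > WARN_SECONDS else ""
            print("osd.%d %s: %.3fs%s" % (osd_id, name, avg, flag))


if __name__ == "__main__":
    main()

Note that these counters are cumulative since OSD start, so a real monitoring check should sample them periodically and diff successive avgcount/sum values rather than look at the lifetime average. And for the acute situation described in step 6, manually running "ceph osd down <id>" (or stopping the affected ceph-osd services) is still what forces the cluster to fail over; a check like the above only helps to find the right OSDs sooner.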