Hi all,

We just saw an example of one single down OSD taking down a whole (small) luminous 12.2.2 cluster. The cluster has only 5 OSDs, on 5 different servers. Three of those servers also run a mon/mgr combo.

First, we had one server (mon+osd) go down legitimately [1] -- I can tell when it went down because the mon quorum broke:

  2018-01-22 18:26:31.521695 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121277 : cluster [WRN] Health check failed: 1/3 mons down, quorum cephcta-mon-658cb618c9,cephcta-mon-3e0d524825 (MON_DOWN)

Then there's a long pileup of slow requests until the OSD is finally marked down due to no beacon:

  2018-01-22 18:47:31.549791 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121447 : cluster [WRN] Health check update: 372 slow requests are blocked > 32 sec (REQUEST_SLOW)
  2018-01-22 18:47:56.671360 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121448 : cluster [INF] osd.2 marked down after no beacon for 903.538932 seconds
  2018-01-22 18:47:56.672315 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121449 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)

So the first question is: why wasn't that OSD detected as failing much earlier?

The slow requests continue until, almost 10 minutes later, ceph marks 3 of the other 4 OSDs down after seeing no beacons:

  2018-01-22 18:56:31.727970 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121539 : cluster [INF] osd.1 marked down after no beacon for 900.091770 seconds
  2018-01-22 18:56:31.728105 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121540 : cluster [INF] osd.3 marked down after no beacon for 900.091770 seconds
  2018-01-22 18:56:31.728197 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121541 : cluster [INF] osd.4 marked down after no beacon for 900.091770 seconds
  2018-01-22 18:56:31.730108 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121542 : cluster [WRN] Health check update: 4 osds down (OSD_DOWN)

900 is the default mon_osd_report_timeout -- so why were these OSDs all stuck, not sending beacons? And why hadn't they noticed that osd.2 had failed and recovered things onto the remaining OSDs?

The config [2] is pretty standard, save for one possible culprit:

  osd op thread suicide timeout = 1800

That's part of our standard config, mostly to prevent OSDs from suiciding during FileStore splitting. (This particular cluster is 100% bluestore, so admittedly we could revert that here.)

Any idea what went wrong here? I can create a tracker and post logs if this is interesting.

Best Regards,
Dan

[1] The failure mode of this OSD appears to be that its block device just froze. It runs inside a VM, and the console showed several of the typical 120s block dev timeouts. The machine remained pingable, but wasn't doing any IO.

[2] https://gist.github.com/dvanders/7eca771b6a8d1164bae8ea1fe45cf9f2
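P.S. In case it helps whoever digs into this, here is a minimal sketch of how I'd double-check the relevant timers on the running daemons via the admin socket (the daemon ids below are just the ones from this cluster, each command has to run on the host where that daemon lives, and the quoted defaults are the luminous ones as I understand them):

  # beacon interval the OSD believes it is using (default 300s)
  ceph daemon osd.1 config get osd_beacon_report_interval

  # how long a mon waits without a beacon before marking an OSD down (default 900s)
  ceph daemon mon.cephcta-mon-658cb618c9 config get mon_osd_report_timeout

  # grace for the OSD peer-heartbeat path, which normally catches a dead OSD much sooner (default 20s)
  ceph daemon osd.1 config get osd_heartbeat_grace

And if we do revert the suicide-timeout override on this bluestore-only cluster, I assume it's just a matter of putting the value back to its default (150s, as far as I know) in ceph.conf and restarting the OSDs:

  [osd]
  # was: osd op thread suicide timeout = 1800
  osd op thread suicide timeout = 150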