Here's a bit more info as I read the logs.

Firstly, these are in fact FileStore OSDs... I was confused, but I don't think it makes a big difference.

Next, all the other OSDs had indeed noticed that osd.2 had failed:

2018-01-22 18:37:20.456535 7f831728e700 -1 osd.0 598 heartbeat_check: no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22 18:36:59.514902 front 2018-01-22 18:36:59.514902 (cutoff 2018-01-22 18:37:00.456532)
2018-01-22 18:37:21.085178 7fc911169700 -1 osd.1 598 heartbeat_check: no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22 18:37:00.518067 front 2018-01-22 18:37:00.518067 (cutoff 2018-01-22 18:37:01.085175)
2018-01-22 18:37:21.408881 7f78b8ea4700 -1 osd.4 598 heartbeat_check: no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22 18:37:00.873298 front 2018-01-22 18:37:00.873298 (cutoff 2018-01-22 18:37:01.408880)
2018-01-22 18:37:21.117301 7f4ac8138700 -1 osd.3 598 heartbeat_check: no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22 18:37:01.092182 front 2018-01-22 18:37:01.092182 (cutoff 2018-01-22 18:37:01.117298)

The only "reported failed" came from osd.0, which, BTW, was the only OSD that hadn't been marked down for not sending beacons:

2018-01-22 18:37:20.457400 7fc1b51ce700 1 mon.cephcta-mon-658cb618c9@0(leader).osd e598 prepare_failure osd.2 137.138.121.224:6800/1377 from osd.0 137.138.156.51:6800/1286 is reporting failure:1
2018-01-22 18:37:20.457457 7fc1b51ce700 0 log_channel(cluster) log [DBG] : osd.2 137.138.121.224:6800/1377 reported failed by osd.0 137.138.156.51:6800/1286

So presumably it's because only one reporter showed up that osd.2 was never marked down (1 being less than "mon_osd_min_down_reporters": "2"). A possible tweak for that is sketched in the P.S. below.

And BTW, I didn't mention before that the cluster came fully back to HEALTH_OK after I hard rebooted the osd.2 machine -- the other OSDs were unblocked and recovery healed everything:

2018-01-22 19:31:12.381762 7fc907956700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.1 down, but it is still running
2018-01-22 19:31:12.381774 7fc907956700 0 log_channel(cluster) log [DBG] : map e602 wrongly marked me down at e601
2018-01-22 19:31:12.515178 7f78af691700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2018-01-22 19:31:12.515186 7f78af691700 0 log_channel(cluster) log [DBG] : map e602 wrongly marked me down at e601
2018-01-22 19:31:12.586532 7f4abe925700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.3 down, but it is still running
2018-01-22 19:31:12.586544 7f4abe925700 0 log_channel(cluster) log [DBG] : map e602 wrongly marked me down at e601

Thanks for the help solving this puzzle,

Dan
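P.S. Not saying it's the right fix here, but on a cluster with only 5 OSDs one option would be to lower that reporter threshold so a single surviving reporter is enough. Just a sketch, untested, and the value of 1 is for illustration only:

    # ceph.conf, [mon] section
    mon osd min down reporters = 1

or injected at runtime on the mons (it may warn that the change isn't picked up until a restart):

    ceph tell mon.* injectargs '--mon_osd_min_down_reporters 1'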
On Mon, Jan 22, 2018 at 8:07 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Hi all,
>
> We just saw an example of one single down OSD taking down a whole (small) luminous 12.2.2 cluster.
>
> The cluster has only 5 OSDs, on 5 different servers. Three of those servers also run a mon/mgr combo.
>
> First, we had one server (mon+osd) go down legitimately [1] -- I can tell when it went down because the mon quorum broke:
>
> 2018-01-22 18:26:31.521695 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121277 : cluster [WRN] Health check failed: 1/3 mons down, quorum cephcta-mon-658cb618c9,cephcta-mon-3e0d524825 (MON_DOWN)
>
> Then there's a long pileup of slow requests until the OSD is finally marked down due to no beacon:
>
> 2018-01-22 18:47:31.549791 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121447 : cluster [WRN] Health check update: 372 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-01-22 18:47:56.671360 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121448 : cluster [INF] osd.2 marked down after no beacon for 903.538932 seconds
> 2018-01-22 18:47:56.672315 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121449 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
>
> So, first question is: why didn't that OSD get detected as failing much earlier?
>
> The slow requests continue until almost 10 minutes later ceph marks 3 of the other 4 OSDs down after seeing no beacons:
>
> 2018-01-22 18:56:31.727970 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121539 : cluster [INF] osd.1 marked down after no beacon for 900.091770 seconds
> 2018-01-22 18:56:31.728105 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121540 : cluster [INF] osd.3 marked down after no beacon for 900.091770 seconds
> 2018-01-22 18:56:31.728197 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121541 : cluster [INF] osd.4 marked down after no beacon for 900.091770 seconds
> 2018-01-22 18:56:31.730108 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121542 : cluster [WRN] Health check update: 4 osds down (OSD_DOWN)
>
> 900 is the default mon_osd_report_timeout -- why are these OSDs all stuck not sending beacons? Why haven't they noticed that the osd.2 had failed, then recover things on the remaining OSDs?
>
> The config [2] is pretty standard, save for one perhaps culprit:
>
>     osd op thread suicide timeout = 1800
>
> That's part of our standard config, mostly to prevent OSDs from suiciding during FileStore splitting. (This particular cluster is 100% bluestore, so admittedly we could revert that here).
>
> Any idea what went wrong here?
>
> I can create a tracker and post logs if this is interesting.
>
> Best Regards,
>
> Dan
>
> [1] The failure mode of this OSD appears like its block device just froze. It runs inside a VM and the console showed several of the typical 120s block dev timeouts. The machine remained pingable, but wasn't doing any IO.
>
> [2] https://gist.github.com/dvanders/7eca771b6a8d1164bae8ea1fe45cf9f2
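For completeness: the timeouts discussed above can be double-checked on the running daemons via the standard admin-socket "config get" command, e.g. (run on the host where each daemon lives, substituting your own daemon names):

    ceph daemon mon.cephcta-mon-658cb618c9 config get mon_osd_report_timeout
    ceph daemon osd.1 config get osd_beacon_report_interval
    ceph daemon osd.1 config get osd_op_thread_suicide_timeout

Nothing specific to this issue, just a quick way to confirm what the daemons are actually running with.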