On Mon, Jan 22, 2018 at 8:46 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Here's a bit more info as I read the logs. Firstly, these are in fact
> Filestore OSDs... I was confused, but I don't think it makes a big
> difference.
>
> Next, all the other OSDs had indeed noticed that osd.2 had failed:
>
> 2018-01-22 18:37:20.456535 7f831728e700 -1 osd.0 598 heartbeat_check:
> no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22
> 18:36:59.514902 front 2018-01-22 18:36:59.514902 (cutoff 2018-01-22
> 18:37:00.456532)
>
> 2018-01-22 18:37:21.085178 7fc911169700 -1 osd.1 598 heartbeat_check:
> no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22
> 18:37:00.518067 front 2018-01-22 18:37:00.518067 (cutoff 2018-01-22
> 18:37:01.085175)
>
> 2018-01-22 18:37:21.408881 7f78b8ea4700 -1 osd.4 598 heartbeat_check:
> no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22
> 18:37:00.873298 front 2018-01-22 18:37:00.873298 (cutoff 2018-01-22
> 18:37:01.408880)
>
> 2018-01-22 18:37:21.117301 7f4ac8138700 -1 osd.3 598 heartbeat_check:
> no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22
> 18:37:01.092182 front 2018-01-22 18:37:01.092182 (cutoff 2018-01-22
> 18:37:01.117298)
>
> The only "reported failed" came from osd.0, who BTW was the only OSD
> who hadn't been marked down for not sending beacons:

And presumably osd.0 was the only one with a functioning connection to
the monitors. Why it was the only one, I'm not sure.

Things to consider:

1) Hard-killing an OSD and a monitor on the same host tends to cause
trouble. Firstly, because if the dead OSD was connected locally, you
have to go through the OSD heartbeat mark-down process, and that can be
a bit delayed by the other OSDs themselves having to time out their
monitor connections and reconnect.

2) Which monitor were the OSDs connected to, and how quickly did they
notice if they were connected to the dead one?

3) How are the constraints on marking down daemons set up on this
cluster? A single OSD per server, but with only 5 servers in a flat
host-only crush map, is a bit outside normal testing and design
patterns and may have tripped a bug.

-Greg

> 2018-01-22 18:37:20.457400 7fc1b51ce700 1
> mon.cephcta-mon-658cb618c9@0(leader).osd e598 prepare_failure osd.2
> 137.138.121.224:6800/1377 from osd.0 137.138.156.51:6800/1286 is
> reporting failure:1
> 2018-01-22 18:37:20.457457 7fc1b51ce700 0 log_channel(cluster) log
> [DBG] : osd.2 137.138.121.224:6800/1377 reported failed by osd.0
> 137.138.156.51:6800/1286
>
> So presumably it's because only 1 reporter showed up that osd.2 was
> never marked down.
> (1 being less than "mon_osd_min_down_reporters": "2")
>
> And BTW, I didn't mention before that the cluster came fully back to
> HEALTH_OK after I hard rebooted the osd.2 machine -- the other OSDs
> were unblocked and recovery healed everything:
>
> 2018-01-22 19:31:12.381762 7fc907956700 0 log_channel(cluster) log
> [WRN] : Monitor daemon marked osd.1 down, but it is still running
> 2018-01-22 19:31:12.381774 7fc907956700 0 log_channel(cluster) log
> [DBG] : map e602 wrongly marked me down at e601
>
> 2018-01-22 19:31:12.515178 7f78af691700 0 log_channel(cluster) log
> [WRN] : Monitor daemon marked osd.4 down, but it is still running
> 2018-01-22 19:31:12.515186 7f78af691700 0 log_channel(cluster) log
> [DBG] : map e602 wrongly marked me down at e601
>
> 2018-01-22 19:31:12.586532 7f4abe925700 0 log_channel(cluster) log
> [WRN] : Monitor daemon marked osd.3 down, but it is still running
> 2018-01-22 19:31:12.586544 7f4abe925700 0 log_channel(cluster) log
> [DBG] : map e602 wrongly marked me down at e601
>
> Thanks for the help solving this puzzle,
>
> Dan
>
>
> On Mon, Jan 22, 2018 at 8:07 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> Hi all,
>>
>> We just saw an example of one single down OSD taking down a whole
>> (small) luminous 12.2.2 cluster.
>>
>> The cluster has only 5 OSDs, on 5 different servers. Three of those
>> servers also run a mon/mgr combo.
>>
>> First, we had one server (mon+osd) go down legitimately [1] -- I can
>> tell when it went down because the mon quorum broke:
>>
>> 2018-01-22 18:26:31.521695 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121277 : cluster [WRN] Health check failed: 1/3
>> mons down, quorum cephcta-mon-658cb618c9,cephcta-mon-3e0d524825
>> (MON_DOWN)
>>
>> Then there's a long pileup of slow requests until the OSD is finally
>> marked down due to no beacon:
>>
>> 2018-01-22 18:47:31.549791 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121447 : cluster [WRN] Health check update: 372
>> slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-01-22 18:47:56.671360 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121448 : cluster [INF] osd.2 marked down after no
>> beacon for 903.538932 seconds
>> 2018-01-22 18:47:56.672315 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121449 : cluster [WRN] Health check failed: 1
>> osds down (OSD_DOWN)
>>
>> So, the first question is: why didn't that OSD get detected as failing
>> much earlier?
>>
>> The slow requests continue until, almost 10 minutes later, ceph marks
>> 3 of the other 4 OSDs down after seeing no beacons:
>>
>> 2018-01-22 18:56:31.727970 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121539 : cluster [INF] osd.1 marked down after no
>> beacon for 900.091770 seconds
>> 2018-01-22 18:56:31.728105 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121540 : cluster [INF] osd.3 marked down after no
>> beacon for 900.091770 seconds
>> 2018-01-22 18:56:31.728197 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121541 : cluster [INF] osd.4 marked down after no
>> beacon for 900.091770 seconds
>> 2018-01-22 18:56:31.730108 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121542 : cluster [WRN] Health check update: 4
>> osds down (OSD_DOWN)
>>
>> 900 is the default mon_osd_report_timeout -- why are these OSDs all
>> stuck not sending beacons? Why didn't they notice that osd.2 had
>> failed and then recover things onto the remaining OSDs?
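For what it's worth, the beacon timing in play there is governed by two
knobs that can be read straight off the running daemons over the admin
socket. A quick sketch, using daemon names taken from the logs above
(each command has to run on the host that owns that daemon's admin
socket):

  # How often each OSD sends its beacon to the mons (default 300s):
  ceph daemon osd.0 config get osd_beacon_report_interval

  # How long the mons tolerate missing beacons before marking an OSD
  # down (default 900s, which matches the "no beacon for ~900 seconds"
  # messages above):
  ceph daemon mon.cephcta-mon-658cb618c9 config get mon_osd_report_timeout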
>>
>> The config [2] is pretty standard, save for one possible culprit:
>>
>> osd op thread suicide timeout = 1800
>>
>> That's part of our standard config, mostly to prevent OSDs from
>> suiciding during FileStore splitting. (This particular cluster is 100%
>> bluestore, so admittedly we could revert that here.)
>>
>> Any idea what went wrong here?
>>
>> I can create a tracker and post logs if this is interesting.
>>
>> Best Regards,
>>
>> Dan
>>
>> [1] The failure mode of this OSD appears to be that its block device
>> just froze. It runs inside a VM, and the console showed several of
>> the typical 120s block dev timeouts. The machine remained pingable,
>> but wasn't doing any IO.
>>
>> [2] https://gist.github.com/dvanders/7eca771b6a8d1164bae8ea1fe45cf9f2
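To double-check how "standard" the rest of that config really is on a
running daemon, the admin socket can also dump exactly which options
differ from the compiled-in defaults. A rough sketch (osd.0 is just an
example ID; run it on that OSD's host):

  # List every option whose current value differs from the built-in
  # default -- the suicide-timeout override above should show up here:
  ceph daemon osd.0 config diff

  # Or inspect the single suspect option directly:
  ceph daemon osd.0 config get osd_op_thread_suicide_timeout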