Here's a bit more info as I read the logs.

Firstly, these are in fact FileStore OSDs... I was confused, but I don't think it makes a big difference.

Next, all the other OSDs had indeed noticed that osd.2 had failed:

2018-01-22 18:37:20.456535 7f831728e700 -1 osd.0 598 heartbeat_check: no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22 18:36:59.514902 front 2018-01-22 18:36:59.514902 (cutoff 2018-01-22 18:37:00.456532)
2018-01-22 18:37:21.085178 7fc911169700 -1 osd.1 598 heartbeat_check: no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22 18:37:00.518067 front 2018-01-22 18:37:00.518067 (cutoff 2018-01-22 18:37:01.085175)
2018-01-22 18:37:21.408881 7f78b8ea4700 -1 osd.4 598 heartbeat_check: no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22 18:37:00.873298 front 2018-01-22 18:37:00.873298 (cutoff 2018-01-22 18:37:01.408880)
2018-01-22 18:37:21.117301 7f4ac8138700 -1 osd.3 598 heartbeat_check: no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22 18:37:01.092182 front 2018-01-22 18:37:01.092182 (cutoff 2018-01-22 18:37:01.117298)

The only "reported failed" came from osd.0, which, BTW, was the only OSD that hadn't been marked down for not sending beacons:

2018-01-22 18:37:20.457400 7fc1b51ce700 1 mon.cephcta-mon-658cb618c9@0(leader).osd e598 prepare_failure osd.2 137.138.121.224:6800/1377 from osd.0 137.138.156.51:6800/1286 is reporting failure:1
2018-01-22 18:37:20.457457 7fc1b51ce700 0 log_channel(cluster) log [DBG] : osd.2 137.138.121.224:6800/1377 reported failed by osd.0 137.138.156.51:6800/1286

So presumably it's because only one reporter showed up that osd.2 was never marked down (1 being less than "mon_osd_min_down_reporters": "2"). A possible tweak for that is sketched in the P.S. below.

And BTW, I didn't mention before that the cluster came fully back to HEALTH_OK after I hard rebooted the osd.2 machine -- the other OSDs were unblocked and recovery healed everything:

2018-01-22 19:31:12.381762 7fc907956700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.1 down, but it is still running
2018-01-22 19:31:12.381774 7fc907956700 0 log_channel(cluster) log [DBG] : map e602 wrongly marked me down at e601
2018-01-22 19:31:12.515178 7f78af691700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.4 down, but it is still running
2018-01-22 19:31:12.515186 7f78af691700 0 log_channel(cluster) log [DBG] : map e602 wrongly marked me down at e601
2018-01-22 19:31:12.586532 7f4abe925700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.3 down, but it is still running
2018-01-22 19:31:12.586544 7f4abe925700 0 log_channel(cluster) log [DBG] : map e602 wrongly marked me down at e601

Thanks for the help solving this puzzle,

Dan
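P.S. Not saying it's the right fix here, but on a cluster with only 5 OSDs one option would be to lower that reporter threshold so a single surviving reporter is enough. Just a sketch, untested, and the value of 1 is for illustration only:

    # ceph.conf, [mon] section
    mon osd min down reporters = 1

or injected at runtime on the mons (it may warn that the change isn't picked up until a restart):

    ceph tell mon.* injectargs '--mon_osd_min_down_reporters 1'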
On Mon, Jan 22, 2018 at 8:07 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Hi all,
>
> We just saw an example of one single down OSD taking down a whole (small) luminous 12.2.2 cluster.
>
> The cluster has only 5 OSDs, on 5 different servers. Three of those servers also run a mon/mgr combo.
>
> First, we had one server (mon+osd) go down legitimately [1] -- I can tell when it went down because the mon quorum broke:
>
> 2018-01-22 18:26:31.521695 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121277 : cluster [WRN] Health check failed: 1/3 mons down, quorum cephcta-mon-658cb618c9,cephcta-mon-3e0d524825 (MON_DOWN)
>
> Then there's a long pileup of slow requests until the OSD is finally marked down due to no beacon:
>
> 2018-01-22 18:47:31.549791 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121447 : cluster [WRN] Health check update: 372 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-01-22 18:47:56.671360 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121448 : cluster [INF] osd.2 marked down after no beacon for 903.538932 seconds
> 2018-01-22 18:47:56.672315 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121449 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
>
> So, first question is: why didn't that OSD get detected as failing much earlier?
>
> The slow requests continue until almost 10 minutes later ceph marks 3 of the other 4 OSDs down after seeing no beacons:
>
> 2018-01-22 18:56:31.727970 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121539 : cluster [INF] osd.1 marked down after no beacon for 900.091770 seconds
> 2018-01-22 18:56:31.728105 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121540 : cluster [INF] osd.3 marked down after no beacon for 900.091770 seconds
> 2018-01-22 18:56:31.728197 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121541 : cluster [INF] osd.4 marked down after no beacon for 900.091770 seconds
> 2018-01-22 18:56:31.730108 mon.cephcta-mon-658cb618c9 mon.0 137.138.62.69:6789/0 121542 : cluster [WRN] Health check update: 4 osds down (OSD_DOWN)
>
> 900 is the default mon_osd_report_timeout -- why are these OSDs all stuck not sending beacons? Why haven't they noticed that the osd.2 had failed, then recover things on the remaining OSDs?
>
> The config [2] is pretty standard, save for one perhaps culprit:
>
>     osd op thread suicide timeout = 1800
>
> That's part of our standard config, mostly to prevent OSDs from suiciding during FileStore splitting. (This particular cluster is 100% bluestore, so admittedly we could revert that here).
>
> Any idea what went wrong here?
>
> I can create a tracker and post logs if this is interesting.
>
> Best Regards,
>
> Dan
>
> [1] The failure mode of this OSD appears like its block device just froze. It runs inside a VM and the console showed several of the typical 120s block dev timeouts. The machine remained pingable, but wasn't doing any IO.
>
> [2] https://gist.github.com/dvanders/7eca771b6a8d1164bae8ea1fe45cf9f2
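For completeness: the timeouts discussed above can be double-checked on the running daemons via the standard admin-socket "config get" command, e.g. (run on the host where each daemon lives, substituting your own daemon names):

    ceph daemon mon.cephcta-mon-658cb618c9 config get mon_osd_report_timeout
    ceph daemon osd.1 config get osd_beacon_report_interval
    ceph daemon osd.1 config get osd_op_thread_suicide_timeout

Nothing specific to this issue, just a quick way to confirm what the daemons are actually running with.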