OSD out/down detection

Hi All,
 We have a Jewel (10.2.1) cluster on CentOS 7, running an ELRepo 4.4.1 kernel on all machines, and we have an issue where some of the machines hang - I'm not sure if it's hardware or the OS, but essentially the host, including the console, becomes unresponsive and can only be recovered with a hardware reset.  Unfortunately nothing useful is logged, so I am still trying to figure out what is causing this.  The consequence for Ceph is that when an OSD host goes down like this, only some of its OSDs get marked down.  In the instance on the weekend the host had 8 OSDs and only 5 were marked down - this led to the kRBD devices jamming up trying to send IO to the unresponsive OSDs that stayed marked up.
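
(The partial down state is easy to see with the standard status commands - included here just for completeness:)

    # cluster-wide health summary, including down OSDs and blocked requests
    ceph health detail
    # CRUSH tree view showing per-OSD up/down and in/out state, grouped by host
    ceph osd tree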

The machine went through a slow death, with lots of reports of slow or blocked requests:

2016-06-19 09:37:49.070810 osd.36 10.145.2.15:6802/31359 65 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.297258 secs
2016-06-19 09:37:54.071542 osd.36 10.145.2.15:6802/31359 82 : cluster [WRN] 112 slow requests, 5 included below; oldest blocked for > 35.297988 secs
2016-06-19 09:37:54.071737 osd.6 10.145.2.15:6801/21836 221 : cluster [WRN] 253 slow requests, 5 included below; oldest blocked for > 35.325155 secs
2016-06-19 09:37:59.072570 osd.6 10.145.2.15:6801/21836 251 : cluster [WRN] 262 slow requests, 5 included below; oldest blocked for > 40.325986 secs
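
(While the host was still partly responsive, the blocked ops could be inspected on those OSDs through the admin socket - for example, something like:)

    # ops currently in flight (including blocked ones) on osd.36
    ceph daemon osd.36 dump_ops_in_flight
    # recently completed slow ops, with per-op timing
    ceph daemon osd.36 dump_historic_ops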

And then, when the monitors did eventually mark them down, the OSDs disputed it:

2016-06-19 09:38:35.821716 mon.0 10.145.2.13:6789/0 244970 : cluster [INF] osd.6 10.145.2.15:6801/21836 failed (2 reporters from different host after 20.000365 >= grace 20.000000)
2016-06-19 09:38:36.950556 mon.0 10.145.2.13:6789/0 244978 : cluster [INF] osd.22 10.145.2.15:6806/21826 failed (2 reporters from different host after 21.613336 >= grace 20.000000)
2016-06-19 09:38:36.951133 mon.0 10.145.2.13:6789/0 244980 : cluster [INF] osd.31 10.145.2.15:6812/21838 failed (2 reporters from different host after 21.613781 >= grace 20.836511)
2016-06-19 09:38:36.951636 mon.0 10.145.2.13:6789/0 244982 : cluster [INF] osd.36 10.145.2.15:6802/31359 failed (2 reporters from different host after 21.614259 >= grace 20.000000)

2016-06-19 09:38:37.156088 osd.36 10.145.2.15:6802/31359 346 : cluster [WRN] map e28730 wrongly marked me down
2016-06-19 09:38:36.002076 osd.6 10.145.2.15:6801/21836 473 : cluster [WRN] map e28729 wrongly marked me down
2016-06-19 09:38:37.046885 osd.22 10.145.2.15:6806/21826 374 : cluster [WRN] map e28730 wrongly marked me down
2016-06-19 09:38:37.050635 osd.31 10.145.2.15:6812/21838 351 : cluster [WRN] map e28730 wrongly marked me down

But shortly after they were marked out anyway:

2016-06-19 09:43:39.940985 mon.0 10.145.2.13:6789/0 245305 : cluster [INF] osd.6 out (down for 303.951251)
2016-06-19 09:43:39.941061 mon.0 10.145.2.13:6789/0 245306 : cluster [INF] osd.22 out (down for 302.908528)
2016-06-19 09:43:39.941099 mon.0 10.145.2.13:6789/0 245307 : cluster [INF] osd.31 out (down for 302.908527)
2016-06-19 09:43:39.941152 mon.0 10.145.2.13:6789/0 245308 : cluster [INF] osd.36 out (down for 302.908527)

2016-06-19 10:09:10.648924 mon.0 10.145.2.13:6789/0 247076 : cluster [INF] osd.23 10.145.2.15:6814/21852 failed (2 reporters from different host after 20.000378 >= grace 20.000000)
2016-06-19 10:09:10.887220 osd.23 10.145.2.15:6814/21852 176 : cluster [WRN] map e28848 wrongly marked me down
2016-06-19 10:14:15.160513 mon.0 10.145.2.13:6789/0 247422 : cluster [INF] osd.23 out (down for 304.288018)

By the time the issue was eventually escalated and I was able to do something about it, I manually marked the remaining OSDs on that host down (which seemed to unclog RBD):

2016-06-19 15:25:06.171395 mon.0 10.145.2.13:6789/0 267212 : cluster [INF] osd.7 10.145.2.15:6808/21837 failed (2 reporters from different host after 22.000367 >= grace 20.000000)
2016-06-19 15:25:06.171905 mon.0 10.145.2.13:6789/0 267214 : cluster [INF] osd.24 10.145.2.15:6800/21813 failed (2 reporters from different host after 22.000748 >= grace 20.710981)
2016-06-19 15:25:06.172426 mon.0 10.145.2.13:6789/0 267216 : cluster [INF] osd.37 10.145.2.15:6810/31936 failed (2 reporters from different host after 22.001167 >= grace 20.000000)
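
(For reference, the manual marking was roughly along these lines with the standard CLI - a sketch, using the OSD IDs from the log above:)

    # force the mons to mark the remaining OSDs on the hung host down;
    # they cannot dispute it because the host is unresponsive
    ceph osd down 7 24 37
    # optionally mark them out as well so recovery/backfill starts immediately
    ceph osd out 7 24 37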

The question I have is: why did these 3 OSDs stay up and in the cluster despite being unresponsive for over 5 hours?  The CRUSH map for all pools uses hosts as the failure domain, so I would have expected OSDs on other hosts to notice them as unresponsive and report them to the monitors.  In the OSD logs nothing was logged in the hour prior to the failure, and the OSDs on other hosts seem to have noticed all of the failed host's other OSDs timing out, but towards the 3 that stayed up they appeared to be actively attempting backfills.
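
(As far as I understand it, the detection behaviour here is governed by the settings below - shown with what I believe are the Jewel defaults, which match the grace/reporter values and the ~300s down-to-out delay in the logs above; please correct me if I have these wrong:)

    [global]
    # peers report an OSD to the mons once its heartbeats have failed for this long
    osd heartbeat grace = 20
    # number of distinct reporters the mons require before marking an OSD down
    mon osd min down reporters = 2
    # reporters are counted at this CRUSH level ("2 reporters from different host")
    mon osd reporter subtree level = host
    # a down OSD is marked out after this many seconds
    mon osd down out interval = 300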

Any ideas on how I can improve detection of this condition?

Cheers,
 Adrian




