OSDs flapping since upgrade to 14.2.10

Hi list,

Since our upgrade from 14.2.9 to 14.2.10, we have been observing flapping OSDs:
* The mons claim every few minutes:
2020-08-07 09:49:09.783648 osd.243 (osd.243) 246 : cluster [WRN] Monitor daemon marked osd.243 down, but it is still running
2020-08-07 10:04:40.753704 osd.243 (osd.243) 248 : cluster [WRN] Monitor daemon marked osd.243 down, but it is still running
2020-08-07 10:07:21.187945 osd.253 (osd.253) 469 : cluster [WRN] Monitor daemon marked osd.253 down, but it is still running

2020-08-07 10:04:35.440547 mon.cephmon01 (mon.0) 390132 : cluster [DBG] osd.243 reported failed by osd.33
2020-08-07 10:04:35.508412 mon.cephmon01 (mon.0) 390133 : cluster [DBG] osd.243 reported failed by osd.187
2020-08-07 10:04:35.508529 mon.cephmon01 (mon.0) 390134 : cluster [INF] osd.243 failed (root=default,datacenter=of,row=row-of-02,host=cephosd16) (2 reporters from different host after 44.000150 >= grace 25.935545)
2020-08-07 10:04:35.695171 mon.cephmon01 (mon.0) 390135 : cluster [DBG] osd.243 reported failed by osd.203
2020-08-07 10:04:35.771704 mon.cephmon01 (mon.0) 390136 : cluster [DBG] osd.243 reported failed by osd.163
2020-08-07 10:04:41.588530 mon.cephmon01 (mon.0) 390148 : cluster [INF] osd.243 [v2:10.198.10.16:6882/6611,v1:10.198.10.16:6885/6611] boot
2020-08-07 10:04:40.753704 osd.243 (osd.243) 248 : cluster [WRN] Monitor daemon marked osd.243 down, but it is still running
2020-08-07 10:04:40.753712 osd.243 (osd.243) 249 : cluster [DBG] map e2683535 wrongly marked me down at e2683534
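
For reference, the grace and the "2 reporters" threshold in the "osd.243 failed" line above should be governed by osd_heartbeat_grace and mon_osd_min_down_reporters; the effective values can be checked with e.g.

  ceph config get osd osd_heartbeat_grace
  ceph config get mon mon_osd_min_down_reporters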

osd.33 says:
2020-08-07 10:04:35.437 7fcaaa4f3700 -1 osd.33 2683533 heartbeat_check: no reply from 10.198.10.16:6802 osd.243 since back 2020-08-07 10:03:51.223911 front 2020-08-07 10:03:51.224322 (oldest deadline 2020-08-07 10:04:35.322704)

osd.243 says:
2020-08-07 10:03:55.065 7f0d33911700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
2020-08-07 10:03:55.065 7f0d34112700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
[.. ~3000(!) Lines ..]
2020-08-07 10:04:33.644 7f0d33110700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
2020-08-07 10:04:33.688 7f0d13acb700  0 bluestore(/var/lib/ceph/osd/ceph-243) log_latency_fn slow operation observed for upper_bound, latency = 20.9013s, after =  omap_iterator(cid = 19.58a_head, oid = #19:51a21a27:::.dir.default.223091333.1.3:head#)
2020-08-07 10:04:33.688 7f0d13acb700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
2020-08-07 10:04:40.748 7f0d2279b700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.243 down, but it is still running
2020-08-07 10:04:40.748 7f0d2279b700  0 log_channel(cluster) log [DBG] : map e2683535 wrongly marked me down at e2683534


* As a consequence, old deep scrubs did not finish because they kept being interrupted -> 'pgs not deep-scrubbed in time'

For the latter, I increased the op thread timeout back to 30, the value it had before 12(!).2.11.
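
Concretely, something along these lines (assuming the option behind the "OSD::osd_op_tp ... had timed out after 15" messages is osd_op_thread_timeout; the exact way of setting it may differ in your setup):

  ceph config set osd osd_op_thread_timeout 30
  # optionally push it to the running OSDs as well:
  ceph tell osd.* injectargs '--osd_op_thread_timeout 30'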

I am not sure whether we really have a problem, but it does not look healthy.

Any ideas, thoughts?

regards,
Ingo

-- 
Ingo Reimann 
Teamleiter Technik
[ https://www.dunkel.de/ ] 
Dunkel GmbH 
Philipp-Reis-Straße 2 
65795 Hattersheim 
Fon: +49 6190 889-100 
Fax: +49 6190 889-399 
eMail: support@xxxxxxxxx 
https://www.Dunkel.de/ 	Amtsgericht Frankfurt/Main 
HRB: 37971 
Geschäftsführer: Axel Dunkel 
Ust-ID: DE 811622001
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



