Hi,

Maybe this helps: you can increase the osd_op_tp thread timeouts in the ceph conf to something like:

[osd]
osd_op_thread_suicide_timeout = 900
osd_op_thread_timeout = 300
osd_recovery_thread_timeout = 300

Regards

-----Original Message-----
From: Ingo Reimann <ireimann@xxxxxxxxx>
Sent: Friday, August 7, 2020 12:08
To: ceph-users <ceph-users@xxxxxxx>
Subject: OSDs flapping since upgrade to 14.2.10

Hi list,

since our upgrade from 14.2.9 to 14.2.10 we have been observing flapping OSDs:

* The mons claim every few minutes:

2020-08-07 09:49:09.783648 osd.243 (osd.243) 246 : cluster [WRN] Monitor daemon marked osd.243 down, but it is still running
2020-08-07 10:04:40.753704 osd.243 (osd.243) 248 : cluster [WRN] Monitor daemon marked osd.243 down, but it is still running
2020-08-07 10:07:21.187945 osd.253 (osd.253) 469 : cluster [WRN] Monitor daemon marked osd.253 down, but it is still running
2020-08-07 10:04:35.440547 mon.cephmon01 (mon.0) 390132 : cluster [DBG] osd.243 reported failed by osd.33
2020-08-07 10:04:35.508412 mon.cephmon01 (mon.0) 390133 : cluster [DBG] osd.243 reported failed by osd.187
2020-08-07 10:04:35.508529 mon.cephmon01 (mon.0) 390134 : cluster [INF] osd.243 failed (root=default,datacenter=of,row=row-of-02,host=cephosd16) (2 reporters from different host after 44.000150 >= grace 25.935545)
2020-08-07 10:04:35.695171 mon.cephmon01 (mon.0) 390135 : cluster [DBG] osd.243 reported failed by osd.203
2020-08-07 10:04:35.771704 mon.cephmon01 (mon.0) 390136 : cluster [DBG] osd.243 reported failed by osd.163
2020-08-07 10:04:41.588530 mon.cephmon01 (mon.0) 390148 : cluster [INF] osd.243 [v2:10.198.10.16:6882/6611,v1:10.198.10.16:6885/6611] boot
2020-08-07 10:04:40.753704 osd.243 (osd.243) 248 : cluster [WRN] Monitor daemon marked osd.243 down, but it is still running
2020-08-07 10:04:40.753712 osd.243 (osd.243) 249 : cluster [DBG] map e2683535 wrongly marked me down at e2683534

osd.33 says:

2020-08-07 10:04:35.437 7fcaaa4f3700 -1 osd.33 2683533 heartbeat_check: no reply from 10.198.10.16:6802 osd.243 since back 2020-08-07 10:03:51.223911 front 2020-08-07 10:03:51.224322 (oldest deadline 2020-08-07 10:04:35.322704)

osd.243 says:

2020-08-07 10:03:55.065 7f0d33911700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
2020-08-07 10:03:55.065 7f0d34112700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
[.. ~3000(!) lines ..]
2020-08-07 10:04:33.644 7f0d33110700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
2020-08-07 10:04:33.688 7f0d13acb700 0 bluestore(/var/lib/ceph/osd/ceph-243) log_latency_fn slow operation observed for upper_bound, latency = 20.9013s, after = omap_iterator(cid = 19.58a_head, oid = #19:51a21a27:::.dir.default.223091333.1.3:head#)
2020-08-07 10:04:33.688 7f0d13acb700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
2020-08-07 10:04:40.748 7f0d2279b700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.243 down, but it is still running
2020-08-07 10:04:40.748 7f0d2279b700 0 log_channel(cluster) log [DBG] : map e2683535 wrongly marked me down at e2683534

* As a consequence, old deep scrubs did not finish because they kept being interrupted -> 'pgs not deep-scrubbed in time'

For the latter, I increased the op-thread-timeout back to the pre-12(!).2.11 value of 30.

I'm not sure whether we really have a problem, but it does not look healthy. Any ideas, thoughts?
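For reference, a minimal sketch of how that op-thread-timeout change can be applied, assuming a Nautilus (14.x) cluster; the option name is the one discussed above, and the value of 30 is only the example from the text:

  # persist the setting for all OSDs via ceph.conf (takes effect on the next OSD restart)
  [osd]
  osd_op_thread_timeout = 30

  # or set it in the monitor config store / inject it into the running OSDs without a restart
  ceph config set osd osd_op_thread_timeout 30
  ceph tell osd.* injectargs '--osd_op_thread_timeout=30'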
regards,
Ingo

--
Ingo Reimann
Team Lead Technology (Teamleiter Technik)
[ https://www.dunkel.de/ ]

Dunkel GmbH
Philipp-Reis-Straße 2
65795 Hattersheim
Phone: +49 6190 889-100
Fax: +49 6190 889-399
eMail: support@xxxxxxxxx
https://www.Dunkel.de/

Amtsgericht Frankfurt/Main HRB: 37971
Managing Director (Geschäftsführer): Axel Dunkel
VAT ID (Ust-ID): DE 811622001

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx