Manuel, if you check the ceph.log, can you see 'reported failed' OSD entries?

For me, a manual offline compaction helped on those OSDs that normally went down with laggy slow ops. Now only the RGW goes down because of laggy slow ops, while the OSDs stay up (based on four days of results).

Istvan Szabo
Senior Infrastructure Engineer
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx

On 4 Nov 2021, at 10:15, Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:

On Tue, 2 Nov 2021 09:02:31 -0500
Sage Weil <sage@xxxxxxxxxxxx> wrote:

> Just to be clear, you should try
>   osd_fast_shutdown = true
>   osd_fast_shutdown_notify_mon = false

I added some logs to the tracker ticket with these options set.

> > You write that if the OSD rejects messenger connections because it is
> > stopped, the peering process will skip the read_lease timeout. If the
> > OSD announces its shutdown, can we not skip this read_lease timeout
> > as well?
>
> If memory serves, yes, but the notify_mon process can take more time
> than a peer OSD getting ECONNREFUSED. The combination above is the
> recommended combination (and the default).

In my tests yesterday I again saw that it took about 2 seconds between stopping an OSD and the first blame in the ceph.log.
With the notification enabled, I got the down message immediately.
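
For anyone who wants to run the 'reported failed' check on the ceph.log quickly, here is a minimal sketch in Python. It assumes the cluster log message looks like "osd.<target> reported failed by osd.<reporter>" (optionally with "immediately" before "failed"); that wording is an assumption, so verify it against your own log first. The function name count_reported_failed is made up for this example.

```python
import re
from collections import Counter

# Assumed message shape (check against your own ceph.log):
#   "osd.<target> reported failed by osd.<reporter>"
#   "osd.<target> reported immediately failed by osd.<reporter>"
FAILED_RE = re.compile(
    r"\bosd\.(\d+) reported (?:immediately )?failed by osd\.(\d+)"
)


def count_reported_failed(lines):
    """Count failure reports per target OSD id (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            # group(1) is the OSD that was reported as failed
            counts[int(match.group(1))] += 1
    return counts
```

To use it, pass an open file object for the log, for example open("/var/log/ceph/ceph.log") if that is where your cluster log lives, and look at the result's most_common() to see which OSDs are blamed most often.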