On Tue, Nov 2, 2021 at 7:03 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Tue, Nov 2, 2021 at 8:29 AM Manuel Lausch <manuel.lausch@xxxxxxxx>
> wrote:
> >
> > Hi Sage,
> >
> > The "osd_fast_shutdown" is set to "false".
> > When we upgraded to Luminous I also had blocked IO issues with this
> > enabled.
> >
> > Some weeks ago I tried out the options "osd_fast_shutdown" and
> > "osd_fast_shutdown_notify_mon" and also got slow ops while
> > stopping/starting OSDs. But I didn't check whether this triggered the
> > problem with the read leases or my old issue with the fast shutdown.
>
> Just to be clear, you should try
>     osd_fast_shutdown = true
>     osd_fast_shutdown_notify_mon = false
>
> > You write that if the OSD rejects messenger connections because it is
> > stopped, the peering process will skip the read_lease timeout. If the
> > OSD announces its shutdown, can we not skip this read_lease timeout as
> > well?
>
> If memory serves, yes, but the notify_mon process can take more time than
> a peer OSD getting ECONNREFUSED. The combination above is the recommended
> combination (and the default).

Hmm, if the OSDs are detecting shutdown based on networking error codes,
could a networking configuration or security switch prevent them from
seeing the "correct" failure result?
-Greg

> > These days I will test the fast_shutdown switch again and will share
> > the corresponding logs with you.
>
> Thanks!
> sage
>
> > Best regards from Karlsruhe,
> > Manuel
> >
> > On Mon, 1 Nov 2021 15:55:35 -0500
> > Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >
> > > Hi Manuel,
> > >
> > > I'm looking at the ticket for this issue
> > > (https://tracker.ceph.com/issues/51463) and tried to reproduce. This
> > > was initially trivial to do with vstart (rados bench paused for many
> > > seconds after stopping an OSD), but it turns out that was because
> > > the vstart ceph.conf includes `osd_fast_shutdown = false`.
> > > Once I enabled that again (as it is by default on a normal cluster),
> > > I did not see any noticeable interruption when an OSD was stopped.
> > >
> > > Can you confirm what osd_fast_shutdown and
> > > osd_fast_shutdown_notify_mon are set to on your cluster?
> > >
> > > The intent is that when an OSD goes down, it will no longer accept
> > > messenger connection attempts, and peer OSDs will inform the monitor
> > > with a flag indicating the OSD is definitely dead (vs. slow or
> > > unresponsive). This allows the peering process to skip waiting for
> > > the read lease to time out. If you're seeing the laggy or 'waiting
> > > for readable' state, then that isn't happening... probably because
> > > the OSD shutdown isn't working as originally intended.
> > >
> > > If it's not one of those two options, maybe you can include a 'ceph
> > > config dump' (or just a list of the changed values, at least) so we
> > > can see what else might be affecting OSD shutdown...
> > >
> > > Thanks!
> > > sage
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
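For reference, the settings discussed in this thread can be inspected and
applied with the `ceph config` CLI. This is a sketch, assuming a cluster
recent enough (Nautilus or later) to have the centralized config database;
on older releases the options would instead go in ceph.conf:

```shell
# Show the current values of the two options Sage asks about
# (on recent releases osd_fast_shutdown defaults to true and
# osd_fast_shutdown_notify_mon to false, i.e. the recommended combination)
ceph config get osd osd_fast_shutdown
ceph config get osd osd_fast_shutdown_notify_mon

# Apply the recommended combination cluster-wide for OSDs
ceph config set osd osd_fast_shutdown true
ceph config set osd osd_fast_shutdown_notify_mon false

# List all values changed from their defaults, to spot anything
# else that might be affecting OSD shutdown behaviour
ceph config dump
```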