Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

Peter Lieven <pl@xxxxxxx> · Tue, 2 Nov 2021 15:11:45 +0100

On 02.11.21 at 15:02, Sage Weil wrote:
On Tue, Nov 2, 2021 at 8:29 AM Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:

Hi Sage,

The "osd_fast_shutdown" option is set to "false".
When we upgraded to Luminous I also had blocked IO issues with this
enabled.

Some weeks ago I tried out the options "osd_fast_shutdown" and
"osd_fast_shutdown_notify_mon" and also got slow ops while
stopping/starting OSDs. But I didn't check whether this triggered the
problem with the read_leases or my old issue with the fast
shutdown.
Just to be clear, you should try
   osd_fast_shutdown = true
   osd_fast_shutdown_notify_mon = false

You write that if the OSD rejects messenger connections because it is
stopped, the peering process will skip the read_lease timeout. If the
OSD announces its shutdown, can we not skip this read_lease timeout as
well?

If memory serves, yes, but the notify_mon process can take more time than a
peer OSD getting ECONNREFUSED.  The combination above is the recommended
combination (and the default).
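
For reference, a minimal sketch of how the two settings named above could look in ceph.conf. The [osd] section placement is an assumption for illustration; the thread does not say how the options are being set on the clusters in question, and they may instead be managed through the monitor config database.

```ini
# Sketch only: assumes the options are set in ceph.conf under [osd].
[osd]
# Stop the OSD quickly, so peers see refused connections right away.
osd_fast_shutdown = true
# Do not wait to notify the monitors during fast shutdown.
osd_fast_shutdown_notify_mon = false
```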

When we faced this issue we had a fresh Octopus install with default values...

If necessary I can upgrade our development cluster to Octopus again and also
run some tests.

Best,

Peter
