Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

Will it be available in 15.2.16?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

On 2021. Nov 18., at 23:12, Sage Weil <sage@xxxxxxxxxxxx> wrote:


It looks like the bug has been there since the read leases were introduced,
which I believe was in octopus (15.2.z).

s

On Thu, Nov 18, 2021 at 3:55 PM huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx>
wrote:

May I ask which versions are affected by this bug, and which versions are
going to receive backports?

best regards,

samuel

------------------------------
huxiaoyu@xxxxxxxxxxxx


*From:* Sage Weil <sage@xxxxxxxxxxxx>
*Date:* 2021-11-18 22:02
*To:* Manuel Lausch <manuel.lausch@xxxxxxxx>; ceph-users
<ceph-users@xxxxxxx>
*Subject:*  Re: OSD spend too much time on "waiting for
readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart
Okay, good news: on the osd start side, I identified the bug (and easily
reproduced it locally).  The tracker and fix are:

https://tracker.ceph.com/issues/53326
https://github.com/ceph/ceph/pull/44015

These will take a while to work through QA and get backported.

Also, to reiterate what I said on the call earlier today about the osd
stopping issues:
- A key piece of the original problem you were seeing was that
require_osd_release wasn't up to date, which meant that the dead_epoch
metadata wasn't encoded in the OSDMap and we would basically *always* go
into the read lease wait when an OSD stopped.
- Now that that is fixed, it appears as though setting both
osd_fast_shutdown *and* osd_fast_shutdown_notify_mon is the winning
combination; see the example just below this list.
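
If it helps, a minimal sketch of applying that combination through the
config database (double-check the option names against your release):

  # the "winning combination" for the OSD stopping case
  ceph config set osd osd_fast_shutdown true
  ceph config set osd osd_fast_shutdown_notify_mon true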

I would be curious to hear if adjusting the icmp throttle kernel setting
makes things behave better when osd_fast_shutdown_notify_mon=false (the
default), but this is more out of curiosity--I think we've concluded that
we should set this option to true by default.
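
(For reference -- and treat the exact sysctl name here as my guess rather
than a recommendation -- relaxing the kernel's ICMP rate limiting would
look something like:

  sysctl net.ipv4.icmp_ratelimit        # show the current per-target limit
  sysctl -w net.ipv4.icmp_ratelimit=0   # 0 disables the rate limiting

or net.ipv4.icmp_msgs_per_sec for the global limit.)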

If I'm missing anything, please let me know!

Thanks for your patience in tracking this down.  It's always a bit tricky
when there are multiple contributing factors (in this case, at least 3).

sage



On Tue, Nov 16, 2021 at 9:42 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:

On Tue, Nov 16, 2021 at 8:30 AM Manuel Lausch <manuel.lausch@xxxxxxxx>
wrote:

Hi Sage,

It's still the same cluster we talked about. I only upgraded it from
16.2.5 to 16.2.6.

I enabled fast shutdown again and did some tests with debug
logging enabled.
osd_fast_shutdown            true
osd_fast_shutdown_notify_mon false

The logs are here:
ceph-post-file: 59325568-719c-4ec9-b7ab-945244fcf8ae
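
(For anyone trying to reproduce this: OSD debug logging can be raised with
something along these lines -- the levels shown are only an example, not
necessarily what I used:

  ceph config set osd debug_osd 20
  ceph config set osd debug_ms 1

and the resulting log files can be uploaded with ceph-post-file <logfile>.)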


I ran 3 tests.

First, I stopped OSD 122 again at 14:22:40 and started it again at
14:23:40. Stopping worked without issue this time, but on starting I got
3 slow ops.

Then at 14:25:00 I stopped all OSDs (systemctl stop ceph-osd.target) on
the host "csdeveubs-u02c01b01". Surprisingly there were no slow ops on
stopping here either, but there were again on startup at 14:26:00.

At 14:28:00 I again stopped all OSDs on host csdeveubs-u02c01b05. This
time I got some slow ops while stopping too.


As far as I understand, Ceph skips the read lease wait if an OSD is
"dead" but not if it is only down. This is because we do not know for
sure whether a down OSD is really gone and can no longer answer reads,
right?


Exactly.
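
(For context: the wait you are hitting for a merely-down OSD is the read
lease interval which, if I am remembering the knobs correctly, is roughly
osd_heartbeat_grace multiplied by the pool read lease ratio.  The ratio
can be checked with something like:

  ceph config get osd osd_pool_default_read_lease_ratio

and should default to 0.8.)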


If an OSD announces its shutdown to the mon, the cluster marks it as
down. Can we not assume it is dead in this case as well?
Maybe this would help me in the stopping case.


It could, but that's not how the shutdown process currently works.  It
requests that the mon mark it down, but continues servicing IO until it
is actually marked down.


The starting case will still be an issue.


Yes.  I suspect the root cause(s) there are a bit more complicated--I'll
take a look at the logs today.

Thanks!
sage






Thanks a lot
Manuel



On Mon, 15 Nov 2021 17:32:24 -0600
Sage Weil <sage@xxxxxxxxxxxx> wrote:

Okay, I traced one slow op through the logs, and the problem was that
the PG was laggy.  That happened because of osd.122, the one you
stopped, which was marked down in the OSDMap but *not* dead.  It
looks like that happened because the OSD took the 'clean shutdown'
path instead of the fast stop.

Have you tried enabling osd_fast_shutdown = true *after* you fixed
require_osd_release to octopus?  It would have led to slow requests
when you tested before because the new dead_epoch field in the OSDMap
that the read leases rely on was not being encoded, making peering
wait for the read lease to time out even though the stopped osd
really died.
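
(To double-check that piece, the release gate can be inspected and, if
needed, bumped with something like the following -- commands from memory,
so verify against the docs for your version:

  ceph osd dump | grep require_osd_release
  ceph osd require-osd-release octopus

I'd expect it to show "require_osd_release octopus" once it is correct.)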

I'm not entirely sure if this is the same cluster as the earlier
one.. but given the logs you sent, my suggestion is to enable
osd_fast_shutdown and try again.  If you still get slow requests, can
you capture the logs again?

Thanks!
sage



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



