Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

May I ask which versions are affected by this bug, and which versions are going to receive backports?

best regards,

samuel



huxiaoyu@xxxxxxxxxxxx
 
From: Sage Weil
Date: 2021-11-18 22:02
To: Manuel Lausch; ceph-users
Subject:  Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart
Okay, good news: on the osd start side, I identified the bug (and easily
reproduced locally).  The tracker and fix are:
 
https://tracker.ceph.com/issues/53326
https://github.com/ceph/ceph/pull/44015
 
These will take a while to work through QA and get backported.
 
Also, to reiterate what I said on the call earlier today about the osd
stopping issues:
- A key piece of the original problem you were seeing was that
require_osd_release wasn't up to date, which meant that the dead_epoch
metadata wasn't encoded in the OSDMap and we would basically *always* go
into the read lease wait when an OSD stopped.
- Now that that is fixed, it appears as though setting both
osd_fast_shutdown *and* osd_fast_shutdown_notify_mon is the winning
combination.
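 
For reference, the relevant checks and settings can be applied with the
usual commands, e.g. (illustrative only, adapt to your cluster):
 
  ceph osd dump | grep require_osd_release
  ceph osd require-osd-release octopus
  ceph config set osd osd_fast_shutdown true
  ceph config set osd osd_fast_shutdown_notify_mon true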
 
I would be curious to hear if adjusting the icmp throttle kernel setting
makes things behave better when osd_fast_shutdown_notify_mon=false (the
default), but this is more out of curiosity--I think we've concluded that
we should set this option to true by default.
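 
Assuming the throttle in question is the kernel's standard ICMP rate
limiting (the exact knob isn't spelled out above, so take the names here
as an illustration), the settings to look at would be something like:
 
  sysctl net.ipv4.icmp_ratelimit
  sysctl net.ipv4.icmp_ratemask
  sysctl -w net.ipv4.icmp_ratelimit=0    # relax the limit for a test run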
 
If I'm missing anything, please let me know!
 
Thanks for your patience in tracking this down.  It's always a bit tricky
when there are multiple contributing factors (in this case, at least 3).
 
sage
 
 
 
On Tue, Nov 16, 2021 at 9:42 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
 
> On Tue, Nov 16, 2021 at 8:30 AM Manuel Lausch <manuel.lausch@xxxxxxxx>
> wrote:
>
>> Hi Sage,
>>
>> It's still the same cluster we talked about; I only upgraded it from
>> 16.2.5 to 16.2.6.
>>
>> I enabled fast shutdown again and did some tests with debug
>> logging enabled.
>> osd_fast_shutdown            true
>> osd_fast_shutdown_notify_mon false
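>>
>> The effective values on a running daemon can be double-checked with
>> something like:
>>   ceph config get osd osd_fast_shutdown
>>   ceph config get osd osd_fast_shutdown_notify_mon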
>>
>> The logs are here:
>> ceph-post-file: 59325568-719c-4ec9-b7ab-945244fcf8ae
>>
>>
>> I took 3 tests.
>>
>> First I stopped OSD 122 again at 14:22:40 and started it again at
>> 14:23:40.
>> Stopping now worked without issue, but on starting I got 3 slow
>> ops.
>>
>> Then at 14:25:00 I stopped all OSDs (systemctl stop ceph-osd.target) on
>> the host "csdeveubs-u02c01b01". Surprisingly there were no slow ops on
>> stopping either, but there were again on startup at 14:26:00.
>>
>> At 14:28:00 I again stopped all OSDs on host csdeveubs-u02c01b05. This
>> time I got some slow ops while stopping too.
>>
>>
>> As far as I understand, Ceph skips the read lease time if an OSD is
>> "dead" but not if it is only down. This is because we do not know for
>> sure whether a down OSD is really gone and cannot answer reads anymore,
>> right?
>>
>
> Exactly.
>
>
>> If an OSD announces its shutdown to the mon, the cluster marks it as
>> down. Can we not assume deadness in this case as well?
>> Maybe this would help me in the stopping case.
>>
>
> It could, but that's not how the shutdown process currently works. It
> requests that the mon mark it down, but continues servicing IO until it is
> actually marked down.
>
>
>> The starting case will still be an issue.
>
>
> Yes.  I suspect the root cause(s) there are a bit more complicated--I'll
> take a look at the logs today.
>
> Thanks!
> sage
>
>
>
>>
>>
>>
>> Thanks a lot
>> Manuel
>>
>>
>>
>> On Mon, 15 Nov 2021 17:32:24 -0600
>> Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>
>> > Okay, I traced one slow op through the logs, and the problem was that
>> > the PG was laggy.  That happened because of the osd.122 that you
>> > stopped, which was marked down in the OSDMap but *not* dead.  It
>> > looks like that happened because the OSD took the 'clean shutdown'
>> > path instead of the fast stop.
>> >
>> > Have you tried enabling osd_fast_shutdown = true *after* you fixed the
>> > require_osd_release to octopus?   It would have led to slow requests
>> > when you tested before because the new dead_epoch field in the OSDMap
>> > that the read leases rely on was not being encoded, making peering
>> > wait for the read lease to time out even though the stopped osd
>> > really died.
>> >
>> > I'm not entirely sure if this is the same cluster as the earlier
>> > one.. but given the logs you sent, my suggestion is to enable
>> > osd_fast_shutdown and try again.  If you still get slow requests, can
>> > you capture the logs again?
>> >
>> > Thanks!
>> > sage
>> >
>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
 


