It looks like the bug has been there since the read leases were introduced,
which I believe was octopus (15.2.z).
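
For anyone who just wants to apply the settings discussed in the quoted
thread below, something along these lines should do it (a rough sketch, not
verified here -- double-check the option names against your release):

    # confirm require_osd_release was bumped after the upgrade; without it
    # the dead_epoch metadata is not encoded in the OSDMap and peering
    # always waits out the read lease when an OSD stops
    ceph osd dump | grep require_osd_release
    ceph osd require-osd-release octopus   # only if it still shows an older release

    # the combination that worked in this thread
    ceph config set osd osd_fast_shutdown true
    ceph config set osd osd_fast_shutdown_notify_mon true

The two osd_* options can equally be set in ceph.conf under [osd] instead of
the mon config database.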

s

On Thu, Nov 18, 2021 at 3:55 PM huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx> wrote:

> May I ask which versions are affected by this bug, and which versions are
> going to receive backports?
>
> best regards,
>
> samuel
>
> ------------------------------
> huxiaoyu@xxxxxxxxxxxx
>
>
> *From:* Sage Weil <sage@xxxxxxxxxxxx>
> *Date:* 2021-11-18 22:02
> *To:* Manuel Lausch <manuel.lausch@xxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
> *Subject:* Re: OSD spend too much time on "waiting for readable" ->
> slow ops -> laggy pg -> rgw stop -> worst case osd restart
>
> Okay, good news: on the osd start side, I identified the bug (and easily
> reproduced it locally). The tracker and fix are:
>
> https://tracker.ceph.com/issues/53326
> https://github.com/ceph/ceph/pull/44015
>
> These will take a while to work through QA and get backported.
>
> Also, to reiterate what I said on the call earlier today about the osd
> stopping issues:
> - A key piece of the original problem you were seeing was that
> require_osd_release wasn't up to date, which meant that the dead_epoch
> metadata wasn't encoded in the OSDMap and we would basically *always* go
> into the read lease wait when an OSD stopped.
> - Now that that is fixed, it appears as though setting both
> osd_fast_shutdown *and* osd_fast_shutdown_notify_mon is the winning
> combination.
>
> I would be curious to hear if adjusting the icmp throttle kernel setting
> makes things behave better when osd_fast_shutdown_notify_mon=false (the
> default), but this is more out of curiosity--I think we've concluded that
> we should set this option to true by default.
>
> If I'm missing anything, please let me know!
>
> Thanks for your patience in tracking this down. It's always a bit tricky
> when there are multiple contributing factors (in this case, at least 3).
>
> sage
>
>
> On Tue, Nov 16, 2021 at 9:42 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> > On Tue, Nov 16, 2021 at 8:30 AM Manuel Lausch <manuel.lausch@xxxxxxxx>
> > wrote:
> >
> >> Hi Sage,
> >>
> >> it's still the same cluster we talked about. I only upgraded it from
> >> 16.2.5 to 16.2.6.
> >>
> >> I enabled fast shutdown again and did some tests with debug logging
> >> enabled:
> >>   osd_fast_shutdown true
> >>   osd_fast_shutdown_notify_mon false
> >>
> >> The logs are here:
> >> ceph-post-file: 59325568-719c-4ec9-b7ab-945244fcf8ae
> >>
> >> I ran 3 tests.
> >>
> >> First, I stopped OSD 122 again at 14:22:40 and started it again at
> >> 14:23:40. Stopping now worked without issue, but on starting I got 3
> >> slow ops.
> >>
> >> Then at 14:25:00 I stopped all OSDs (systemctl stop ceph-osd.target) on
> >> the host "csdeveubs-u02c01b01". Surprisingly there were no slow ops
> >> here either, but there still were on startup at 14:26:00.
> >>
> >> At 14:28:00 I again stopped all OSDs on host csdeveubs-u02c01b05. This
> >> time I got some slow ops while stopping too.
> >>
> >> As far as I understand it, Ceph skips the read lease wait if an OSD is
> >> "dead" but not if it is only down, because we do not know for sure
> >> whether a down OSD is really gone and can no longer answer reads.
> >> Right?
> >
> > Exactly.
> >
> >> If an OSD announces its shutdown to the mon, the cluster marks it as
> >> down. Can we not assume deadness in this case as well? Maybe this
> >> would help me in the stopping case.
> >
> > It could, but that's not how the shutdown process currently works. It
> > requests that the mon mark it down, but continues servicing IO until it
> > is actually marked down.
> >
> >> The starting case will still be an issue.
> >
> > Yes. I suspect the root cause(s) there are a bit more complicated--I'll
> > take a look at the logs today.
> >
> > Thanks!
> > sage
> >
> >>
> >> Thanks a lot
> >> Manuel
> >>
> >> On Mon, 15 Nov 2021 17:32:24 -0600
> >> Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>
> >> > Okay, I traced one slow op through the logs, and the problem was that
> >> > the PG was laggy. That happened because of the osd.122 that you
> >> > stopped, which was marked down in the OSDMap but *not* dead. It
> >> > looks like that happened because the OSD took the 'clean shutdown'
> >> > path instead of the fast stop.
> >> >
> >> > Have you tried enabling osd_fast_shutdown = true *after* you fixed
> >> > require_osd_release to octopus? It would have led to slow requests
> >> > when you tested before because the new dead_epoch field in the OSDMap
> >> > that the read leases rely on was not being encoded, making peering
> >> > wait for the read lease to time out even though the stopped osd
> >> > really died.
> >> >
> >> > I'm not entirely sure whether this is the same cluster as the earlier
> >> > one, but given the logs you sent, my suggestion is to enable
> >> > osd_fast_shutdown and try again. If you still get slow requests, can
> >> > you capture the logs again?
> >> >
> >> > Thanks!
> >> > sage
> >>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
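
The "icmp throttle kernel setting" mentioned above presumably refers to the
kernel's ICMP rate-limiting sysctls; a minimal sketch for inspecting and
temporarily loosening them during such a shutdown test (the sysctl names are
an assumption based on the kernel's ip-sysctl documentation, not something
spelled out in the thread):

    # current ICMP rate limiting: per-destination interval (ms) plus the
    # global messages-per-second cap and burst
    sysctl net.ipv4.icmp_ratelimit net.ipv4.icmp_ratemask
    sysctl net.ipv4.icmp_msgs_per_sec net.ipv4.icmp_msgs_burst

    # temporarily raise the global cap for a shutdown test
    sysctl -w net.ipv4.icmp_msgs_per_sec=10000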