On Fri, Jan 31, 2020 at 6:32 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Fri, Jan 31, 2020 at 4:57 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > Hi Ilya,
> >
> > On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > >
> > > On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We are quite regularly (a couple of times per week) seeing:
> > > >
> > > > HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
> > > > MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
> > > >     mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond to capability release client_id: 52919162
> > > > MDS_SLOW_REQUEST 1 MDSs report slow requests
> > > >     mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs
> > > >
> > > > This is being caused by osdc ops stuck in a kernel client, e.g.:
> > > >
> > > > 10:57:18 root hpc-be028 /root
> > > > → cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
> > > > REQUESTS 9 homeless 0
> > > > 46559317  osd243  3.ee6ffcdb  3.cdb  [243,501,92]/243   [243,501,92]/243   e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057  0x400014  1  read
> > > > 46559322  osd243  3.ee6ffcdb  3.cdb  [243,501,92]/243   [243,501,92]/243   e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057  0x400014  1  read
> > > > 46559323  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
> > > > 46559341  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
> > > > 46559342  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
> > > > 46559345  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
> > > > 46559621  osd243  3.6313e8ef  3.8ef  [243,330,521]/243  [243,330,521]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a  0x400014  1  read
> > > > 46559629  osd243  3.b280c852  3.852  [243,113,539]/243  [243,113,539]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f  0x400014  1  read
> > > > 46559928  osd243  3.1ee7bab4  3.ab4  [243,332,94]/243   [243,332,94]/243   e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f  0x400024  1  write
> > > > LINGER REQUESTS
> > > > BACKOFFS
> > > >
> > > > We can unblock those requests by doing `ceph osd down osd.243` (or
> > > > restarting osd.243).
> > > >
> > > > This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64.
> > > >
> > > > Is there a better way to debug this?
> > >
> > > Hi Dan,
> > >
> > > I assume that these ops don't show up as slow requests on the OSD side?
> > > How long did you see it stuck for before intervening?
> >
> > That's correct -- the osd had no active ops (ceph daemon.... ops).
> >
> > The late release slow req was stuck for 4129s before we intervened.
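For anyone hitting the same warning, this is roughly the cross-check we run before
kicking the OSD -- only a sketch, assuming debugfs is mounted on the client and you
can reach the admin socket on the OSD host; the fsid/client id and osd.243 below are
simply the ones from the example above:

  # On the client: list the stuck objecter requests and the OSD they target.
  cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc

  # On the host running osd.243: check whether the OSD thinks it has those ops
  # at all (in our case it reports nothing in flight).
  ceph daemon osd.243 ops
  ceph daemon osd.243 dump_historic_ops

  # Workaround: mark the OSD down so the client resends the stuck requests.
  ceph osd down osd.243
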
> >
> > > Do you happen to have "debug ms = 1" logs from osd243?
> >
> > Nope, but I can try to get it afterwards next time. (Though you need
> > it at the moment the ops get stuck, not only from the moment we notice
> > the stuck ops, right?)
>
> Yes, starting before the moment the ops get stuck and ending after you
> kick the OSD.
>
> > > Do you have PG autoscaler enabled? Any PG splits and/or merges at the time?
> >
> > Not on the cephfs_(meta)data pools (though on the 30th I increased
> > those pool sizes from 2 to 3). Also on the 30th I did some PG
> > merging on an unrelated test pool.
> > In any case, we have seen this type of lockup in the past without those
> > pool changes (also with the mimic MDS, until we upgraded to nautilus).
>
> The MDS is out of the picture here. This issue is between the kernel
> client and the OSD.
>
> > Looking back further in the client's kernel log we see a page alloc
> > failure on the 30th:
> >
> > Jan 30 16:16:35 hpc-be028.cern.ch kernel: kworker/1:36: page allocation failure: order:5, mode:0x104050
> > Jan 30 16:16:35 hpc-be028.cern.ch kernel: CPU: 1 PID: 78445 Comm: kworker/1:36 Kdump: loaded Tainted: P
> > Jan 30 16:16:35 hpc-be028.cern.ch kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
>
> Can you share the stack trace? That's a 128k allocation, so worth
> taking a look.
>

Pasted here: https://pastebin.com/neyah54k

The same node had a lockup again last night, and we've included the page
alloc failure and a resulting dump of osdmap "corruption".

-- Dan

> Thanks,
>
> Ilya
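P.S. A rough sketch of what we plan to run ahead of the next occurrence, so the
"debug ms = 1" log covers the window before the ops get stuck (standard nautilus
CLI assumed; osd.243 is the OSD from the example above):

  # Pre-arm verbose messenger logging on the suspect OSD (this is chatty,
  # so only leave it on while waiting for a recurrence).
  ceph config set osd.243 debug_ms 1
  # or, without touching the stored config:
  ceph tell osd.243 injectargs '--debug_ms 1'

  # Double-check whether the autoscaler could be splitting/merging any pools.
  ceph osd pool autoscale-status

  # Once a hang (and the OSD kick) has been captured in the log, revert.
  ceph config rm osd.243 debug_ms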