Re: kernel client osdc ops stuck and mds slow reqs

On Fri, Jan 31, 2020 at 4:57 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi Ilya,
>
> On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >
> > On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > Hi all,
> > >
> > > We are quite regularly (a couple times per week) seeing:
> > >
> > > HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
> > > report slow requests
> > > MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
> > >     mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond
> > > to capability release client_id: 52919162
> > > MDS_SLOW_REQUEST 1 MDSs report slow requests
> > >     mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs
> > >
> > > Which is being caused by osdc ops stuck in a kernel client, e.g.:
> > >
> > > 10:57:18 root hpc-be028 /root
> > > → cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
> > > REQUESTS 9 homeless 0
> > > 46559317    osd243    3.ee6ffcdb    3.cdb    [243,501,92]/243
> > > [243,501,92]/243    e678697
> > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057
> > >  0x400014    1    read
> > > 46559322    osd243    3.ee6ffcdb    3.cdb    [243,501,92]/243
> > > [243,501,92]/243    e678697
> > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057
> > >  0x400014    1    read
> > > 46559323    osd243    3.969cc573    3.573    [243,330,226]/243
> > > [243,330,226]/243    e678697
> > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056
> > >  0x400014    1    read
> > > 46559341    osd243    3.969cc573    3.573    [243,330,226]/243
> > > [243,330,226]/243    e678697
> > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056
> > >  0x400014    1    read
> > > 46559342    osd243    3.969cc573    3.573    [243,330,226]/243
> > > [243,330,226]/243    e678697
> > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056
> > >  0x400014    1    read
> > > 46559345    osd243    3.969cc573    3.573    [243,330,226]/243
> > > [243,330,226]/243    e678697
> > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056
> > >  0x400014    1    read
> > > 46559621    osd243    3.6313e8ef    3.8ef    [243,330,521]/243
> > > [243,330,521]/243    e678697
> > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a
> > >  0x400014    1    read
> > > 46559629    osd243    3.b280c852    3.852    [243,113,539]/243
> > > [243,113,539]/243    e678697
> > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f
> > >  0x400014    1    read
> > > 46559928    osd243    3.1ee7bab4    3.ab4    [243,332,94]/243
> > > [243,332,94]/243    e678697
> > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f
> > >  0x400024    1    write
> > > LINGER REQUESTS
> > > BACKOFFS
> > >
> > >
> > > We can unblock those requests by doing `ceph osd down osd.243` (or
> > > restarting osd.243).
> > >
> > > This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64.
> > >
> > > Is there a better way to debug this?
> >
> > Hi Dan,
> >
> > I assume that these ops don't show up as slow requests on the OSD side?
> > How long were they stuck before you intervened?
>
> That's correct -- the osd had no active ops (ceph daemon.... ops).
>
> The late release slow req was stuck for 4129s before we intervened.
>
> > Do you happen to have "debug ms = 1" logs from osd243?
>
> Nope, but I can try to capture it next time. (Though you need it from
> the moment the ops get stuck, not only from the moment we notice the
> stuck ops, right?)

Yes, starting before the moment the ops get stuck and ending after you
kick the OSD.
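
For reference, a rough sketch of how that could be captured (assuming
osd.243 is again the OSD involved and standard Nautilus tooling; adjust
the OSD id and paths to your deployment):

    # raise messenger debugging on the OSD before the ops get stuck
    ceph tell osd.243 injectargs '--debug_ms 1'
    # ... reproduce the hang, kick the OSD, then turn it back down
    ceph tell osd.243 injectargs '--debug_ms 0'

The messages land in the OSD's regular log, typically
/var/log/ceph/ceph-osd.243.log on the OSD host.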

>
> > Do you have PG autoscaler enabled?  Any PG splits and/or merges at the time?
>
> Not on the cephfs_(meta)data pools (though on the 30th I increased the
> size of those pools from 2 to 3). Also on the 30th I did some PG
> merging on an unrelated test pool.
> In any case, we have seen this type of lockup in the past without those
> pool changes (also with the mimic MDS, until we upgraded to nautilus).

The MDS is out of the picture here.  This issue is between the kernel
client and the OSD.
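
As a sketch of how to narrow that down (the fsid and client id below are
placeholders; the admin-socket commands are the standard OSD ones), one
can compare what the kernel client still considers in flight with what
the OSD is actually tracking:

    # on the client: ops the kernel osd client thinks are outstanding
    cat /sys/kernel/debug/ceph/<fsid>.client<id>/osdc
    # on the OSD host: ops the OSD itself knows about
    ceph daemon osd.243 dump_ops_in_flight
    ceph daemon osd.243 dump_historic_ops

Ops that sit in the client's osdc list but never appear on the OSD side
point at a request or reply lost between the two, which is why the
"debug ms = 1" log on the OSD is the interesting piece.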

>
> Looking back further in the client's kernel log we see a page alloc
> failure on the 30th:
>
> Jan 30 16:16:35 hpc-be028.cern.ch kernel: kworker/1:36: page
> allocation failure: order:5, mode:0x104050
> Jan 30 16:16:35 hpc-be028.cern.ch kernel: CPU: 1 PID: 78445 Comm:
> kworker/1:36 Kdump: loaded Tainted: P
> Jan 30 16:16:35 hpc-be028.cern.ch kernel: Workqueue: ceph-msgr
> ceph_con_workfn [libceph]

Can you share the stack trace?  That's a 128k allocation, so it's worth
taking a look.
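
If it has rotated out of dmesg, the full trace should still be in the
journal; a sketch, with the time window taken from the timestamps in the
snippet above:

    # kernel messages around the allocation failure, stack trace included
    journalctl -k --since "2020-01-30 16:10" --until "2020-01-30 16:20"
    # or, if it is still in the ring buffer
    dmesg | grep -A 40 'page allocation failure'

(order:5 with 4 KiB pages is 32 contiguous pages, i.e. the 128k
mentioned above, which can be hard to satisfy once memory is
fragmented.)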

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
