Re: kernel client osdc ops stuck and mds slow reqs

On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi all,
>
> We are quite regularly (a couple times per week) seeing:
>
> HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
> MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
>     mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond to capability release client_id: 52919162
> MDS_SLOW_REQUEST 1 MDSs report slow requests
>     mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs
>
> This is being caused by osdc ops stuck in a kernel client, e.g.:
>
> 10:57:18 root hpc-be028 /root
> → cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
> REQUESTS 9 homeless 0
> 46559317    osd243    3.ee6ffcdb    3.cdb    [243,501,92]/243    [243,501,92]/243    e678697    fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057    0x400014    1    read
> 46559322    osd243    3.ee6ffcdb    3.cdb    [243,501,92]/243    [243,501,92]/243    e678697    fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057    0x400014    1    read
> 46559323    osd243    3.969cc573    3.573    [243,330,226]/243    [243,330,226]/243    e678697    fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056    0x400014    1    read
> 46559341    osd243    3.969cc573    3.573    [243,330,226]/243    [243,330,226]/243    e678697    fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056    0x400014    1    read
> 46559342    osd243    3.969cc573    3.573    [243,330,226]/243    [243,330,226]/243    e678697    fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056    0x400014    1    read
> 46559345    osd243    3.969cc573    3.573    [243,330,226]/243    [243,330,226]/243    e678697    fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056    0x400014    1    read
> 46559621    osd243    3.6313e8ef    3.8ef    [243,330,521]/243    [243,330,521]/243    e678697    fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a    0x400014    1    read
> 46559629    osd243    3.b280c852    3.852    [243,113,539]/243    [243,113,539]/243    e678697    fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f    0x400014    1    read
> 46559928    osd243    3.1ee7bab4    3.ab4    [243,332,94]/243    [243,332,94]/243    e678697    fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f    0x400024    1    write
> LINGER REQUESTS
> BACKOFFS
>
>
> We can unblock those requests by doing `ceph osd down osd.243` (or
> restarting osd.243).
>
> This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64.
>
> Is there a better way to debug this?

Hi Dan,

I assume that these ops don't show up as slow requests on the OSD side?
How long did you see it stuck for before intervening?
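
If it reproduces, something like this on the node hosting osd.243
should show whether the OSD side is tracking those ops at all (the
client's requests would appear with reqids of the form
client.52919162.0:<tid>):

    ceph daemon osd.243 dump_ops_in_flight
    ceph daemon osd.243 dump_historic_ops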

Do you happen to have "debug ms = 1" logs from osd243?
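
If not, it can be bumped at runtime (note that this makes the OSD
log very chatty):

    ceph tell osd.243 injectargs '--debug-ms 1'
    # reproduce the stuck op, grab the log, then revert:
    ceph tell osd.243 injectargs '--debug-ms 0'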

Do you have PG autoscaler enabled?  Any PG splits and/or merges at the
time?
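
Both are easy to check, e.g.:

    ceph osd pool autoscale-status
    # pg_num (and any pending pg_num_target) for the data pool:
    ceph osd pool ls detail

(pool 3 being the pool the stuck reads map to, judging by the pgids
in the osdc output.)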

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



