Re: Ceph-fuse getting stuck with "currently failed to authpin local pins"

Might be the cause of ours too, as we also have quotas set for the affected directory tree, although we use multi-active MDS. 
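
In case it helps anyone to double-check on their side: as far as I know, the quota settings can be read back from any client via getfattr, e.g. (the path below is just an example):

    getfattr -n ceph.quota.max_bytes /mnt/cephfs/path/to/dir
    getfattr -n ceph.quota.max_files /mnt/cephfs/path/to/dir

A value of 0 (or the attribute not being present) should mean no quota is set on that directory.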


From: Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx>
Sent: Friday, 1 June 2018 11:33:38 AM
To: Yan, Zheng
Cc: Linh Vu; Ceph Users; Peter Wienemann
Subject: Re: Ceph-fuse getting stuck with "currently failed to authpin local pins"
 
Am 01.06.2018 um 02:59 schrieb Yan, Zheng:
> On Wed, May 30, 2018 at 5:17 PM, Oliver Freyermuth
> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>> Am 30.05.2018 um 10:37 schrieb Yan, Zheng:
>>> On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth
>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> in our case, there's only a single active MDS
>>>> (+1 standby-replay + 1 standby).
>>>> We also get the health warning in case it happens.
>>>>
>>>
>>> Were there "client.xxx isn't responding to mclientcaps(revoke)"
>>> warnings in the cluster log? Please send them to me if there were.
>>
>> Yes, indeed, I almost missed them!
>>
>> Here you go:
>>
>> ....
>> 2018-05-29 12:16:02.491186 mon.mon003 mon.0 10.161.8.40:6789/0 11177 : cluster [WRN] MDS health message (mds.0): Client XXXXXXX:XXXXXXX failing to respond to capability release
>> 2018-05-29 12:16:03.401014 mon.mon003 mon.0 10.161.8.40:6789/0 11178 : cluster [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
>> ....
>> 2018-05-29 12:16:00.567520 mds.mon001 mds.0 10.161.8.191:6800/3068262341 15745 : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 0x10000388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 63.908382 seconds ago
>> ....
>> <repetition of message with increasing delays in between>
>> ....
>> 2018-05-29 16:31:00.899416 mds.mon001 mds.0 10.161.8.191:6800/3068262341 17169 : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 0x10000388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 15364.240272 seconds ago
>> ....
>>
>> After evicting the client, I also get:
>> 2018-05-29 17:00:00.000134 mon.mon003 mon.0 10.161.8.40:6789/0 11293 : cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
>> 2018-05-29 17:09:50.964730 mon.mon003 mon.0 10.161.8.40:6789/0 11297 : cluster [INF] MDS health message cleared (mds.0): Client XXXXXXX:XXXXXXX failing to respond to capability release
>> 2018-05-29 17:09:50.964767 mon.mon003 mon.0 10.161.8.40:6789/0 11298 : cluster [INF] MDS health message cleared (mds.0): 123 slow requests are blocked > 30 sec
>> 2018-05-29 17:09:51.015071 mon.mon003 mon.0 10.161.8.40:6789/0 11299 : cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release)
>> 2018-05-29 17:09:51.015154 mon.mon003 mon.0 10.161.8.40:6789/0 11300 : cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
>> 2018-05-29 17:09:51.015191 mon.mon003 mon.0 10.161.8.40:6789/0 11301 : cluster [INF] Cluster is now healthy
>> 2018-05-29 17:14:26.178321 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 8 : cluster [WRN]  replayed op client.1495010:32710304,32710299 used ino 0x100003909d0 but session next is 0x10000388af6
>> 2018-05-29 17:14:26.178393 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 9 : cluster [WRN]  replayed op client.1495010:32710306,32710299 used ino 0x100003909d1 but session next is 0x10000388af6
>> 2018-05-29 18:00:00.000132 mon.mon003 mon.0 10.161.8.40:6789/0 11304 : cluster [INF] overall HEALTH_OK
>>
>> Thanks for looking into it!
>>
>> Cheers,
>>         Oliver
>>
>>
>
> I found the cause of your issue. http://tracker.ceph.com/issues/24369

Wow, many thanks!
I have not yet managed to reproduce the stuck behaviour, since the user who could reliably trigger it made use of the national holiday around here.

But the issue seems extremely likely to be exactly that one - quotas are set for the affected directory tree.
Let me know if I should still ask him to reproduce and collect the information from the client to confirm.
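
If it does come to that, my rough plan would be to grab the client-side state via the ceph-fuse admin socket, something like (the socket path below is just a placeholder for our setup):

    ceph daemon /var/run/ceph/ceph-client.admin.asok status
    ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions
    ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache

plus "ceph daemon mds.<name> dump_ops_in_flight" on the active MDS while the requests are stuck - please let me know if other information would be more useful.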

Many thanks and cheers,
        Oliver

>
>>>
>>>> Cheers,
>>>> Oliver
>>>>
>>>> Am 30.05.2018 um 03:25 schrieb Yan, Zheng:
>>>>> It could be http://tracker.ceph.com/issues/24172
>>>>>
>>>>>
>>>>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu <vul@xxxxxxxxxxxxxx> wrote:
>>>>>> In my case, I have multiple active MDS (with directory pinning at the very
>>>>>> top level), and there would be a "Client xxx failing to respond to capability
>>>>>> release" health warning every single time that happens.
>>>>>>
>>>>>> ________________________________
>>>>>> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Yan, Zheng
>>>>>> <ukernel@xxxxxxxxx>
>>>>>> Sent: Tuesday, 29 May 2018 9:53:43 PM
>>>>>> To: Oliver Freyermuth
>>>>>> Cc: Ceph Users; Peter Wienemann
>>>>>> Subject: Re: Ceph-fuse getting stuck with "currently failed to
>>>>>> authpin local pins"
>>>>>>
>>>>>> Single or multiple active MDS? Were there "Client xxx failing to
>>>>>> respond to capability release" health warning?
>>>>>>
>>>>>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>>>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>> Dear Cephalopodians,
>>>>>>>
>>>>>>> we just had a "lockup" of many MDS requests, and trimming also fell
>>>>>>> behind, for over 2 days.
>>>>>>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>>>>>>> "currently failed to authpin local pins". Metadata pool usage did grow by 10
>>>>>>> GB in those 2 days.
>>>>>>>
>>>>>>> Rebooting the node to force a client eviction solved the issue, and now
>>>>>>> metadata usage is down again, and all stuck requests were processed quickly.
>>>>>>>
>>>>>>> Is there any idea what could cause something like that? On the client,
>>>>>>> there was no CPU load, but many processes were waiting for cephfs to respond.
>>>>>>> Syslog did not yield anything. It only affected one user and his user
>>>>>>> directory.
>>>>>>>
>>>>>>> If there are no ideas: How can I collect good debug information in case
>>>>>>> this happens again?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>         Oliver
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>
>>>>
>>
>>


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
