On Wed, May 30, 2018 at 5:17 PM, Oliver Freyermuth
<freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> On 30.05.2018 at 10:37, Yan, Zheng wrote:
>> On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth
>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> in our case, there's only a single active MDS
>>> (+1 standby-replay + 1 standby).
>>> We also get the health warning in case it happens.
>>>
>>
>> Were there "client.xxx isn't responding to mclientcaps(revoke)"
>> warnings in the cluster log? Please send them to me if there were.
>
> Yes, indeed, I almost missed them!
>
> Here you go:
>
> ....
> 2018-05-29 12:16:02.491186 mon.mon003 mon.0 10.161.8.40:6789/0 11177 : cluster [WRN] MDS health message (mds.0): Client XXXXXXX:XXXXXXX failing to respond to capability release
> 2018-05-29 12:16:03.401014 mon.mon003 mon.0 10.161.8.40:6789/0 11178 : cluster [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
> ....
> 2018-05-29 12:16:00.567520 mds.mon001 mds.0 10.161.8.191:6800/3068262341 15745 : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 0x10000388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 63.908382 seconds ago
> ....
> [repetition of the message with increasing delays in between]
> ....
> 2018-05-29 16:31:00.899416 mds.mon001 mds.0 10.161.8.191:6800/3068262341 17169 : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 0x10000388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 15364.240272 seconds ago
> ....

The client failed to release Fw.
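For readers not fluent in cap strings: the warning shows pending pAsLsXsFr while pAsLsXsFrw is issued, i.e. the MDS is trying to revoke Fw (the file-write cap) and the client never drops it. A minimal illustrative sketch (not Ceph code, just the usual CephFS cap-string notation, where an uppercase class letter carries a run of lowercase flags) for diffing two cap strings:

```python
# Illustration only: decode CephFS cap strings like "pAsLsXsFrw".
# Leading "p" is the pin cap; after that, each uppercase letter is a cap
# class (A=auth, L=link, X=xattr, F=file) and each following lowercase
# letter a flag of that class, e.g. "Frw" means Fr (read) and Fw (write).
import re

def parse_caps(s):
    """Split a cap string into a set of tokens, e.g. {'p', 'Fr', 'Fw'}."""
    caps = set()
    if s.startswith("p"):
        caps.add("p")
        s = s[1:]
    for cls, flags in re.findall(r"([A-Z])([a-z]*)", s):
        for flag in flags:
            caps.add(cls + flag)
    return caps

issued = parse_caps("pAsLsXsFrw")   # what the client currently holds
pending = parse_caps("pAsLsXsFr")   # what the MDS wants it reduced to
print(sorted(issued - pending))     # caps being revoked -> ['Fw']
```

The set difference makes the log line's diagnosis explicit: the only cap the MDS is waiting on is Fw.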
When it happens again, please check if there are hung osd requests
(ceph --admin-daemon=/var/run/ceph/ceph-client.admin.xxx.asok objecter_requests).

> After evicting the client, I also get:
> 2018-05-29 17:00:00.000134 mon.mon003 mon.0 10.161.8.40:6789/0 11293 : cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
> 2018-05-29 17:09:50.964730 mon.mon003 mon.0 10.161.8.40:6789/0 11297 : cluster [INF] MDS health message cleared (mds.0): Client XXXXXXX:XXXXXXX failing to respond to capability release
> 2018-05-29 17:09:50.964767 mon.mon003 mon.0 10.161.8.40:6789/0 11298 : cluster [INF] MDS health message cleared (mds.0): 123 slow requests are blocked > 30 sec
> 2018-05-29 17:09:51.015071 mon.mon003 mon.0 10.161.8.40:6789/0 11299 : cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release)
> 2018-05-29 17:09:51.015154 mon.mon003 mon.0 10.161.8.40:6789/0 11300 : cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
> 2018-05-29 17:09:51.015191 mon.mon003 mon.0 10.161.8.40:6789/0 11301 : cluster [INF] Cluster is now healthy
> 2018-05-29 17:14:26.178321 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 8 : cluster [WRN] replayed op client.1495010:32710304,32710299 used ino 0x100003909d0 but session next is 0x10000388af6
> 2018-05-29 17:14:26.178393 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 9 : cluster [WRN] replayed op client.1495010:32710306,32710299 used ino 0x100003909d1 but session next is 0x10000388af6
> 2018-05-29 18:00:00.000132 mon.mon003 mon.0 10.161.8.40:6789/0 11304 : cluster [INF] overall HEALTH_OK
>
> Thanks for looking into it!
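To make the suggested check easier to act on, here is a small sketch that filters a saved `objecter_requests` dump for long-outstanding requests. The JSON field names used below (`ops`, `tid`, `osd`, `age`) are assumptions about the dump format and should be verified against real output from your cluster before relying on them:

```python
# Sketch: scan the JSON printed by
#   ceph --admin-daemon=/var/run/ceph/ceph-client.admin.xxx.asok objecter_requests
# for requests outstanding longer than a threshold. Field names are
# assumptions about the dump format -- check against your own output.
import json

def hung_ops(dump_text, min_age=30.0):
    """Return (tid, osd, age) for every op older than min_age seconds."""
    dump = json.loads(dump_text)
    return [(op["tid"], op["osd"], op["age"])
            for op in dump.get("ops", [])
            if op.get("age", 0) > min_age]

# Hypothetical sample of the assumed shape:
sample = '{"ops": [{"tid": 42, "osd": 7, "age": 120.5}], "linger_ops": []}'
print(hung_ops(sample))  # [(42, 7, 120.5)]
```

An empty `ops` list here would point away from hung OSD I/O and back toward a client-side caps problem, which is what the thread is trying to distinguish.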
>
> Cheers,
> Oliver
>
>
>>
>>> Cheers,
>>> Oliver
>>>
>>> On 30.05.2018 at 03:25, Yan, Zheng wrote:
>>>> It could be http://tracker.ceph.com/issues/24172
>>>>
>>>>
>>>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu <vul@xxxxxxxxxxxxxx> wrote:
>>>>> In my case, I have multiple active MDS (with directory pinning at the very
>>>>> top level), and there would be a "Client xxx failing to respond to capability
>>>>> release" health warning every single time that happens.
>>>>>
>>>>> ________________________________
>>>>> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Yan, Zheng
>>>>> <ukernel@xxxxxxxxx>
>>>>> Sent: Tuesday, 29 May 2018 9:53:43 PM
>>>>> To: Oliver Freyermuth
>>>>> Cc: Ceph Users; Peter Wienemann
>>>>> Subject: Re: Ceph-fuse getting stuck with "currently failed to
>>>>> authpin local pins"
>>>>>
>>>>> Single or multiple active MDS? Were there "Client xxx failing to
>>>>> respond to capability release" health warnings?
>>>>>
>>>>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>> Dear Cephalopodians,
>>>>>>
>>>>>> we just had a "lockup" of many MDS requests, and trimming also fell
>>>>>> behind, for over 2 days.
>>>>>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>>>>>> "currently failed to authpin local pins". Metadata pool usage grew by 10
>>>>>> GB in those 2 days.
>>>>>>
>>>>>> Rebooting the node to force a client eviction solved the issue, and now
>>>>>> metadata usage is down again, and all stuck requests were processed quickly.
>>>>>>
>>>>>> Is there any idea on what could cause something like that? On the client,
>>>>>> there was no CPU load, but many processes were waiting for cephfs to respond.
>>>>>> Syslog did not yield anything. It only affected one user and his user
>>>>>> directory.
>>>>>>
>>>>>> If there are no ideas: How can I collect good debug information in case
>>>>>> this happens again?
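As a starting point for the "what to collect next time" question, here is a sketch of the admin-socket dumps that are usually worth capturing before evicting the stuck client. The socket paths and daemon names are placeholders, and the command set is an assumption based on the Luminous-era (12.2.x) admin socket; verify each command with `ceph daemon <sock> help` on your own nodes first:

```shell
# Sketch: capture state from both sides before evicting the stuck client.
# Paths and daemon names below are examples -- adjust to your deployment.

MDS_SOCK=/var/run/ceph/ceph-mds.mon001.asok            # example path
CLIENT_SOCK=/var/run/ceph/ceph-client.admin.1234.asok  # example path

collect_debug_info() {
    # On the active MDS: which requests are stuck, and which sessions hold caps.
    ceph daemon "$MDS_SOCK" dump_ops_in_flight  > mds_ops.json
    ceph daemon "$MDS_SOCK" session ls          > mds_sessions.json
    # On the ceph-fuse client: in-flight MDS requests and OSD requests
    # (the objecter_requests check suggested earlier in the thread).
    ceph --admin-daemon "$CLIENT_SOCK" mds_requests      > client_mds_requests.json
    ceph --admin-daemon "$CLIENT_SOCK" objecter_requests > client_osd_requests.json
}
```

Collecting these while the hang is still in progress preserves the evidence that a reboot-and-evict otherwise destroys.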
>>>>>>
>>>>>> Cheers,
>>>>>> Oliver
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>
>>>>>> https://protect-au.mimecast.com/s/Zl9aCXLKNwFxY9nNc6jQJC?domain=lists.ceph.com
>>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>
>>>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com