Re: Ceph-fuse getting stuck with "currently failed to authpin local pins"

I get the feeling this is not dependent on the exact Ceph version... 

In our case, I know what the user has done (and he won't do it again). He misunderstood how our cluster works and started 1100 cluster jobs,
all entering the very same directory on CephFS (mounted via ceph-fuse on 38 machines), all running "make clean; make -j10 install". 
So 1100 processes from 38 clients were trying to lock / delete / write the very same files. 
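For illustration, each job did roughly the following (the directory path below is of course just a placeholder):

    # run concurrently by ~1100 jobs on 38 ceph-fuse clients,
    # all inside the very same CephFS directory
    cd /cephfs/shared/build/dir
    make clean
    make -j10 install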

In parallel, an IDE (Eclipse) and an indexing service (zeitgeist...) may have accessed the very same directory via nfs-ganesha, since the user had mounted the NFS-exported directory into his desktop home directory via sshfs... 

So I can't really blame CephFS for becoming as unhappy as I would become myself. 
However, I would have hoped it would not enter a "stuck" state in which only client eviction will help... 
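(For completeness, the eviction itself is simple enough once the stuck session is identified — <name> and <session-id> below are placeholders:

    # list client sessions on the active MDS
    ceph daemon mds.<name> session ls
    # evict the offending session
    ceph tell mds.<name> client evict id=<session-id>

but of course one would prefer the MDS to recover on its own.)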

Cheers,
	Oliver


On 29.05.2018 at 03:26, Linh Vu wrote:
> I had the exact opposite experience with the same error message "currently failed to authpin local pins". We had a few clients on ceph-fuse 12.2.2 and they ran into those issues a lot (evicting works). Upgrading to ceph-fuse 12.2.5 fixed it. The main cluster is on 12.2.4.
> 
> 
> The cause is a user's HPC jobs, or even just their logins on multiple nodes, accessing the same files in a particular way. It doesn't happen to other users. We haven't dug into it deeply, as upgrading to 12.2.5 fixed our problem. 
> 
> ------------------------------------------------------------------------
> *From:* ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx>
> *Sent:* Tuesday, 29 May 2018 7:29:06 AM
> *To:* Paul Emmerich
> *Cc:* Ceph Users; Peter Wienemann
> *Subject:* Re:  Ceph-fuse getting stuck with "currently failed to authpin local pins"
>  
> Dear Paul,
> 
> On 28.05.2018 at 20:16, Paul Emmerich wrote:
>> I encountered the exact same issue earlier today immediately after upgrading a customer's cluster from 12.2.2 to 12.2.5.
>> I've evicted the session and restarted the ganesha client to fix it, as I also couldn't find any obvious problem.
> 
> interesting! In our case, the client with the problem (it happened again a few hours later...) was always a ceph-fuse client. Evicting / rebooting the client node helped.
> However, it may well be that the original issue was caused by a Ganesha client, which we also use (and the user in question who complained was accessing files in parallel via NFS and ceph-fuse),
> but I don't have a clear indication of that.
> 
> Cheers,
>         Oliver
> 
>> 
>> Paul
>> 
>> 2018-05-28 16:38 GMT+02:00 Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx>:
>> 
>>     Dear Cephalopodians,
>> 
>>     we just had a "lockup" of many MDS requests, and trimming also fell behind, for over 2 days.
>>     One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status "currently failed to authpin local pins". Metadata pool usage grew by 10 GB in those 2 days.
>> 
>>     Rebooting the node to force a client eviction solved the issue, and now metadata usage is down again, and all stuck requests were processed quickly.
>> 
>>     Is there any idea on what could cause something like that? On the client, there was no CPU load, but many processes were waiting for cephfs to respond.
>>     Syslog did not yield anything. It only affected one user and his user directory.
>> 
>>     If there are no ideas: How can I collect good debug information in case this happens again?
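>>     A starting point might be the MDS admin socket and raising log levels, e.g. (with <name> being the MDS id):
>> 
>>         # show requests currently stuck in the MDS
>>         ceph daemon mds.<name> dump_ops_in_flight
>>         # list client sessions and their caps
>>         ceph daemon mds.<name> session ls
>>         # temporarily raise MDS debug logging
>>         ceph tell mds.<name> injectargs '--debug_mds 10'
>> 
>>     but I'm unsure what would be most helpful to the developers here.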
>> 
>>     Cheers,
>>             Oliver
>> 
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Paul Emmerich
>> 
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>> 
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
> 



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
