Re: CephFS MDS stuck (failed to rdlock when getattr / lookup)

Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> · Mon, 16 Apr 2018 17:30:16 +0200

Am 16.04.2018 um 08:58 schrieb Oliver Freyermuth:
> Am 16.04.2018 um 02:43 schrieb Oliver Freyermuth:
>> Am 15.04.2018 um 23:04 schrieb John Spray:
>>> On Fri, Apr 13, 2018 at 5:16 PM, Oliver Freyermuth
>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>> Dear Cephalopodians,
>>>>
>>>> in our cluster (CentOS 7.4, EC Pool, Snappy compression, Luminous 12.2.4),
>>>> we often have all (~40) clients accessing one file in readonly mode, even with multiple processes per client doing that.
>>>>
>>>> Sometimes (I do not yet know when, nor why!) the MDS ends up in a situation like:
>>>> -----------------------------------------------------------
>>>> 2018-04-13 18:08:34.378888 7f1ce4472700  0 log_channel(cluster) log [WRN] : 292 slow requests, 5 included below; oldest blocked for > 1745.864417 secs
>>>> 2018-04-13 18:08:34.378900 7f1ce4472700  0 log_channel(cluster) log [WRN] : slow request 960.563534 seconds old, received at 2018-04-13 17:52:33.815273: client_request(client.34720:16487379 getattr pAsLsXsFs #0x1000009ff6d 2018-04-13 17:52:33.814904 caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
>>>> 2018-04-13 18:08:34.378904 7f1ce4472700  0 log_channel(cluster) log [WRN] : slow request 30.636678 seconds old, received at 2018-04-13 18:08:03.742128: client_request(client.34302:16453640 getattr pAsLsXsFs #0x1000009ff6d 2018-04-13 18:08:03.741630 caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
>>>> 2018-04-13 18:08:34.378908 7f1ce4472700  0 log_channel(cluster) log [WRN] : slow request 972.648926 seconds old, received at 2018-04-13 17:52:21.729881: client_request(client.34720:16487334 lookup #0x1000001fcab/sometarball.tar.gz 2018-04-13 17:52:21.729450 caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
>>>> 2018-04-13 18:08:34.378913 7f1ce4472700  0 log_channel(cluster) log [WRN] : slow request 1685.953657 seconds old, received at 2018-04-13 17:40:28.425149: client_request(client.34810:16564864 lookup #0x1000001fcab/sometarball.tar.gz 2018-04-13 17:40:28.424961 caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
>>>> 2018-04-13 18:08:34.378918 7f1ce4472700  0 log_channel(cluster) log [WRN] : slow request 1552.743795 seconds old, received at 2018-04-13 17:42:41.635012: client_request(client.34302:16453566 getattr pAsLsXsFs #0x1000009ff6d 2018-04-13 17:42:41.634726 caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
>>>> -----------------------------------------------------------
>>>> As you can see (oldest blocked for > 1745.864417 secs) it stays in that situation for quite a while.
>>>> The number of blocked requests is also not decreasing, but instead slowly increasing whenever a new request is added to the queue.
>>>>
>>>> We have a setup of one active MDS, a standby-replay, and a standby.
>>>> Triggering a failover does not help, it only resets the "oldest blocked" time.
>>>
>>> Sounds like a client issue (a client is holding a lock on a file but
>>> failing to relinquish it for another client's request to be
>>> processed).
>>>
>>> Are these kernel (and what version?) or fuse clients?
>>
>> The full cluster is running with Fuse clients, all on 12.2.4. 
>> Additionally, there are 6 NFS Ganesha servers using FSAL_CEPH, i.e. libcephfs. 
>>
>> Cheers,
>> 	Oliver
> 
> Related to that: In case it happens again, which it surely will sooner or later, how can I diagnose
> which client is holding a lock and not relinquishing it? 
> Is there a way to dump all held locks, ideally with the time period how long they were held? 
> 
> ceph daemon mds.XXX help
> does not yield anything obvious. 
> 
> Cheers and thanks,
> 	Oliver

It happened again - and I managed to track down the client causing it: 
Dumping all operations in progress on the MDS, and looking for machines without operations which were waiting for an rdlock,
one NFS Ganesha server was not affected. 
Unmounting the NFS share on the clients did not have any effect - things stayed in "stuck" state. 

I then stopped the NFS Ganesha server. This took quite a while, until systemd noticed it's taking too long, and force-killed the server. 
Shortly after, even before the NFS Ganesha libchepfs client was evicted, the MDS got unstuck, processed all requests within about 2 seconds, and everything was fine again. 

After that, the dead NFS Ganesha client was evicted, and I just restarted the NFS Ganesha server. 

Since this appears to be reproducible (even though we still do not know how exactly), this is a rather ugly issue. It seems NFS Ganesha with libcephfs is holding locks
and not returning them. We are using the packages from:
https://eu.ceph.com/nfs-ganesha/
Newer versions are available upstream, but I don't know if they fix this critial issue. 

Is there somebody else with experience on that? 

As far as I can reproduce, the I/O pattern is:
1. Machine with NFS-mount checks datafolder/sometarball.tar.gz exists. 
2. Machines in the cluster (with ceph-fuse) all access datafolder/sometarball.tar.gz (and extract it to a local filesystem)

(1) may happen several times while (2) is going on, and it seems this somehow causes the NFS-server to (sometimes?) keep a persistent lock which is never relinquished. 

Any help appreciated!
Oliver

> 
>>
>>>
>>> John
>>>
>>>>
>>>> I checked the following things on the active MDS:
>>>> -----------------------------------------------------------
>>>> # ceph daemon mds.mon001 objecter_requests
>>>> {
>>>>     "ops": [],
>>>>     "linger_ops": [],
>>>>     "pool_ops": [],
>>>>     "pool_stat_ops": [],
>>>>     "statfs_ops": [],
>>>>     "command_ops": []
>>>> }
>>>> # ceph daemon mds.mon001 ops | grep event | grep -v "initiated" | grep -v "failed to rdlock" | grep -v events
>>>> => no output, only "initiated" and "rdlock" are in the queue.
>>>> -----------------------------------------------------------
>>>>
>>>> There's also almost no CPU load, almost no other I/O, and ceph is deep-scrubbing ~pg (this also finishes and the next pg is scrubbed fine),
>>>> and the scrubbing is not even happening in the metadata pool (easy to see in the Luminous dashboard):
>>>> -----------------------------------------------------------
>>>> # ceph -s
>>>>   cluster:
>>>>     id:     some_funny_hash
>>>>     health: HEALTH_WARN
>>>>             1 MDSs report slow requests
>>>>
>>>>   services:
>>>>     mon: 3 daemons, quorum mon003,mon001,mon002
>>>>     mgr: mon001(active), standbys: mon002, mon003
>>>>     mds: cephfs_baf-1/1/1 up  {0=mon001=up:active}, 1 up:standby-replay, 1 up:standby
>>>>     osd: 196 osds: 196 up, 196 in
>>>>
>>>>   data:
>>>>     pools:   2 pools, 4224 pgs
>>>>     objects: 15649k objects, 61761 GB
>>>>     usage:   114 TB used, 586 TB / 700 TB avail
>>>>     pgs:     4223 active+clean
>>>>              1    active+clean+scrubbing+deep
>>>>
>>>>   io:
>>>>     client:   175 kB/s rd, 3 op/s rd, 0 op/s wr
>>>> -----------------------------------------------------------
>>>>
>>>> Does anybody have any idea what's going on here?
>>>>
>>>> Yesterday, this also happened, but resolved itself after about 1 hour.
>>>> Right now, it's going on for about half an hour...
>>>>
>>>> Cheers,
>>>>         Oliver
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 

Attachment:
smime.p7s

Description: S/MIME Cryptographic Signature
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com