Hi Paul,

On 16.04.2018 at 17:51, Paul Emmerich wrote:
> Hi,
>
> can you try to get a stack trace from ganesha (with gdb or from procfs) when it's stuck?

I can try, as soon as it happens again. The problem is that it's not fully stuck - only the other clients are stuck when trying to access the directory in question, since the Ganesha server seems to hold a lock and never give it back. I am not sure a backtrace can shed much light on this: the NFS server itself appears completely happy all the time, it only runs into problems when I want to stop / restart it. But of course, I will try to get a stack trace next time.

> Also, try to upgrade to ganesha 2.6. I'm running a bigger deployment with ~30 ganesha 2.6
> gateways that are quite stable so far.

That's good news! Do you also have parallel access from ceph-fuse clients to the same CephFS? Is this also Luminous?

We are currently using the packages from https://eu.ceph.com/nfs-ganesha/, since we would rather not build NFS Ganesha against Ceph ourselves, but would love to just follow upstream to save the maintenance burden. Are you building packages yourself, or using a repo maintained upstream?

I see there have been some locking fixes upstream:
https://github.com/nfs-ganesha/nfs-ganesha/search?q=ceph+lock&type=Commits
but I'm not expert enough to deduce whether they apply to our observation.
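For my own notes, this is roughly what I intend to run the next time it hangs - a minimal sketch, assuming the daemon process is named ganesha.nfsd and that reading kernel stacks from procfs requires root:

```shell
#!/bin/sh
# Userspace backtraces of all ganesha threads via gdb (attach, dump, detach):
#   gdb -p "$(pidof ganesha.nfsd)" -batch -ex "thread apply all bt"

# Kernel-side stacks via procfs, without needing gdb. Written as a small
# function since it works for any PID, not just ganesha's.
dump_kernel_stacks() {
    pid=$1
    for task in /proc/"$pid"/task/*; do
        printf '=== thread %s ===\n' "${task##*/}"
        # /proc/<pid>/task/<tid>/stack is typically readable only by root
        cat "$task/stack" 2>/dev/null || echo "(no access - run as root)"
    done
}

# Usage: dump_kernel_stacks "$(pidof ganesha.nfsd)"
```

The gdb variant shows where the userspace threads sit (e.g. blocked inside a libcephfs call); the procfs variant shows whether a thread is stuck in the kernel, e.g. in an uninterruptible wait.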
Cheers and thanks,
Oliver

>
> Paul
>
> 2018-04-16 17:30 GMT+02:00 Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx>:
>
> On 16.04.2018 at 08:58, Oliver Freyermuth wrote:
> > On 16.04.2018 at 02:43, Oliver Freyermuth wrote:
> >> On 15.04.2018 at 23:04, John Spray wrote:
> >>> On Fri, Apr 13, 2018 at 5:16 PM, Oliver Freyermuth
> >>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> >>>> Dear Cephalopodians,
> >>>>
> >>>> in our cluster (CentOS 7.4, EC pool, Snappy compression, Luminous 12.2.4),
> >>>> we often have all (~40) clients accessing one file in read-only mode, even with multiple processes per client doing that.
> >>>>
> >>>> Sometimes (I do not yet know when, nor why!) the MDS ends up in a situation like:
> >>>> -----------------------------------------------------------
> >>>> 2018-04-13 18:08:34.378888 7f1ce4472700 0 log_channel(cluster) log [WRN] : 292 slow requests, 5 included below; oldest blocked for > 1745.864417 secs
> >>>> 2018-04-13 18:08:34.378900 7f1ce4472700 0 log_channel(cluster) log [WRN] : slow request 960.563534 seconds old, received at 2018-04-13 17:52:33.815273: client_request(client.34720:16487379 getattr pAsLsXsFs #0x1000009ff6d 2018-04-13 17:52:33.814904 caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
> >>>> 2018-04-13 18:08:34.378904 7f1ce4472700 0 log_channel(cluster) log [WRN] : slow request 30.636678 seconds old, received at 2018-04-13 18:08:03.742128: client_request(client.34302:16453640 getattr pAsLsXsFs #0x1000009ff6d 2018-04-13 18:08:03.741630 caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
> >>>> 2018-04-13 18:08:34.378908 7f1ce4472700 0 log_channel(cluster) log [WRN] : slow request 972.648926 seconds old, received at 2018-04-13 17:52:21.729881: client_request(client.34720:16487334 lookup #0x1000001fcab/sometarball.tar.gz 2018-04-13 17:52:21.729450 caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
> >>>> 2018-04-13 18:08:34.378913 7f1ce4472700 0 log_channel(cluster) log [WRN] : slow request 1685.953657 seconds old, received at 2018-04-13 17:40:28.425149: client_request(client.34810:16564864 lookup #0x1000001fcab/sometarball.tar.gz 2018-04-13 17:40:28.424961 caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
> >>>> 2018-04-13 18:08:34.378918 7f1ce4472700 0 log_channel(cluster) log [WRN] : slow request 1552.743795 seconds old, received at 2018-04-13 17:42:41.635012: client_request(client.34302:16453566 getattr pAsLsXsFs #0x1000009ff6d 2018-04-13 17:42:41.634726 caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
> >>>> -----------------------------------------------------------
> >>>> As you can see (oldest blocked for > 1745.864417 secs) it stays in that situation for quite a while.
> >>>> The number of blocked requests is also not decreasing, but instead slowly increasing whenever a new request is added to the queue.
> >>>>
> >>>> We have a setup of one active MDS, a standby-replay, and a standby.
> >>>> Triggering a failover does not help, it only resets the "oldest blocked" time.
> >>>
> >>> Sounds like a client issue (a client is holding a lock on a file but
> >>> failing to relinquish it for another client's request to be
> >>> processed).
> >>>
> >>> Are these kernel (and what version?) or fuse clients?
> >>
> >> The full cluster is running with FUSE clients, all on 12.2.4.
> >> Additionally, there are 6 NFS Ganesha servers using FSAL_CEPH, i.e. libcephfs.
> >>
> >> Cheers,
> >> Oliver
> >
> > Related to that: in case it happens again, which it surely will sooner or later, how can I diagnose
> > which client is holding a lock and not relinquishing it?
> > Is there a way to dump all held locks, ideally with the time period for which they were held?
> >
> >     ceph daemon mds.XXX help
> > does not yield anything obvious.
> >
> > Cheers and thanks,
> > Oliver
>
> It happened again - and I managed to track down the client causing it:
> dumping all operations in progress on the MDS and looking for machines without operations waiting for an rdlock,
> one NFS Ganesha server was not affected.
> Unmounting the NFS share on the clients did not have any effect - things stayed in the "stuck" state.
>
> I then stopped the NFS Ganesha server. This took quite a while, until systemd noticed it was taking too long and force-killed the server.
> Shortly after, even before the NFS Ganesha libcephfs client was evicted, the MDS got unstuck, processed all requests within about 2 seconds, and everything was fine again.
>
> After that, the dead NFS Ganesha client was evicted, and I just restarted the NFS Ganesha server.
>
> Since this appears to be reproducible (even though we still do not know exactly how), this is a rather ugly issue. It seems NFS Ganesha with libcephfs is holding locks
> and not returning them. We are using the packages from:
> https://eu.ceph.com/nfs-ganesha/
> Newer versions are available upstream, but I don't know if they fix this critical issue.
>
> Is there somebody else with experience on that?
>
> As far as I can reproduce it, the I/O pattern is:
> 1. A machine with the NFS mount checks that datafolder/sometarball.tar.gz exists.
> 2. Machines in the cluster (with ceph-fuse) all access datafolder/sometarball.tar.gz (and extract it to a local filesystem).
>
> (1) may happen several times while (2) is going on, and it seems this somehow causes the NFS server to (sometimes?) keep a persistent lock which is never relinquished.
>
> Any help appreciated!
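(For the archives: the filtering I used to track down the client boils down to something like the following - a sketch only, as the JSON is a hand-abbreviated stand-in for real `ceph daemon mds.<name> ops` output, with the client ids taken from the slow-request messages quoted above.)

```shell
#!/bin/sh
# Stand-in for:  ceph daemon mds.mon001 ops > ops.json
# (hand-abbreviated sample; the real output carries many more fields per op)
cat > ops.json <<'EOF'
{ "ops": [
  { "description": "client_request(client.34720:16487379 getattr pAsLsXsFs #0x1000009ff6d)",
    "flag_point": "failed to rdlock, waiting" },
  { "description": "client_request(client.34302:16453640 getattr pAsLsXsFs #0x1000009ff6d)",
    "flag_point": "failed to rdlock, waiting" }
] }
EOF

# Unique client ids behind the stuck requests; any client id that never
# shows up here is a candidate for being the lock holder.
grep -B1 'failed to rdlock' ops.json | grep -o 'client\.[0-9]*' | sort -u
# prints:
#   client.34302
#   client.34720
```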
> Oliver
>
> >>
> >>>
> >>> John
> >>>
> >>>>
> >>>> I checked the following things on the active MDS:
> >>>> -----------------------------------------------------------
> >>>> # ceph daemon mds.mon001 objecter_requests
> >>>> {
> >>>>     "ops": [],
> >>>>     "linger_ops": [],
> >>>>     "pool_ops": [],
> >>>>     "pool_stat_ops": [],
> >>>>     "statfs_ops": [],
> >>>>     "command_ops": []
> >>>> }
> >>>> # ceph daemon mds.mon001 ops | grep event | grep -v "initiated" | grep -v "failed to rdlock" | grep -v events
> >>>> => no output, only "initiated" and "rdlock" are in the queue.
> >>>> -----------------------------------------------------------
> >>>>
> >>>> There's also almost no CPU load, almost no other I/O, and ceph is deep-scrubbing ~1 pg (this also finishes and the next pg is scrubbed fine),
> >>>> and the scrubbing is not even happening in the metadata pool (easy to see in the Luminous dashboard):
> >>>> -----------------------------------------------------------
> >>>> # ceph -s
> >>>>   cluster:
> >>>>     id:     some_funny_hash
> >>>>     health: HEALTH_WARN
> >>>>             1 MDSs report slow requests
> >>>>
> >>>>   services:
> >>>>     mon: 3 daemons, quorum mon003,mon001,mon002
> >>>>     mgr: mon001(active), standbys: mon002, mon003
> >>>>     mds: cephfs_baf-1/1/1 up {0=mon001=up:active}, 1 up:standby-replay, 1 up:standby
> >>>>     osd: 196 osds: 196 up, 196 in
> >>>>
> >>>>   data:
> >>>>     pools:   2 pools, 4224 pgs
> >>>>     objects: 15649k objects, 61761 GB
> >>>>     usage:   114 TB used, 586 TB / 700 TB avail
> >>>>     pgs:     4223 active+clean
> >>>>              1    active+clean+scrubbing+deep
> >>>>
> >>>>   io:
> >>>>     client: 175 kB/s rd, 3 op/s rd, 0 op/s wr
> >>>> -----------------------------------------------------------
> >>>>
> >>>> Does anybody have any idea what's going on here?
> >>>>
> >>>> Yesterday, this also happened, but resolved itself after about 1 hour.
> >>>> Right now, it's going on for about half an hour...
> >>>>
> >>>> Cheers,
> >>>> Oliver
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users@xxxxxxxxxxxxxx
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Paul Emmerich
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
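P.S.: For anyone hitting the same issue, the client ids from the slow-request messages can be matched to hosts via the MDS session list. A sketch with our MDS name - note that I have not verified the exact eviction syntax on 12.2.x, so please check `ceph daemon mds.<name> help` first:

```shell
# List all client sessions of the active MDS; each entry shows the client id
# together with the client's address, so ids from the slow-request log can
# be mapped to machines:
ceph daemon mds.mon001 session ls

# Manual fallback if a dead client is not evicted on its own
# (disruptive: that client loses its capabilities):
ceph daemon mds.mon001 session evict <client-id>
```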
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com