Re: Clients failing to respond to capability release

Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> · Thu, 12 Oct 2023 09:22:03 +0100

Hi Tim,

We've been seeing something that may be similar to what you describe, with
concurrent MDS_CLIENT_LATE_RELEASE and MDS_SLOW_REQUEST warnings as well as
frequent MDS_CLIENT_RECALL and MDS_SLOW_METADATA_IO warnings from the
same MDS referring to the same client. We are using 1 MDS for our 
non-containerised filesystem on a 5.7 PiB sized Alma8.8 cluster with 29 
nodes and 348 spinning OSDs for bulk data and 4 OSDs for the metadata on 
NVMe SSDs with 174 million files (we also have additional NVMe drive 
partitions as DB and WAL devices for each OSD). All of our clients are 
using the kernel mount where we also have a SMB gateway which kernel 
mounts the filesystem and shares it to Windows and Mac machines.

This problem seems to have various symptoms including but not limited 
to: (i) a particular file or directory hanging on openfs, read, and/or
statx system calls for all clients mounting the filesystem, (ii) all our 
connected clients hanging on the aforementioned system calls when 
performing any metadata or bulk data I/O on the filesystem, (iii) 1 
client being unable to stat or cd into the filesystem whilst all other 
clients are completely unaffected, and (iv) possibly a near complete 
loss of metadata due to the MDS journal swelling beyond the size of the 
MDS pool where we eventually had some success following the recovery 
docs (https://docs.ceph.com/en/quincy/cephfs/disaster-recovery-experts/) 
but opted for a backup restore instead (we were able to restore the bulk 
data to a read-only state but not all the metadata where we would be 
more than happy to write a "worked example" of this if the community 
would find it useful). We used to solve this by forcefully re-mounting
the filesystem on the client but as we run multi-user systems and have 
100s of clients, this was not a sustainable option for us.

We notice generally 3 classes of error:

 * The oldest slow or blocked request in the ceph-mds log is waiting to
   acquire locks for an inode (visible via `ceph tell mds.N
   dump_blocked_ops`). By tracing the oldest FS client which has the
   capability for this file and seeing which OSD the client is waiting
   for with `watch 'cat /sys/kernel/debug/ceph/*/osdc'` or by looking
   at the output of `ceph tell mds.N objecter requests`, we can
   determine which exact OSD is being slow (always a spinning disk).
   These issues are due to hardware issues or the OSD crashing (where a
   restart seems to remedy the problem) and thus counterintuitively (at
   least for me) suggests that our error messages for the MDS can refer
   to OSD hard-drive problems. These HDDs are in a completely separate
   pool to where I would have thought the metadata I/O would be being
   performed since our metadata pool is exclusively SSDs.
 * The oldest, or near oldest, blocked MDS op being an "internal op
   fragmentdir" operation and many client_requests at the flag_point
   "failed to authpin, dir is being fragmented". This seems to be an
   MDS bug as the documentation suggests that this should only happen
   when in a multi-MDS system but we have been using only 1 MDS. A
   simple restart of the MDS solves this problem.
 * Client has a crash and either has been blocklisted or should be.
   Blocklisted clients may not immediately know they've been evicted
   and thus may be hanging indefinitely when stating the filesystem.
   This can be detected by searching through the client list via `ceph
   tell mds.N client ls` and seeing whether there is an ID associated
   with the client. However whilst a manual re-mount will allow the
   former client to establish a new session, we prefer to use the
   "recover_session=clean" mount option to do this automatically and
   now we have no incidents of this.
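
As a rough illustration of the triage for the first two classes, here is a
minimal Python sketch that takes the JSON printed by `ceph tell mds.N
dump_blocked_ops`, picks the oldest op and sorts it into one of the classes
above. The field names (`ops`, `age`, `description`, `type_data.flag_point`)
and the lock-wait flag point text are assumptions about the JSON layout and
may differ between releases. The sample data is made up, not taken from a
real cluster.

```python
import json


def oldest_blocked_op(dump):
    """Return the op with the largest age, or None if nothing is blocked."""
    ops = dump.get("ops", [])
    return max(ops, key=lambda op: op.get("age", 0.0), default=None)


def classify(op):
    """Sort a blocked op into the rough classes described above."""
    desc = op.get("description", "")
    flag = op.get("type_data", {}).get("flag_point", "")
    # Class 2: the fragmentdir op itself, or requests stuck behind it.
    if "fragmentdir" in desc or "dir is being fragmented" in flag:
        return "fragmentdir"
    # Class 1: waiting on an inode lock (assumed flag point wording).
    if "failed to" in flag and "lock" in flag:
        return "lock-wait"
    return "other"


# Hypothetical sample shaped like the assumed output.
SAMPLE = json.loads("""
{
  "ops": [
    {"description": "client_request(client.1001:42 getattr #0x10000000001)",
     "age": 95.2,
     "type_data": {"flag_point": "failed to rdlock, waiting"}},
    {"description": "internal op fragmentdir:mds.0:1",
     "age": 310.7,
     "type_data": {"flag_point": "dispatched"}}
  ],
  "num_blocked_ops": 2
}
""")

oldest = oldest_blocked_op(SAMPLE)
print(classify(oldest), oldest["description"])
```

In practice the dump would be loaded from the command output (for example
with `json.load(sys.stdin)`) rather than from an inline sample.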

Yesterday for example, we had an incident of multiple MDS warning 
messages: MDS_SLOW_REQUEST, MDS_TRIM, MDS_CLIENT_RECALL, and
MDS_CLIENT_LATE_RELEASE. This was caused by a non-responsive hard-drive
leading to a build-up of the MDS cache and trims being unable to
complete. We managed to narrow down the hard-drive for the inode
which the blocked client was waiting for an rdlock on and restarted the
OSD for that drive. Notably this hard-drive had no errors with
smartctl or elsewhere and only had the following slow ops message in the OSD
systemctl status/log:

    osd.281 123892 get_health_metrics reporting 4 slow ops, oldest is
    osd_op(client.49576654.0:9101 3.d9bs0 3.ee6f2d9b (undecoded)
    ondisk+read+known_if_redirected e123889)
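
For anyone wanting to scan OSD logs for this kind of message, a minimal
Python sketch follows. It pulls the OSD id, the slow op count and the client
number out of a line like the one above. The regular expression is only
fitted to this one log line, so the exact wording in other releases is an
assumption, and the client number should correspond to the client's session
ID but that is worth double-checking against `client ls`.

```python
import re

# Pattern fitted to the single log line quoted above (an assumption for
# other Ceph releases, where the wording may differ).
SLOW_OPS_RE = re.compile(
    r"osd\.(?P<osd>\d+)\s+\d+\s+get_health_metrics reporting "
    r"(?P<count>\d+) slow ops, oldest is osd_op\(client\.(?P<client>\d+)\."
)


def parse_slow_ops(line):
    """Return (osd_id, slow_op_count, client_id), or None if no match."""
    # Log lines get wrapped in mail and terminals, so collapse whitespace.
    m = SLOW_OPS_RE.search(" ".join(line.split()))
    if m is None:
        return None
    return int(m["osd"]), int(m["count"]), int(m["client"])


LINE = (
    "osd.281 123892 get_health_metrics reporting 4 slow ops, oldest is "
    "osd_op(client.49576654.0:9101 3.d9bs0 3.ee6f2d9b (undecoded) "
    "ondisk+read+known_if_redirected e123889)"
)

print(parse_slow_ops(LINE))
```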

Whilst restarting the OSD for the hard-drive that the client with the oldest
blocked op was waiting for did "clear" this /sys/kernel/debug/ceph/*/osdc queue,
the oldest blocked MDS op then became an "internal op 
fragmentdir:mds.0:1" one where restarting the active MDS cleared this. 
Alas, this resulted in another blocked getattr op at the flag point 
"failed to authpin, dir is being fragmented" which was similarly tackled 
by restarting the MDS that just took over. This finally resulted in only 
two clients failing to respond to caps releases on inodes they were 
holding (despite rebooting at the time) where performing a "ceph tell 
mds.N session kill CLIENT_ID" removed them from the session map and 
allowed the MDS cache to become manageable again, thereby clearing all of
these warning messages.
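
To find the CLIENT_ID values for a given machine before killing its sessions,
something like the following Python sketch could be run over the JSON from
`ceph tell mds.N client ls`. The field names (`id`, `num_caps`,
`client_metadata.hostname`) are assumptions about that output, and the
hostnames and numbers in the sample are invented.

```python
import json


def sessions_for_host(sessions, hostname):
    """Return (id, num_caps) pairs for a hostname, most caps first."""
    hits = [
        (s["id"], s.get("num_caps", 0))
        for s in sessions
        if s.get("client_metadata", {}).get("hostname") == hostname
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample shaped like the assumed `client ls` output.
SESSIONS = json.loads("""
[
  {"id": 1001, "num_caps": 250000, "client_metadata": {"hostname": "node-a"}},
  {"id": 1002, "num_caps": 12,     "client_metadata": {"hostname": "node-b"}},
  {"id": 1003, "num_caps": 900,    "client_metadata": {"hostname": "node-a"}}
]
""")

for client_id, caps in sessions_for_host(SESSIONS, "node-a"):
    print(client_id, caps)
```

Sorting by cap count is just a convenience here, since the session holding
the most caps is usually the first one worth looking at.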

We've had this problem since the beginning of this year and upgrading 
from octopus to quincy has unfortunately not solved our problem. We've 
only really been able to solve this problem by undergoing an aggressive 
campaign of replacing hard-drives which were reaching the end of their 
lives. This has substantially reduced the number of problems we've had
in relation to this.

We would be very interested to hear about the rest of the community's 
experience in relation to this and I would recommend looking at your 
underlying OSDs, Tim, to see whether there are any timeout or
uncorrectable errors. We would also be very eager to hear if these 
approaches are sub-optimal and whether anyone else has any insight into 
our problems. Sorry as well for resurrecting an old thread but we 
thought our experiences may be helpful for others!

Kindest regards,

Ivan Clayson

On 19/09/2023 12:35, Tim Bishop wrote:
Hi,

I've seen this issue mentioned in the past, but with older releases. So
I'm wondering if anybody has any pointers.

The Ceph cluster is running Pacific 16.2.13 on Ubuntu 20.04. Almost all
clients are working fine, with the exception of our backup server. This
is using the kernel CephFS client on Ubuntu 22.04 with kernel 6.2.0 [1]
(so I suspect a newer Ceph version?).

The backup server has multiple (12) CephFS mount points. One of them,
the busiest, regularly causes this error on the cluster:

HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
     mds.mds-server(mds.0): Client backupserver:cephfs-backupserver failing to respond to capability release client_id: 521306112

And occasionally, which may be unrelated, but occurs at the same time:

[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
     mds.mds-server(mds.0): 1 slow requests are blocked > 30 secs

The second one clears itself, but the first sticks until I can unmount
the filesystem on the client after the backup completes.

It appears that whilst it's in this stuck state there may be one or more
directory trees that are inaccessible to all clients. The backup server
is walking the whole tree but never gets stuck itself, so either the
inaccessible directory entry is caused after it has gone past, or it's
not affected. Maybe the backup server is holding a directory when it
shouldn't?

It may be that an upgrade to Quincy resolves this, since it's more
likely to be in line with the kernel client version-wise, but I don't
want to knee-jerk upgrade just to try and fix this problem.

Thanks for any advice.

Tim.

[1] The reason for the newer kernel is that the backup performance from
CephFS was terrible with older kernels. This newer kernel does at least
resolve that issue.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


--
Ivan Clayson
-----------------
Scientific Computing Officer
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH