We've been seeing something that may be similar to what you describe:
concurrent MDS_CLIENT_LATE_RELEASE and MDS_SLOW_REQUEST warnings,
frequently accompanied by MDS_CLIENT_RECALL and MDS_SLOW_METADATA_IO
warnings from the same MDS and referring to the same client. We run a
single MDS for our non-containerised filesystem on a 5.7 PiB Alma 8.8
cluster with 29 nodes, 348 spinning OSDs for bulk data, and 4 OSDs on
NVMe SSDs for the metadata, holding 174 million files (each OSD also has
additional NVMe drive partitions as DB and WAL devices). All of our
clients use the kernel mount, including an SMB gateway which
kernel-mounts the filesystem and shares it to Windows and Mac machines.
This problem seems to have various symptoms, including but not limited
to: (i) a particular file or directory hanging on openfs, read, and/or
statx system calls for all clients mounting the filesystem, (ii) all our
connected clients hanging on the aforementioned system calls when
performing any metadata or bulk data I/O on the filesystem, (iii) 1
client being unable to stat or cd into the filesystem whilst all other
clients are completely unaffected, and (iv) possibly a near complete
loss of metadata due to the MDS journal swelling beyond the size of the
MDS pool. In that last case we eventually had some success following the
recovery but opted for a backup restore instead: we were able to restore
the bulk data to a read-only state but not all the metadata. We would be
more than happy to write a "worked example" of this if the community
would find it useful. We used to solve these hangs by forcefully
re-mounting the filesystem on the client, but as we run multi-user
systems and have 100s of clients, this was not a sustainable option for
us.
We notice generally 3 classes of error:
* The oldest slow or blocked request in the ceph-mds log is waiting to
acquire locks for an inode (visible via `ceph tell mds.N
dump_blocked_ops`). By tracing the oldest FS client which has the
capability for this file and seeing which OSD the client is waiting
for with `watch 'cat /sys/kernel/debug/ceph/*/osdc'` or by looking
at the output of `ceph tell mds.N objecter requests`, we can
determine which exact OSD is being slow (always a spinning disk).
These issues are due to hardware faults or the OSD crashing (where a
restart seems to remedy the problem), which counterintuitively (at
least for me) suggests that MDS warning messages can point to OSD
hard-drive problems. These HDDs are in a completely separate pool
from where I would have expected the metadata I/O to be performed,
since our metadata pool is exclusively SSDs.
* The oldest, or near oldest, blocked MDS op is an "internal op
fragmentdir" operation, with many client_requests stuck at the
flag_point "failed to authpin, dir is being fragmented". This seems to
be an MDS bug, as the documentation suggests that this should only
happen in a multi-MDS system but we have been using only 1 MDS. A
simple restart of the MDS solves this problem.
* The client has crashed and either has been blocklisted or should be.
Blocklisted clients may not immediately know they've been evicted
and thus may hang indefinitely when stat-ing the filesystem.
This can be detected by searching through the client list via `ceph
tell mds.N client ls` and seeing whether there is an ID associated
with the client. Whilst a manual re-mount will allow the
former client to establish a new session, we prefer to use the
"recover_session=clean" mount option to do this automatically, and
we have had no incidents of this since.
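
For anyone wanting to sort blocked ops into the classes above without
reading the dump by eye, here is a minimal Python sketch. It assumes the
JSON from `ceph tell mds.N dump_blocked_ops` has been saved to a file and
that it has a top-level "ops" list whose entries carry "description",
"age" and "type_data"/"flag_point" keys; those field names and the helper
name `summarise_blocked_ops` are my assumptions, so please check them
against the output of your own release.

```python
from collections import Counter


def summarise_blocked_ops(dump):
    """Summarise a parsed dump_blocked_ops result (field names assumed)."""
    ops = dump.get("ops", [])
    flag_points = Counter()
    oldest = None
    for op in ops:
        # Count how many ops are parked at each flag point.
        flag_point = op.get("type_data", {}).get("flag_point", "unknown")
        flag_points[flag_point] += 1
        # Track the op with the largest age (the oldest blocked op).
        if oldest is None or op.get("age", 0.0) > oldest.get("age", 0.0):
            oldest = op

    description = oldest.get("description", "") if oldest else ""
    oldest_flag = oldest.get("type_data", {}).get("flag_point", "") if oldest else ""

    # Rough classification of the oldest op, mirroring the classes above.
    if "fragmentdir" in description or "dir is being fragmented" in oldest_flag:
        hint = "fragmentdir"
    elif "lock" in oldest_flag:
        hint = "lock-wait"
    else:
        hint = "other"

    return {
        "oldest_description": description,
        "oldest_flag_point": oldest_flag,
        "hint": hint,
        "flag_points": dict(flag_points),
    }
```

Usage would be along the lines of loading the saved file with
`json.load` and passing the result in. A "lock-wait" hint only says the
oldest op is waiting on a lock; finding the slow OSD behind it still
needs the osdc / objecter steps described in the first bullet.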
Yesterday, for example, we had an incident with multiple MDS warning
messages: MDS_SLOW_REQUEST, MDS_TRIM, MDS_CLIENT_RECALL, and
MDS_CLIENT_LATE_RELEASE. This was caused by a non-responsive hard-drive
leading to a build-up of the MDS cache and trims being unable to
complete. We managed to narrow down the hard-drive holding the inode
which the blocked client was waiting for a rdlock on, and restarted the
OSD for that drive. Notably, this hard-drive showed no errors in
smartctl or elsewhere; the only sign was the following slow ops message
in the OSD's systemctl status/log:
osd.281 123892 get_health_metrics reporting 4 slow ops, oldest is
osd_op(client.49576654.0:9101 3.d9bs0 3.ee6f2d9b (undecoded)
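
Since this slow ops message was the only sign of trouble, it can be
worth scanning the OSD logs for it. Below is a minimal Python sketch that
pulls the OSD number, the slow op count and the client entity number out
of a message shaped like the one above. The regular expressions are
derived from that single example only, and the function name
`parse_slow_ops` is hypothetical, so other releases may word the message
differently.

```python
import re

# Patterns based solely on the example message quoted above.
SLOW_OPS_RE = re.compile(
    r"osd\.(?P<osd>\d+)\s+\d+\s+get_health_metrics reporting "
    r"(?P<count>\d+) slow ops"
)
CLIENT_RE = re.compile(r"osd_op\(client\.(?P<client>\d+)\.")


def parse_slow_ops(text):
    """Return OSD id, slow op count and client number, or None if absent."""
    # Collapse line wraps so a message split across lines still matches.
    text = " ".join(text.split())
    match = SLOW_OPS_RE.search(text)
    if match is None:
        return None
    client = CLIENT_RE.search(text, match.end())
    return {
        "osd": int(match["osd"]),
        "slow_ops": int(match["count"]),
        "client": int(client["client"]) if client else None,
    }
```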
Whilst restarting the OSD that the client with the oldest blocked
op was waiting for did "clear" this /sys/kernel/debug/ceph/*/osdc queue,
the oldest blocked MDS op then became an "internal op
fragmentdir:mds.0:1" one, which restarting the active MDS cleared.
Alas, this resulted in another blocked getattr op at the flag point
"failed to authpin, dir is being fragmented", which was similarly tackled
by restarting the MDS that had just taken over. This finally left only
two clients failing to respond to caps releases on inodes they were
holding (despite rebooting at the time), where performing a "ceph tell
mds.N session kill CLIENT_ID" removed them from the session map and
allowed the MDS cache to become manageable again, thereby clearing all of
these warning messages.
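
To find the CLIENT_ID for a given machine before a session kill, the
JSON from `ceph tell mds.N client ls` can be searched by hostname. A
minimal Python sketch follows, assuming each session entry has "id",
"num_caps" and "state" keys plus a "client_metadata" dictionary
containing "hostname"; those field names and the helper name
`sessions_for_host` are assumptions on my part rather than something
taken from the documentation.

```python
def sessions_for_host(sessions, hostname):
    """List the MDS sessions whose client metadata names this hostname."""
    matches = []
    for session in sessions:
        metadata = session.get("client_metadata", {})
        if metadata.get("hostname") == hostname:
            matches.append({
                "id": session.get("id"),
                "num_caps": session.get("num_caps"),
                "state": session.get("state"),
            })
    return matches
```

An empty result for a host that believes it is mounted would be
consistent with the evicted client case described earlier.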
We've had this problem since the beginning of this year, and upgrading
from octopus to quincy has unfortunately not solved it. We've only
really been able to make headway by undergoing an aggressive campaign of
replacing hard-drives which were reaching the end of their lives. This
has substantially reduced the number of problems we've had in relation
to this.
We would be very interested to hear about the rest of the community's
experience in relation to this, and I would recommend looking at your
underlying OSDs, Tim, to see whether there are any timeout or
uncorrectable errors. We would also be very eager to hear if these
approaches are sub-optimal and whether anyone else has any insight into
our problems. Sorry as well for resurrecting an old thread, but we
thought our experiences may be helpful for others!
On 19/09/2023 12:35, Tim Bishop wrote:
I've seen this issue mentioned in the past, but with older releases. So
I'm wondering if anybody has any pointers.
The Ceph cluster is running Pacific 16.2.13 on Ubuntu 20.04. Almost all
clients are working fine, with the exception of our backup server. This
is using the kernel CephFS client on Ubuntu 22.04 with kernel 6.2.0 
(so I suspect a newer Ceph version?).
The backup server has multiple (12) CephFS mount points. One of them,
the busiest, regularly causes this error on the cluster:
HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
mds.mds-server(mds.0): Client backupserver:cephfs-backupserver failing to respond to capability release client_id: 521306112
And occasionally, which may be unrelated, but occurs at the same time:
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
mds.mds-server(mds.0): 1 slow requests are blocked > 30 secs
The second one clears itself, but the first sticks until I can unmount
the filesystem on the client after the backup completes.
It appears that whilst it's in this stuck state there may be one or more
directory trees that are inaccessible to all clients. The backup server
is walking the whole tree but never gets stuck itself, so either the
inaccessible directory entry is caused after it has gone past, or it's
not affected. Maybe the backup server is holding a directory when it
It may be that an upgrade to Quincy resolves this, since it's more
likely to be inline with the kernel client version wise, but I don't
want to knee-jerk upgrade just to try and fix this problem.
Thanks for any advice.
 The reason for the newer kernel is that the backup performance from
CephFS was terrible with older kernels. This newer kernel does at least
resolve that issue.
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
Scientific Computing Officer
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge