Hello everyone,
We've been experiencing repeated slow ops on our Quincy CephFS clusters
(one 17.2.6 and the other 17.2.7) with our kernel client mounts
(Ceph 17.2.7 and version 4 Linux kernels on all clients). These seem to
originate from slow ops on OSDs despite the underlying hardware being
fine. Our 2 clusters are similar and are both Alma8 systems; more
specifically:
* Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD
OSDs storing the metadata and 432 spinning SATA disks storing the
bulk data in an EC pool (8 data shards and 2 parity shards) across
40 nodes. The whole cluster supports a single file system
with 1 active MDS and 2 standby ones.
* Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD
OSDs storing the metadata and 348 spinning SAS disks storing the
bulk data in EC pools (8 data shards and 2 parity shards). This
cluster houses multiple filesystems, each with its own dedicated
MDS, along with 3 communal standby ones.
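For reference, an 8+2 data pool of this shape would be created along
these lines (the profile/pool names and PG counts here are
illustrative, not our actual ones):

~$ ceph osd erasure-code-profile set ec_8_2 k=8 m=2 crush-failure-domain=host
~$ ceph osd pool create cephfs_data_ec 4096 4096 erasure ec_8_2
~$ ceph osd pool set cephfs_data_ec allow_ec_overwrites true  # needed for CephFS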
Nearly every day we find that we get the following health warnings:
MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST.
Along with these warnings, certain files and directories cannot be
stat-ed and any processes involving these files hang indefinitely. We
have been fixing this with the following steps (a rough sketch chaining
them together follows the list):
1. First, find the oldest blocked MDS op and the inode listed there:
~$ ceph tell mds.${my_mds} dump_blocked_ops 2> /dev/null | grep description
"description": "client_request(client.251247219:662 getattr
AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+0000
caller_uid=26983, caller_gid=26983)",
# inode/object of interest: 100922d1102
2. Second, find all the current clients that hold a cap for this
blocked inode in the faulty MDS' session list (i.e. ceph tell
mds.${my_mds} session ls --cap-dump) and then examine the client
that has held the cap the longest:
~$ ceph tell mds.${my_mds} session ls --cap-dump ...
2024-03-13T13:01:36: client.251247219
2024-03-13T12:50:28: client.245466949
3. Then, on the aforementioned oldest client, get the current ops in
flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc" files)
and find the op corresponding to the blocked inode along with the OSD
the I/O is going to:
root@client245466949 $ grep 100922d1102 /sys/kernel/debug/ceph/*/osdc
48366 osd79 2.249f8a51 2.a51s0
[79,351,232,179,107,195,323,14,128,167]/79
[79,351,232,179,107,195,323,14,128,167]/79 e374191
100922d1102.000000f5 0x400024 1 write
# osd causing errors is osd.79
4. Finally, restart this "hanging" OSD, after which ls and I/O on the
previously "stuck" files no longer hang.
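For anyone wanting to follow the same trail, here is a rough sketch
chaining the steps together. It assumes a non-cephadm deployment (hence
the ceph-osd@ systemd unit), readable osdc debug files on the client,
and a hypothetical MDS name; exact output formats may vary between Ceph
versions, so treat it as a starting point rather than a polished tool:

~$ my_mds=myfs-mds-a   # hypothetical MDS name
# 1. oldest blocked op -> inode number (e.g. 100922d1102)
~$ ino=$(ceph tell mds.${my_mds} dump_blocked_ops 2> /dev/null \
       | grep -oE '#0x[0-9a-f]+' | head -n 1 | sed 's/#0x//')
# 2. dump sessions with their caps; inspect by hand for the client
#    that has held a cap on ${ino} the longest
~$ ceph tell mds.${my_mds} session ls --cap-dump > /tmp/sessions.json
# 3. on that client, the second field of the matching osdc line is
#    the OSD the stuck op is destined for
root@client $ grep ${ino} /sys/kernel/debug/ceph/*/osdc | awk '{print $2}'
osd79
# 4. restart the offending OSD on its host
~$ systemctl restart ceph-osd@79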
Once we identify the OSD that the blocked inode is waiting on, we can
see in the system logs that the OSD has slow ops:
~$ systemctl --no-pager --full status ceph-osd@79
...
2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3
slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0
2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
...
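When an OSD is in this state, the stuck ops can also be inspected live
via its admin socket on the OSD's host, which shows what stage each op
is blocked at (the historic dump may be empty depending on your
osd_op_history settings):

~$ ceph daemon osd.79 dump_ops_in_flight
~$ ceph daemon osd.79 dump_historic_slow_ops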
Files that these "hanging" inodes correspond to, as well as the
directories housing them, can't be opened or stat-ed (causing directory
listings to hang). We've found restarting the OSD with slow ops to be
the least disruptive way of resolving this (compared with a forced
umount and then re-mount on the client). There are no issues with the
underlying hardware, either for the OSD reporting these slow ops or for
any other drive within the acting PG, and there seems to be no
correlation between which processes are involved or what type of files
these are.
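For completeness, these are the sorts of checks that come back clean
for us when ruling out the hardware (device names illustrative):

~$ ceph osd perf | sort -nk 2 | tail   # highest commit/apply latencies last
~$ smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'
~$ dmesg -T | grep -i error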
What could be causing these slow ops and certain files and directories
to "hang"? There are no workflows being performed that generate a large
number of small files, nor are there directories with a large number of
files within them. This happens across a wide range of hard drives, on
both SATA and SAS, and our nodes are interconnected with 25 Gb/s NICs,
so we can't see how the underlying hardware would be causing any I/O
bottlenecks. Has anyone else seen this type of behaviour before, and
does anyone have any ideas? Is there a way to stop these from
happening? We are having to resolve them nearly daily now and can't
seem to find a way to reduce them. We do use snapshots to back up our
cluster and have been doing so for ~6 months, but these issues have
only been occurring on and off for a couple of months, though much more
frequently now.
Kindest regards,
Ivan Clayson
--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH