MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST errors and slow osd_ops despite hardware being fine

Hello everyone,

We've been experiencing repeated slow ops on our Quincy CephFS clusters (one running 17.2.6 and the other 17.2.7) with our kernel client mounts (Ceph 17.2.7 and 4.x Linux kernels on all clients). The slow requests seem to originate from slow ops on OSDs despite the underlying hardware being fine. Our two clusters are similar and are both Alma8 systems. More specifically:

 * Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD
   OSDs storing the metadata and 432 spinning SATA disks storing the
   bulk data in an EC pool (8 data shards and 2 parity shards; see the
   example profile sketched after this list) across 40 nodes. The whole
   cluster is used to support a single file system with 1 active MDS
   and 2 standby ones.
 * Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD
   OSDs storing the metadata and 348 spinning SAS disks storing the
   bulk data in EC pools (8 data shards and 2 parity shards). This
   cluster houses multiple filesystems, each with its own dedicated
   MDS, along with 3 communal standby ones.
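
For context, both data pools use an 8+2 erasure-code layout. A profile along the following lines would produce that kind of pool; the profile/pool names, PG count, and failure domain here are placeholders rather than our exact settings:

~$ ceph osd erasure-code-profile set ec-8-2 k=8 m=2 crush-failure-domain=host
~$ ceph osd pool create cephfs_data 2048 2048 erasure ec-8-2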

Nearly every day we find that we get the following health warnings: MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST. Along with these warnings, certain files and directories cannot be stat-ed, and any processes involving these files hang indefinitely. We have been fixing this by (a consolidated sketch of the whole procedure follows the list):

   1. First, finding the oldest blocked MDS op and the inode listed there:

       ~$ ceph tell mds.${my_mds} dump_blocked_ops 2> /dev/null | grep description

           "description": "client_request(client.251247219:662 getattr
           AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+0000
           caller_uid=26983, caller_gid=26983)",

           # inode/object of interest: 100922d1102

   2. Second, finding all the current clients that have a cap for this
   blocked inode from the faulty MDS's session list (i.e. ceph tell
   mds.${my_mds} session ls --cap-dump) and then examining the client
   that has held the cap the longest:

       ~$ ceph tell mds.${my_mds} session ls --cap-dump ...

           2024-03-13T13:01:36: client.251247219

           2024-03-13T12:50:28: client.245466949

   3. Then, on the aforementioned oldest client, list the current ops in
   flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc" debugfs
   files) and find the op corresponding to the blocked inode along with
   the OSD the I/O is going to:

       root@client245466949 $ grep 100922d1102 /sys/kernel/debug/ceph/*/osdc

           48366  osd79 2.249f8a51  2.a51s0
           [79,351,232,179,107,195,323,14,128,167]/79
           [79,351,232,179,107,195,323,14,128,167]/79  e374191
           100922d1102.000000f5  0x400024  1 write

           # osd causing errors is osd.79

   4. Finally, we restart this "hanging" OSD, after which ls and I/O on
   the previously "stuck" files no longer hang.
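
For anyone wanting to reproduce the first three steps in one go, a rough sketch of them as a script is below. The MDS name is the only argument, the last step has to be run on the suspect client itself, and picking the truly oldest op and cap still needs a human eye, so treat this as an illustration rather than a finished tool:

    #!/bin/bash
    # Sketch: trace a blocked MDS op to the inode and OSD it is waiting on.
    # Usage: ./find_blocked_osd.sh <mds_name>
    set -euo pipefail
    mds_name=$1

    # 1. Show the blocked MDS ops (pick the oldest by its age field)
    ceph tell mds."${mds_name}" dump_blocked_ops 2> /dev/null \
        | grep '"description"'

    # Pull the inode number out of the first description (the hex after '#0x')
    ino=$(ceph tell mds."${mds_name}" dump_blocked_ops 2> /dev/null \
        | grep -o '#0x[0-9a-f]*' | head -n 1 | sed 's/#0x//')
    echo "blocked inode: ${ino}"

    # 2. Dump the sessions with their caps so the client that has held a
    #    cap on this inode the longest can be identified by hand
    ceph tell mds."${mds_name}" session ls --cap-dump 2> /dev/null \
        > /tmp/session_ls_capdump.json
    echo "cap dump written to /tmp/session_ls_capdump.json"

    # 3. On the suspect client (not here), list the in-flight osd_ops for
    #    that inode and the OSD they are queued on (needs debugfs mounted):
    #    grep "${ino}" /sys/kernel/debug/ceph/*/osdc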

Once we have identified the OSD that the blocked inode is waiting on, we can see in the system logs that this OSD has slow ops:

~$ systemctl --no-pager --full status ceph-osd@79

   ...
   2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3
   slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0
   2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
   ...
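
When this happens we also find it useful to ask the OSD itself what its slow ops are stuck on via the admin socket on that OSD's host. Here osd.79 is just the OSD from the example above, and the historic dump may be empty depending on the osd_op_history settings:

~$ ceph daemon osd.79 dump_ops_in_flight
~$ ceph daemon osd.79 dump_historic_slow_ops

   # the "flag_point" and "events" fields show where each op last made
   # progress, e.g. "waiting for sub ops" or "waiting for rw locks"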

The files that these "hanging" inodes correspond to, as well as the directories housing them, can't be opened or stat-ed (causing the directories themselves to hang). We've found restarting the OSD with the slow ops to be the least disruptive way of resolving this (compared with a forced umount and re-mount on the client). There are no issues with the underlying hardware, either for the OSD reporting the slow ops or for any other drive in the acting set of the PG, and there seems to be no correlation with the processes involved or the types of files affected.
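
For reference, the restart itself is just the usual systemd unit restart, after which we check that the warnings clear and that a previously stuck path can be stat-ed again. The OSD id is from the example above and the path is a placeholder:

~$ systemctl restart ceph-osd@79
~$ ceph health detail | grep -E 'MDS_|SLOW_OPS'
~$ stat /cephfs/some/previously/stuck/path    # placeholder path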

What could be causing these slow ops and making certain files and directories "hang"? There aren't any workflows generating large numbers of small files, nor are there directories containing large numbers of files. This happens across a wide range of hard drives, on both SATA and SAS, and our nodes are interconnected with 25 Gb/s NICs, so we can't see how the underlying hardware would be causing any I/O bottlenecks. Has anyone else seen this type of behaviour before and have any ideas? Is there a way to stop these from happening? We are having to fix them nearly daily now and can't seem to find a way to reduce them. We do use snapshots to back up our clusters and have been doing so for ~6 months, whereas these issues have only been occurring on and off for a couple of months, though much more frequently now.


Kindest regards,

Ivan Clayson

--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH



