Re: MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST errors and slow osd_ops despite hardware being fine

On Fri, Mar 15, 2024 at 6:15 AM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

> Hello everyone,
>
> We've been experiencing repeated slow ops on our Quincy CephFS clusters
> (one 17.2.6 and another 17.2.7) with our kernel client mounts
> (Ceph 17.2.7 and version 4 Linux kernels on all clients). These seem to
> originate from slow ops on OSDs despite the underlying hardware being
> fine. Our two clusters are similar and are both Alma8 systems; more
> specifically:
>
>   * Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD
>     OSDs storing the metadata and 432 spinning SATA disks storing the
>     bulk data in an EC pool (8 data shards and 2 parity blocks) across
>     40 nodes. The whole cluster is used to support a single file system
>     with 1 active MDS and 2 standby ones.
>   * Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD
>     OSDs storing the metadata and 348 spinning SAS disks storing the
>     bulk data in EC pools (8 data shards and 2 parity blocks). This
>     cluster houses multiple filesystems, each with its own dedicated
>     MDS, along with 3 communal standby ones.
>
> Nearly every day we find we get the following error messages:
> MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST.
> Along with these messages, certain files and directories cannot be
> stat-ed and any processes involving these files hang indefinitely. We
> have been fixing this with the following steps (a rough scripted sketch
> of them follows the list):
>
>     1. First, finding the oldest blocked MDS op and the inode listed there:
>
>         ~$ ceph tell mds.${my_mds} dump_blocked_ops 2> /dev/null | grep description
>
>             "description": "client_request(client.251247219:662 getattr
>             AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+0000
>             caller_uid=26983, caller_gid=26983)",
>
>             # inode / object of interest: 100922d1102
>
>     2. Second, finding all the current clients that have a cap for this
>     blocked inode from the faulty MDS's session list (i.e. ceph tell
>     mds.${my_mds} session ls --cap-dump) and then examining the client
>     that has held the cap the longest:
>
>         ~$ ceph tell mds.${my_mds} session ls --cap-dump ...
>
>             2024-03-13T13:01:36: client.251247219
>
>             2024-03-13T12:50:28: client.245466949
>
>     3. Then on the aforementioned oldest client, get the current ops in
>     flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc" files)
>     and find the op corresponding to the blocked inode along with the OSD
>     the I/O is going to:
>
>         root@client245466949 $ grep 100922d1102
>         /sys/kernel/debug/ceph/*/osdc
>
>             48366  osd79 2.249f8a51  2.a51s0
>             [79,351,232,179,107,195,323,14,128,167]/79
>             [79,351,232,179,107,195,323,14,128,167]/79  e374191
>             100922d1102.000000f5  0x400024  1 write
>
>             # osd causing errors is osd.79
>
>     4. Finally, we restart this "hanging" OSD, after which ls and I/O on
>     the previously "stuck" files no longer "hang".
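>
> As mentioned above, here is a rough scripted sketch of how these steps
> chain together. It is only a sketch: the MDS name is a placeholder, it
> assumes jq is installed and the usual {"ops": [...]} layout from the op
> dump, it simply takes the first blocked op rather than picking the oldest
> by hand, and steps 2-4 run on different hosts and still need human
> judgement before restarting anything:
>
>         # 1. Grab one blocked op's description and pull out the inode
>         #    number (the "#0x..." token), stripping the leading "#0x".
>         my_mds=mymds01    # placeholder for the affected MDS
>         ino=$(ceph tell mds.${my_mds} dump_blocked_ops 2> /dev/null \
>               | jq -r '.ops[0].description' \
>               | grep -oE '#0x[0-9a-f]+' | sed 's/#0x//')
>         echo "blocked inode: ${ino}"
>
>         # 2. Dump the session list with cap information and inspect it by
>         #    hand for the client that has held a cap on that inode longest.
>         ceph tell mds.${my_mds} session ls --cap-dump > /tmp/sessions.json 2> /dev/null
>
>         # 3. On that client itself, find the in-flight OSD op for the
>         #    inode; the second column is the OSD the I/O is stuck on.
>         grep "${ino}" /sys/kernel/debug/ceph/*/osdc
>
>         # 4. On the node carrying that OSD (osd.79 in the example above),
>         #    restart it once we're happy nothing else is wrong.
>         systemctl restart ceph-osd@79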
>
> Once we have identified the OSD that the blocked inode is waiting on, we
> can see in the system logs that the OSD has slow ops:
>
> ~$ systemctl --no-pager --full status ceph-osd@79
>
>     ...
>     2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3
>     slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0
>     2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
>     ...


Have you run dump_ops_in_flight on the OSD in question to see how far that
op got before getting stuck?
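
For reference, something along these lines (osd.79 standing in for whichever
OSD is holding the op; the "events" list in the dump shows per-stage
timestamps, so you can see where each op stalled):

    ceph daemon osd.79 dump_ops_in_flight      # on the OSD's host, via the admin socket
    ceph tell osd.79 dump_ops_in_flight        # or remotely
    ceph tell osd.79 dump_historic_ops         # recently completed ops, with timestamps per event
    ceph tell osd.79 dump_historic_slow_ops    # only the ones that crossed the slow-op threshold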

This is some kind of RADOS problem, which isn’t great, but I wonder if
we’ve exceeded some snapshot threshold that is showing up on hard drives as
slow ops, or if there’s a code bug that is just causing them to get lost. :/
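
Just a sketch of what I'd look at first, not a diagnosis: whether snaptrim
work is piling up on the data pool, whether the removed-snaps queue keeps
growing (IIRC it shows up in the pool detail output), and how hard snap
trimming is throttled on the HDD OSDs:

    ceph pg dump pgs_brief 2> /dev/null | grep -cE 'snaptrim'   # PGs currently in snaptrim / snaptrim_wait
    ceph osd pool ls detail                                     # check for a growing removed_snaps_queue
    ceph config get osd osd_snap_trim_sleep_hdd                 # current snaptrim throttle for HDD OSDs
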
-Greg



>
> Files that these "hanging" inodes correspond to as well as the
> directories housing these files can't be opened or stat-ed (causing
> directories to hang) where we've found restarting this OSD with slow ops
> to be the least disruptive way of resolving this (compared with a forced
> umount and then re-mount on the client). There are no issues with the
> underlying hardware for either the osd reporting these slow ops or any
> other drive within the acting PG and there seems to be no correlation
> between what processes are involved or what type of files these are.
>
> What could be causing these slow ops and certain files and directories
> to "hang"? There are no workflows that generate a large number of small
> files, nor are there directories containing a large number of files.
> This happens with a wide range of hard drives, both SATA and SAS, and
> our nodes are interconnected with 25 Gb/s NICs, so we can't see how the
> underlying hardware would be causing any I/O bottlenecks. Has anyone
> else seen this type of behaviour before and have any ideas? Is there a
> way to stop these from happening, as we are having to resolve them
> nearly daily now and can't seem to find a way to reduce them? We do use
> snapshots to back up our cluster and have been doing so for ~6 months,
> but these issues have only been occurring on and off for a couple of
> months, and much more frequently now.
>
>
> Kindest regards,
>
> Ivan Clayson
>
> --
> Ivan Clayson
> -----------------
> Scientific Computing Officer
> Room 2N249
> Structural Studies
> MRC Laboratory of Molecular Biology
> Francis Crick Ave, Cambridge
> CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



