Hello Gregory and Nathan,
Having a look at our resource utilization, there doesn't seem to be a
CPU or memory bottleneck: there is plenty of both available on the
host with the blocked OSD as well as on the MDS's host.
We've had a repeat of this problem today where the OSD logging slow
ops did not have any ops in flight despite: (i) being in an active
state, (ii) clients requesting I/O from this OSD, and (iii) the MDS
reporting that it was unable to get an rdlock. The blocked op reported
by the MDS was initially related to our backups (though it isn't
always): the backup takes a snapshot every night, we copy the snapshot
to another Ceph cluster, and we then delete the snapshot once it has
been backed up.
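For clarity, the nightly cycle is essentially the following (a minimal
sketch: the paths, the snapshot name and the use of rsync against a mount
of the second cluster are placeholders for illustration rather than our
exact tooling; the mkdir/rmdir in .snap is the standard CephFS snapshot
mechanism):
~$ SNAP="nightly-$(date +%F)"                        # hypothetical snapshot name
~$ mkdir /cephfs/data/.snap/"$SNAP"                  # take the CephFS snapshot
~$ rsync -a /cephfs/data/.snap/"$SNAP"/ /mnt/backup-cluster/data/   # copy to the backup cluster (placeholder mount)
~$ rmdir /cephfs/data/.snap/"$SNAP"                  # remove the snapshot once the copy is done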
# the blocked op on the MDS is related to backing up a snapshot of the file $FILE:
~$ ceph tell mds.0 dump_blocked_ops
"description": "client_request(client.90018803:265050
getattr AsLsXsFs #0x100004e2452//1710815659/... ... caller_uid=...,
caller_gid=...)"
"initiated_at": "...",
"age": ...,
"duration": ...,
"type_data": {
"flag_point": "failed to rdlock, waiting",
...
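(As an aside, a quick way to tally these, assuming the dump keeps the
"ops"/"type_data"/"flag_point" layout shown above and that jq is available:
~$ ceph tell mds.0 dump_blocked_ops 2>/dev/null | jq -r '.ops[].type_data.flag_point' | sort | uniq -c
which counts the blocked ops per flag_point.)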
~$ ls -lrt $FILE
.... # ls -lrt hangs on a statx syscall on the file, and this then shows up as another blocked op in the MDS op list
~$ ceph tell mds.0 dump_blocked_ops
....
client_request(client.91265572:7 getattr AsLsFs #0x1002fe5d755 ... caller_uid=..., caller_gid=...)
root@client-who-has-held-active-cap-for-1002fe5d755-the-longest ~$ grep 1002fe5d755 /sys/kernel/debug/ceph/*/osdc
1652 osd7 3.3519a4ff 3.4ffs0 [7,132,61,143,109,98,18,44,269,238]/7 [7,132,61,143,109,98,18,44,269,238]/7 e159072 1002fe5d755.00000011 0x400024 1 write
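(For reference, the same object-to-PG/primary mapping can be confirmed from
the cluster side with `ceph osd map`; the data pool name below is a placeholder
for our EC data pool:
~$ ceph osd map <ec-data-pool> 1002fe5d755.00000011
which should print pg 3.4ff with osd.7 as the primary, matching the osdc entry above.)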
~$ systemctl status --no-pager --full ceph-osd@7
....
ceph-osd[1184036]: osd.7 158945 get_health_metrics reporting 8 slow ops, oldest is osd_op(client.90099026.0:4068839 3.4ffs0 3:ff28a5a4:::1002feddfaa.00000000:head [write 0~4194304 [1@-1] in=4194304b] snapc 6af1=[] ondisk+write+known_if_redirected e158942)
ceph-osd[1184036]: osd.7 158945 get_health_metrics reporting 6 slow ops, oldest is osd_op(client.90099026.0:4068839 3.4ffs0 3:ff28a5a4:::1002feddfaa.00000000:head [write 0~4194304 [1@-1] in=4194304b] snapc 6af1=[] ondisk+write+known_if_redirected e158942)
There was nothing in dmesg and nothing wrong with the HDD for osd.7 (or
any other drive for that matter), and osd.7 itself reported no blocked
ops or ops in flight when queried via `ceph tell`. However, looking at
the historic slow ops, the oldest one still saved relates to this stuck
$FILE object (1002fe5d755), and roughly half of the recorded historic
slow ops are for this same PG, all of them occurring around the time
the OSD slow ops started:
~$ ceph tell osd.7 dump_historic_slow_ops
"description": "osd_op(client.89624569.0:1151567 3.4ffs0
3:ff27ac5d:::1002fea2a90.0000000a:head [write 0~4194304] snapc
6ab9=[6ab9] ondisk+write+known_if_redirected e158919)",
"initiated_at": "...",
"age": ...,
"duration": ...,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
...
{
"event": "header_read",
"time": "2024-03-18T19:43:41.875596+0000",
"duration": 4294967295.9999967
},
I've highlighted this header_read duration as it apparently took ~136
years(!). The value 4294967295 is 2^32 - 1 seconds, which looks more
like an unsigned 32-bit wrap-around in the timestamp than a real delay,
so there seems to be something off, maybe in the messenger layer.
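(For anyone wanting to reproduce the rough per-PG breakdown mentioned above,
something along these lines gives a quick tally, assuming the dump has an
"ops" array like the excerpt and that each description starts with
"osd_op(<client> <pgid> ..." as the ones here do:
~$ ceph tell osd.7 dump_historic_slow_ops | jq -r '.ops[].description' | awk '{print $2}' | sort | uniq -c | sort -rn
which prints how many of the saved slow ops fall on each PG.)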
I would be eager to hear your thoughts on this, as it seems that after
a while the OSD "forgets" about this slow op and stops reporting it in
the log. I'm also curious whether this could be related to the number
of snapshots we have: we remove each snapshot on this filesystem once
it has been copied over to the backup system, but could snapshots still
cause problems here, or are there known issues with snaps?
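(On the snapshot front, one number we can check ourselves is the per-PG
snap-trim backlog; recent releases expose a snaptrimq_len per PG, so
something along these lines should give a rough picture, though the JSON
layout may differ between versions:
~$ ceph pg dump --format json 2>/dev/null | jq -r '.pg_map.pg_stats[] | select(.snaptrimq_len > 0) | "\(.pgid) \(.snaptrimq_len)"' | sort -k2 -rn | head
listing the PGs with the longest snap-trim queues.)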
Kindest regards,
Ivan
On 15/03/2024 18:07, Gregory Farnum wrote:
On Fri, Mar 15, 2024 at 6:15 AM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:
Hello everyone,
We've been experiencing on our quincy CephFS clusters (one 17.2.6 and
another 17.2.7) repeated slow ops with our client kernel mounts
(Ceph 17.2.7 and version 4 Linux kernels on all clients) that seem to
originate from slow ops on OSDs despite the underlying hardware being
fine. Our 2 clusters are similar and are both Alma8 systems where more
specifically:
* Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD
OSDs storing the metadata and 432 spinning SATA disks storing the
bulk data in an EC pool (8 data shards and 2 parity blocks) across
40 nodes. The whole cluster is used to support a single file system
with 1 active MDS and 2 standby ones.
* Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD
OSDs storing the metadata and 348 spinning SAS disks storing the
bulk data in EC pools (8 data shards and 2 parity blocks). This
cluster houses multiple filesystems, each with their own dedicated
MDS along with 3 communal standby ones.
Nearly every day we find that we get the following error messages:
MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST.
Along with these messages, certain files and directories cannot be
stat-ed and any processes involving these files hang indefinitely. We
have been fixing this by:
1. First, finding the oldest blocked MDS op and the inode listed there:
~$ ceph tell mds.${my_mds} dump_blocked_ops 2>> /dev/null | grep -c description
"description": "client_request(client.251247219:662 getattr AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+0000 caller_uid=26983, caller_gid=26983)",
# inode/object of interest: 100922d1102
2. Second, finding all the current clients that have a cap for this
blocked inode from the faulty MDS' session list (i.e. ceph tell
mds.${my_mds} session ls --cap-dump) and then examining the client
who has held the cap the longest:
~$ ceph tell mds.${my_mds} session ls --cap-dump ...
2024-03-13T13:01:36: client.251247219
2024-03-13T12:50:28: client.245466949
3. Then, on the aforementioned oldest client, getting the current ops
in flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc" files)
and finding the op corresponding to the blocked inode along with the
OSD the I/O is going to:
root@client245466949 $ grep 100922d1102 /sys/kernel/debug/ceph/*/osdc
48366 osd79 2.249f8a51 2.a51s0 [79,351,232,179,107,195,323,14,128,167]/79 [79,351,232,179,107,195,323,14,128,167]/79 e374191 100922d1102.000000f5 0x400024 1 write
# osd causing errors is osd.79
4. Finally, we restart this "hanging" OSD, which results in ls and
I/O on the previously "stuck" files no longer hanging.
Once we have identified the OSD that the blocked inode is waiting on,
we can see in the system logs that the OSD has slow ops:
~$ systemctl --no-pager --full status ceph-osd@79
...
2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0 2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
...
Have you dumped_ops_in_flight on the OSD in question to see how far
that op got before getting stuck?
This is some kind of RADOS problem, which isn’t great, but I wonder
if we’ve exceeded some snapshot threshold that is showing up on hard
drives as slow ops, or if there’s a code bug that is just causing
them to get lost. :/
-Greg
Files that these "hanging" inodes correspond to as well as the
directories housing these files can't be opened or stat-ed (causing
directories to hang) where we've found restarting this OSD with
slow ops
to be the least disruptive way of resolving this (compared with a
forced
umount and then re-mount on the client). There are no issues with
the
underlying hardware for either the osd reporting these slow ops or
any
other drive within the acting PG and there seems to be no
correlation
between what processes are involved or what type of files these are.
What could be causing these slow ops and certain files and
directories
to "hang"? There aren't workflows being performed that generate a
large
number of small files nor are there directories with a large
number of
files within them. This seems to happen with a wide range of
hard-drives
and we see this on SATA and SAS type drives where our nodes are
interconnected with 25 Gb/s NICs so we can't see how the underlying
hardware would be causing any I/O bottlenecks. Has anyone else
seen this
type of behaviour before and have any ideas? Is there a way to stop
these from happening as we are having to solve these nearly daily
now
and we can't seem to find a way to reduce them. We do use
snapshots to
backup our cluster where we've been doing this for ~6 months and
these
issues have only been occurring on and off for a couple of months
but
much more frequently now.
Kindest regards,
Ivan Clayson
-- Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx