Hello everyone,
We've been experiencing repeated slow ops on our Quincy CephFS clusters
(one 17.2.6 and the other 17.2.7) with our kernel client mounts
(Ceph 17.2.7 and version 4 Linux kernels on all clients). These seem to
originate from slow ops on OSDs despite the underlying hardware being
fine. Our 2 clusters are similar and are both Alma8 systems; more
specifically:
* Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD
OSDs storing the metadata and 432 spinning SATA disks storing the
bulk data in an EC pool (8 data shards and 2 parity shards) across
40 nodes. The whole cluster supports a single file system
with 1 active MDS and 2 standby ones.
* Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD
OSDs storing the metadata and 348 spinning SAS disks storing the
bulk data in EC pools (8 data shards and 2 parity shards). This
cluster houses multiple filesystems, each with its own dedicated
MDS, along with 3 communal standby ones.
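For reference, an 8+2 data pool of this shape would be created along
these lines (the profile/pool names and PG counts here are
illustrative, not our actual ones):

~$ ceph osd erasure-code-profile set ec_8_2 k=8 m=2 crush-failure-domain=host
~$ ceph osd pool create cephfs_data_ec 4096 4096 erasure ec_8_2
~$ ceph osd pool set cephfs_data_ec allow_ec_overwrites true  # needed for CephFS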
Nearly every day we find that we get the following health warnings:
MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST.
Along with these warnings, certain files and directories cannot be
stat-ed and any processes involving these files hang indefinitely. We
have been fixing this with the following steps (a rough sketch chaining
them together follows the list):
1. First, find the oldest blocked MDS op and the inode listed there:
~$ ceph tell mds.${my_mds} dump_blocked_ops 2> /dev/null | grep description
"description": "client_request(client.251247219:662 getattr
AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+0000
caller_uid=26983, caller_gid=26983)",
# inode/object of interest: 100922d1102
2. Second, find all the current clients that hold a cap for this
blocked inode in the faulty MDS' session list (i.e. ceph tell
mds.${my_mds} session ls --cap-dump) and then examine the client
that has held the cap the longest:
~$ ceph tell mds.${my_mds} session ls --cap-dump ...
2024-03-13T13:01:36: client.251247219
2024-03-13T12:50:28: client.245466949
3. Then, on the aforementioned oldest client, get the current ops in
flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc" files)
and find the op corresponding to the blocked inode along with the OSD
the I/O is going to:
root@client245466949 $ grep 100922d1102 /sys/kernel/debug/ceph/*/osdc
48366 osd79 2.249f8a51 2.a51s0
[79,351,232,179,107,195,323,14,128,167]/79
[79,351,232,179,107,195,323,14,128,167]/79 e374191
100922d1102.000000f5 0x400024 1 write
# osd causing errors is osd.79
4. Finally, restart this "hanging" OSD, after which ls and I/O on the
previously "stuck" files no longer hang.
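For anyone wanting to follow the same trail, here is a rough sketch
chaining the steps together. It assumes a non-cephadm deployment (hence
the ceph-osd@ systemd unit), readable osdc debug files on the client,
and a hypothetical MDS name; exact output formats may vary between Ceph
versions, so treat it as a starting point rather than a polished tool:

~$ my_mds=myfs-mds-a   # hypothetical MDS name
# 1. oldest blocked op -> inode number (e.g. 100922d1102)
~$ ino=$(ceph tell mds.${my_mds} dump_blocked_ops 2> /dev/null \
       | grep -oE '#0x[0-9a-f]+' | head -n 1 | sed 's/#0x//')
# 2. dump sessions with their caps; inspect by hand for the client
#    that has held a cap on ${ino} the longest
~$ ceph tell mds.${my_mds} session ls --cap-dump > /tmp/sessions.json
# 3. on that client, the second field of the matching osdc line is
#    the OSD the stuck op is destined for
root@client $ grep ${ino} /sys/kernel/debug/ceph/*/osdc | awk '{print $2}'
osd79
# 4. restart the offending OSD on its host
~$ systemctl restart ceph-osd@79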
Once we identify the OSD that the blocked inode is waiting on, we can
see in the system logs that the OSD has slow ops:
~$ systemctl --no-pager --full status ceph-osd@79
...
2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3
slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0
2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
...
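When an OSD is in this state, the stuck ops can also be inspected live
via its admin socket on the OSD's host, which shows what stage each op
is blocked at (the historic dump may be empty depending on your
osd_op_history settings):

~$ ceph daemon osd.79 dump_ops_in_flight
~$ ceph daemon osd.79 dump_historic_slow_ops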
Files that these "hanging" inodes correspond to, as well as the
directories housing them, can't be opened or stat-ed (causing directory
listings to hang). We've found restarting the OSD with slow ops to be
the least disruptive way of resolving this (compared with a forced
umount and then re-mount on the client). There are no issues with the
underlying hardware, either for the OSD reporting these slow ops or for
any other drive within the acting PG, and there seems to be no
correlation between which processes are involved or what type of files
these are.
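For completeness, these are the sorts of checks that come back clean
for us when ruling out the hardware (device names illustrative):

~$ ceph osd perf | sort -nk 2 | tail   # highest commit/apply latencies last
~$ smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'
~$ dmesg -T | grep -i error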
What could be causing these slow ops and certain files and directories
to "hang"? There are no workflows being performed that generate a large
number of small files, nor are there directories with a large number of
files within them. This happens across a wide range of hard drives, on
both SATA and SAS, and our nodes are interconnected with 25 Gb/s NICs,
so we can't see how the underlying hardware would be causing any I/O
bottlenecks. Has anyone else seen this type of behaviour before, and
does anyone have any ideas? Is there a way to stop these from
happening? We are having to resolve them nearly daily now and can't
seem to find a way to reduce them. We do use snapshots to back up our
cluster and have been doing so for ~6 months, but these issues have
only been occurring on and off for a couple of months, though much more
frequently now.
Kindest regards,
Ivan Clayson
--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH