Hello Gregory and Nathan,
Having a look at our resource utilization, there doesn't seem to be a
CPU or memory bottleneck: there is plenty of both available on the
host with the blocked OSD as well as on the MDS's host.
We've had a repeat of this problem today where the OSD logging slow
ops did not have any ops in flight despite: (i) being in an active
state, (ii) clients requesting I/O from this OSD, and (iii) the MDS
reporting that it was unable to get an rdlock. The blocked op reported
by the MDS was initially related to our backups (though it isn't
always): the backup takes a snapshot every night, we copy the snapshot
to another Ceph cluster, and we then delete the snapshot once it has
been backed up.
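For clarity, the nightly cycle is essentially the following (a minimal
sketch: the paths, the snapshot name and the use of rsync against a mount
of the second cluster are placeholders for illustration rather than our
exact tooling; the mkdir/rmdir in .snap is the standard CephFS snapshot
mechanism):
~$ SNAP="nightly-$(date +%F)"                        # hypothetical snapshot name
~$ mkdir /cephfs/data/.snap/"$SNAP"                  # take the CephFS snapshot
~$ rsync -a /cephfs/data/.snap/"$SNAP"/ /mnt/backup-cluster/data/   # copy to the backup cluster (placeholder mount)
~$ rmdir /cephfs/data/.snap/"$SNAP"                  # remove the snapshot once the copy is done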
# the blocked op on the MDS is related to backing up a snapshot of the file $FILE:
~$ ceph tell mds.0 dump_blocked_ops
"description": "client_request(client.90018803:265050
getattr AsLsXsFs #0x100004e2452//1710815659/... ... caller_uid=...,
caller_gid=...)"
"initiated_at": "...",
"age": ...,
"duration": ...,
"type_data": {
"flag_point": "failed to rdlock, waiting",
...
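(As an aside, a quick way to tally these, assuming the dump keeps the
"ops"/"type_data"/"flag_point" layout shown above and that jq is available:
~$ ceph tell mds.0 dump_blocked_ops 2>/dev/null | jq -r '.ops[].type_data.flag_point' | sort | uniq -c
which counts the blocked ops per flag_point.)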
~$ ls -lrt $FILE
.... # ls -lrt hangs on a statx syscall on the file, and this then shows up as another blocked op in the MDS op list
~$ ceph tell mds.0 dump_blocked_ops
....
client_request(client.91265572:7 getattr AsLsFs #0x1002fe5d755 ... caller_uid=..., caller_gid=...)
root@client-who-has-held-active-cap-for-1002fe5d755-the-longest ~$ grep 1002fe5d755 /sys/kernel/debug/ceph/*/osdc
1652 osd7 3.3519a4ff 3.4ffs0 [7,132,61,143,109,98,18,44,269,238]/7 [7,132,61,143,109,98,18,44,269,238]/7 e159072 1002fe5d755.00000011 0x400024 1 write
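(For reference, the same object-to-PG/primary mapping can be confirmed from
the cluster side with `ceph osd map`; the data pool name below is a placeholder
for our EC data pool:
~$ ceph osd map <ec-data-pool> 1002fe5d755.00000011
which should print pg 3.4ff with osd.7 as the primary, matching the osdc entry above.)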
~$ systemctl status --no-pager --full ceph-osd@7
....
ceph-osd[1184036]: osd.7 158945 get_health_metrics reporting 8 slow ops, oldest is osd_op(client.90099026.0:4068839 3.4ffs0 3:ff28a5a4:::1002feddfaa.00000000:head [write 0~4194304 [1@-1] in=4194304b] snapc 6af1=[] ondisk+write+known_if_redirected e158942)
ceph-osd[1184036]: osd.7 158945 get_health_metrics reporting 6 slow ops, oldest is osd_op(client.90099026.0:4068839 3.4ffs0 3:ff28a5a4:::1002feddfaa.00000000:head [write 0~4194304 [1@-1] in=4194304b] snapc 6af1=[] ondisk+write+known_if_redirected e158942)
There was nothing in dmesg and nothing wrong with the HDD for osd.7 (or
any other drive for that matter), and osd.7 itself reported no blocked
ops or ops in flight when queried via `ceph tell`. However, looking at
the historic slow ops, the oldest one still saved relates to this stuck
$FILE object (1002fe5d755), and roughly half of the recorded historic
slow ops are for this same PG, all of them occurring around the time
the OSD slow ops started:
~$ ceph tell osd.7 dump_historic_slow_ops
"description": "osd_op(client.89624569.0:1151567 3.4ffs0
3:ff27ac5d:::1002fea2a90.0000000a:head [write 0~4194304] snapc
6ab9=[6ab9] ondisk+write+known_if_redirected e158919)",
"initiated_at": "...",
"age": ...,
"duration": ...,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
...
{
"event": "header_read",
"time": "2024-03-18T19:43:41.875596+0000",
"duration": 4294967295.9999967
},
I've highlighted this header_read duration as it apparently took ~136
years(!). The value 4294967295 is 2^32 - 1 seconds, which looks more
like an unsigned 32-bit wrap-around in the timestamp than a real delay,
so there seems to be something off, maybe in the messenger layer.
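(For anyone wanting to reproduce the rough per-PG breakdown mentioned above,
something along these lines gives a quick tally, assuming the dump has an
"ops" array like the excerpt and that each description starts with
"osd_op(<client> <pgid> ..." as the ones here do:
~$ ceph tell osd.7 dump_historic_slow_ops | jq -r '.ops[].description' | awk '{print $2}' | sort | uniq -c | sort -rn
which prints how many of the saved slow ops fall on each PG.)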
I would be eager to hear your thoughts on this, as it seems that after
a while the OSD "forgets" about this slow op and stops reporting it in
the log. I'm also curious whether this could be related to the number
of snapshots we have: we remove each snapshot on this filesystem once
it has been copied over to the backup system, but could snapshots still
cause problems here, or are there known issues with snaps?
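(On the snapshot front, one number we can check ourselves is the per-PG
snap-trim backlog; recent releases expose a snaptrimq_len per PG, so
something along these lines should give a rough picture, though the JSON
layout may differ between versions:
~$ ceph pg dump --format json 2>/dev/null | jq -r '.pg_map.pg_stats[] | select(.snaptrimq_len > 0) | "\(.pgid) \(.snaptrimq_len)"' | sort -k2 -rn | head
listing the PGs with the longest snap-trim queues.)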
Kindest regards,
Ivan
On 15/03/2024 18:07, Gregory Farnum wrote:
On Fri, Mar 15, 2024 at 6:15 AM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:
Hello everyone,
We've been experiencing on our quincy CephFS clusters (one 17.2.6 and
another 17.2.7) repeated slow ops with our client kernel mounts
(Ceph 17.2.7 and version 4 Linux kernels on all clients) that seem to
originate from slow ops on OSDs despite the underlying hardware being
fine. Our 2 clusters are similar and are both Alma8 systems where more
specifically:
* Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD
OSDs storing the metadata and 432 spinning SATA disks storing the
bulk data in an EC pool (8 data shards and 2 parity blocks) across
40 nodes. The whole cluster is used to support a single file system
with 1 active MDS and 2 standby ones.
* Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD
OSDs storing the metadata and 348 spinning SAS disks storing the
bulk data in EC pools (8 data shards and 2 parity blocks). This
cluster houses multiple filesystems, each with their own dedicated
MDS along with 3 communal standby ones.
Nearly every day we find that we get the following error messages:
MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST.
Along with these messages, certain files and directories cannot be
stat-ed and any processes involving these files hang indefinitely. We
have been fixing this by:
1. First, finding the oldest blocked MDS op and the inode listed there:
~$ ceph tell mds.${my_mds} dump_blocked_ops 2>> /dev/null | grep -c description
"description": "client_request(client.251247219:662 getattr AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+0000 caller_uid=26983, caller_gid=26983)",
# inode/object of interest: 100922d1102
2. Second, finding all the current clients that have a cap for this
blocked inode from the faulty MDS' session list (i.e. ceph tell
mds.${my_mds} session ls --cap-dump) and then examining the client
who has held the cap the longest:
~$ ceph tell mds.${my_mds} session ls --cap-dump ...
2024-03-13T13:01:36: client.251247219
2024-03-13T12:50:28: client.245466949
3. Then, on the aforementioned oldest client, getting the current ops
in flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc" files)
and finding the op corresponding to the blocked inode along with the
OSD the I/O is going to:
root@client245466949 $ grep 100922d1102 /sys/kernel/debug/ceph/*/osdc
48366 osd79 2.249f8a51 2.a51s0 [79,351,232,179,107,195,323,14,128,167]/79 [79,351,232,179,107,195,323,14,128,167]/79 e374191 100922d1102.000000f5 0x400024 1 write
# osd causing errors is osd.79
4. Finally, we restart this "hanging" OSD, which results in ls and
I/O on the previously "stuck" files no longer hanging.
Once we have identified the OSD that the blocked inode is waiting on,
we can see in the system logs that the OSD has slow ops:
~$ systemctl --no-pager --full status ceph-osd@79
...
2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0 2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
...
Have you dumped_ops_in_flight on the OSD in question to see how far
that op got before getting stuck?
This is some kind of RADOS problem, which isn’t great, but I wonder
if we’ve exceeded some snapshot threshold that is showing up on hard
drives as slow ops, or if there’s a code bug that is just causing
them to get lost. :/
-Greg
Files that these "hanging" inodes correspond to as well as the
directories housing these files can't be opened or stat-ed (causing
directories to hang) where we've found restarting this OSD with
slow ops
to be the least disruptive way of resolving this (compared with a
forced
umount and then re-mount on the client). There are no issues with
the
underlying hardware for either the osd reporting these slow ops or
any
other drive within the acting PG and there seems to be no
correlation
between what processes are involved or what type of files these are.
What could be causing these slow ops and certain files and
directories
to "hang"? There aren't workflows being performed that generate a
large
number of small files nor are there directories with a large
number of
files within them. This seems to happen with a wide range of
hard-drives
and we see this on SATA and SAS type drives where our nodes are
interconnected with 25 Gb/s NICs so we can't see how the underlying
hardware would be causing any I/O bottlenecks. Has anyone else
seen this
type of behaviour before and have any ideas? Is there a way to stop
these from happening as we are having to solve these nearly daily
now
and we can't seem to find a way to reduce them. We do use
snapshots to
backup our cluster where we've been doing this for ~6 months and
these
issues have only been occurring on and off for a couple of months
but
much more frequently now.
Kindest regards,
Ivan Clayson
-- Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx