What does the CPU and RAM usage look like for the OSDs? CPU has often been
our bottleneck, with the main thread hitting 100%. I've put a rough sketch
of the checks I'd run inline below, after your slow-ops log.

On Fri, Mar 15, 2024 at 9:15 AM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:
>
> Hello everyone,
>
> We've been experiencing repeated slow ops on our quincy CephFS clusters
> (one 17.2.6 and another 17.2.7) with our client kernel mounts (Ceph
> 17.2.7 and version 4 Linux kernels on all clients). These seem to
> originate from slow ops on OSDs despite the underlying hardware being
> fine. Our 2 clusters are similar and are both Alma8 systems, more
> specifically:
>
>  * Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD
>    OSDs storing the metadata and 432 spinning SATA disks storing the
>    bulk data in an EC pool (8 data shards and 2 parity shards) across
>    40 nodes. The whole cluster is used to support a single file system
>    with 1 active MDS and 2 standby ones.
>  * Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD
>    OSDs storing the metadata and 348 spinning SAS disks storing the
>    bulk data in EC pools (8 data shards and 2 parity shards). This
>    cluster houses multiple filesystems, each with its own dedicated
>    MDS, along with 3 communal standby ones.
>
> Nearly every day we find that we get the following health warnings:
> MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST.
> Along with these messages, certain files and directories cannot be
> stat-ed and any processes involving these files hang indefinitely. We
> have been fixing this by:
>
>  1. First, finding the oldest blocked MDS op and the inode listed there:
>
>     ~$ ceph tell mds.${my_mds} dump_blocked_ops 2>> /dev/null | grep description
>
>     "description": "client_request(client.251247219:662 getattr
>     AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+0000
>     caller_uid=26983, caller_gid=26983)",
>
>     # inode/object of interest: 100922d1102
>
>  2. Second, finding all the current clients that have a cap for this
>     blocked inode from the faulty MDS' session list (i.e. ceph tell
>     mds.${my_mds} session ls --cap-dump) and then examining the client
>     that has had the cap the longest:
>
>     ~$ ceph tell mds.${my_mds} session ls --cap-dump ...
>
>     2024-03-13T13:01:36: client.251247219
>
>     2024-03-13T12:50:28: client.245466949
>
>  3. Then, on the aforementioned oldest client, getting the current ops
>     in flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc"
>     files) and the op corresponding to the blocked inode, along with
>     the OSD the I/O is going to:
>
>     root@client245466949 $ grep 100922d1102 /sys/kernel/debug/ceph/*/osdc
>
>     48366 osd79 2.249f8a51 2.a51s0
>     [79,351,232,179,107,195,323,14,128,167]/79
>     [79,351,232,179,107,195,323,14,128,167]/79 e374191
>     100922d1102.000000f5 0x400024 1 write
>
>     # osd causing errors is osd.79
>
>  4. Finally, restarting this "hanging" OSD, after which ls and I/O on
>     the previously "stuck" files no longer hang.
>
> Once we identify the OSD that the blocked inode is waiting on, we can
> see in the system logs that the OSD has slow ops:
>
>     ~$ systemctl --no-pager --full status ceph-osd@79
>
>     ...
>     2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3
>     slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0
>     2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
>     ...
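[Replying inline here.] Before restarting the slow OSD, it may be worth
capturing what that OSD is actually busy with at that moment. Below is a
rough sketch of the checks I'd run, not a definitive procedure: it assumes
osd.79 is the OSD reporting slow ops (as in your example), that you run the
commands on the node hosting it so the admin socket is reachable, and that
the pgrep pattern matches how your ceph-osd processes are launched.

    # Per-thread CPU for the OSD process: look for a single thread pinned
    # near 100% (e.g. a tp_osd_tp op worker or bstore_kv_sync). Adjust the
    # pgrep pattern to match your ceph-osd command line.
    ~$ top -b -H -n 1 -p "$(pgrep -f 'ceph-osd .*--id 79 ')" | head -40

    # What the OSD thinks it is working on, and the slowest recent ops.
    ~$ ceph daemon osd.79 dump_ops_in_flight
    ~$ ceph daemon osd.79 dump_historic_slow_ops

    # Memory usage from the OSD's point of view.
    ~$ ceph daemon osd.79 dump_mempools

If one thread is pegged while the slow ops pile up, that points at CPU on
the OSD itself; if the in-flight ops are mostly sitting in events like
"waiting for subops from ...", the bottleneck may be another OSD in the
same acting set rather than the one you restart.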
>
> Files that these "hanging" inodes correspond to, as well as the
> directories housing them, can't be opened or stat-ed (which causes the
> directories to hang). We've found restarting the OSD with slow ops to be
> the least disruptive way of resolving this (compared with a forced
> umount and then re-mount on the client). There are no issues with the
> underlying hardware for either the OSD reporting the slow ops or any
> other drive within the acting PG, and there seems to be no correlation
> with the processes involved or the types of files affected.
>
> What could be causing these slow ops, and causing certain files and
> directories to "hang"? There aren't any workflows that generate a large
> number of small files, nor are there directories with a large number of
> files in them. This happens with a wide range of hard drives, both SATA
> and SAS, and our nodes are interconnected with 25 Gb/s NICs, so we can't
> see how the underlying hardware would be causing any I/O bottlenecks.
> Has anyone else seen this type of behaviour before, and does anyone have
> any ideas? Is there a way to stop these from happening? We are having to
> resolve them nearly daily now and can't seem to find a way to reduce
> them. We do use snapshots to back up our clusters and have been doing so
> for ~6 months, but these issues have only been occurring on and off for
> a couple of months, and much more frequently now.
>
>
> Kindest regards,
>
> Ivan Clayson
>
> --
> Ivan Clayson
> -----------------
> Scientific Computing Officer
> Room 2N249
> Structural Studies
> MRC Laboratory of Molecular Biology
> Francis Crick Ave, Cambridge
> CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx