Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

Hi Chris,

I also have serious problems identifying problematic ceph-fs clients (using mimic). I don't think that even the newest ceph version has useful counters for that. Just last week a client caused an all-time peak in cluster load and I was not able to locate it due to the lack of useful rate counters. There are two problems with ceph fs' load monitoring: the complete lack of rate-based IO load counters down to client+PID level, and the fact that the warnings that are generated actually flag the wrong clients.

The second problem is basically explained in this thread, specifically in this message:

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/TWNF2PWM7SONLCT4OLAJLMLXHK3ABPUB/

It states that warnings are generated for *inactive* clients, not for the clients that are actually causing the trouble. Worse yet, the proposed solution works against the other problem, namely that MDS cap recall is usually way too slow. I had to increase the recall value to 64K just to get the MDS cache balanced, because MDSes have no concept of rate-limiting clients that go bonkers. The effect is that the MDS punishes all other clients because of a single rogue client instead of rate-limiting the bad one.

The first problem is essentially that useful IO rate counters are missing, for example, per-client rates at which caps are acquired and released. What I would really love to see are warnings for "clients acquiring caps much faster than releasing" (with client ID and PID) and MDS-side rate-balancing that essentially throttles such aggressive clients. Every client holding more than, say, 2*max-caps caps should be throttled so that its caps-acquire rate equals its caps-release rate. I also don't understand why the MDS does not go after the rich clients first. I get warnings all the time that a client with 4000 caps is not releasing fast enough while some fat cats sit on millions and are never flagged as problematic. Why is the recall rate not proportional to the number of caps a client holds?
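For lack of such counters I have been approximating them from the outside. Below is a minimal sketch of the idea, assuming the MDS admin socket command "session ls" returns JSON with "id", "num_caps" and "client_metadata" (field names may differ between releases) and that you adjust the daemon name to your setup: sample twice, diff num_caps, and flag the net acquirers.

#!/usr/bin/env python3
# Rough userspace approximation of the per-client caps-rate counter that is
# missing: sample "num_caps" per session twice and report the net rate.
# Assumptions (not verified against every release): the MDS admin socket
# command "session ls" exists and returns JSON with "id", "num_caps" and
# "client_metadata". Run on the host of the active MDS (the admin socket is
# local); the daemon name below is a placeholder.
import json
import subprocess
import time

MDS = "mds.my-active-mds"  # placeholder, change to your MDS daemon name
INTERVAL = 60              # seconds between samples

def sample():
    out = subprocess.check_output(["ceph", "daemon", MDS, "session", "ls"])
    return {s["id"]: s for s in json.loads(out)}

before = sample()
time.sleep(INTERVAL)
after = sample()

for cid, sess in after.items():
    prev = before.get(cid)
    if prev is None:
        continue
    delta = sess.get("num_caps", 0) - prev.get("num_caps", 0)
    rate = delta / INTERVAL
    if rate > 0:  # net acquirer: grabbing caps faster than it releases them
        meta = sess.get("client_metadata", {})
        print(f"client.{cid} {meta.get('hostname', '?')} "
              f"net +{rate:.1f} caps/s (now {sess.get('num_caps', 0)})")

It only catches clients whose cap count is still growing, but that is exactly the "acquiring much faster than releasing" case I would like the MDS itself to warn about.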

Another missing counter is an actual IO rate counter. MDS requests are in no way indicative of a client's IO activity: once a client has the caps for a file it talks to the OSDs directly, and this communication is not reflected in any counter I'm aware of. To return to my case above, I had clients with more than 50K average load requests, but these were completely harmless (probably served from local cache). The MDS did not show any unusual behaviour like a growing cache. Everything looked normal except for the OSD server load, which skyrocketed to unprecedented levels due to some client's IO requests.

It must have been small random IO, and the only way to identify such clients at the moment is to look at network packet traffic. Unfortunately, our network monitoring has a few blind spots and I was not able to find out which client was bombarding the OSDs with a packet storm. Proper IO rate counters down to PID level, together with appropriate warnings about aggressive clients, would really help and are dearly missed.
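The closest I have come to attributing OSD load to clients is scraping the OSD op history and counting recent ops per client id. A crude sketch along those lines, assuming "dump_historic_ops" is available on the OSD admin socket and that its "description" strings contain the client id as "client.<id>" (run it on the hosts that own the busy OSDs; the OSD list is just an example):

#!/usr/bin/env python3
# Crude attempt at attributing OSD load to CephFS clients: count recent ops
# per client id from the OSD op history. Assumptions: "dump_historic_ops"
# is available on the local OSD admin sockets and its "description" strings
# embed the client id as "client.<id>"; the OSD list below is an example.
import json
import re
import subprocess
from collections import Counter

OSDS = ["osd.0", "osd.1", "osd.2"]   # the OSDs whose load spiked
CLIENT_RE = re.compile(r"client\.(\d+)")

counts = Counter()
for osd in OSDS:
    out = subprocess.check_output(["ceph", "daemon", osd, "dump_historic_ops"])
    for op in json.loads(out).get("ops", []):
        m = CLIENT_RE.search(op.get("description", ""))
        if m:
            counts[m.group(1)] += 1

for cid, n in counts.most_common(10):
    print(f"client.{cid}: {n} recent ops")

It is only a sample of recent slow-ish ops, not a real rate counter, which is precisely why I would like to see this built into the daemons.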

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Chris Smart <distroguy@xxxxxxxxx>
Sent: 14 August 2022 05:47:12
To: ceph-users@xxxxxxx
Subject:  What is client request_load_avg? Troubleshooting MDS issues on Luminous

Hi all,

I have recently inherited a 10-node Ceph cluster running Luminous (12.2.12), used specifically for CephFS (and I don't know much about MDS), with only one active MDS server (two standby).
It's not a great cluster IMO: the cephfs_data pool is on high-density nodes with high-capacity SATA drives, but at least the cephfs_metadata pool is on NVMe drives.

Access to the cluster regularly goes slow for clients and I'm seeing lots
of warnings like this:

MDSs behind on trimming (MDS_TRIM)
MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
MDSs report slow requests (MDS_SLOW_REQUEST)
MDSs have many clients failing to respond to capability release
(MDS_CLIENT_LATE_RELEASE_MANY)

If there is only one client failing to respond to capability release, I can see the client id in the output, work out which user that is, and get their job stopped. Performance then usually improves a bit.

However, if there is more than one, the output only shows a summary count of clients and I don't know which clients they are, so I can't get their jobs cancelled.
Is there a way I can work out which clients these are? I'm guessing some kind of combination of in_flight_ops, blocked_ops and total num_caps?
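For what it's worth, the best I have managed so far is to rank sessions by num_caps with something like the sketch below (assuming the Luminous "session ls" admin socket output contains "id", "num_caps" and "client_metadata", and running it on the active MDS host; the daemon name is a placeholder):

#!/usr/bin/env python3
# Rank CephFS sessions by num_caps so the top cap holders are easy to spot.
# Assumptions: run on the host of the active MDS (the admin socket is local)
# and "session ls" returns JSON with "id", "num_caps" and "client_metadata".
import json
import subprocess

MDS = "mds.my-active-mds"  # placeholder, use your active MDS daemon name

out = subprocess.check_output(["ceph", "daemon", MDS, "session", "ls"])
sessions = json.loads(out)
for s in sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True)[:20]:
    meta = s.get("client_metadata", {})
    print(f"client.{s['id']:>10} caps={s.get('num_caps', 0):>8} "
          f"host={meta.get('hostname', '?')}")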

However, I also feel like just having a large number of caps isn't _necessarily_ an indicator of a problem; sometimes restarting the MDS and forcing clients to drop unused caps helps, sometimes it doesn't.

I'm curious whether there's a better way to determine which clients might be causing issues in the cluster.
To that end, I've noticed there is a metric called "request_load_avg" in the output of ceph mds client ls, but I can't quite find any information about it. It _seems_ like it could indicate a client that's doing lots and lots of requests, and therefore a useful metric to see which client might be smashing the cluster, but does anyone know for sure?

Many thanks,
Chris
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx