Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

On Mon, 2022-08-15 at 09:00 +0000, Frank Schilder wrote:
> Hi Chris,

> 

Hi Frank, thanks for the reply.

> I also have serious problems identifying problematic ceph-fs clients
> (using mimic). I don't think that even in the newest ceph version
> there are useful counters for that. Just last week I had the case
> that a client caused an all-time peak in cluster load and I was not
> able to locate the client due to the lack of useful rate counters.
> There are two problems with ceph fs' load monitoring: the first is
> the complete lack of rate-based IO load counters down to client+PID
> level, and the second is that the warnings that do get generated flag
> the wrong clients.
> 

Yikes, sounds familiar...

> The second problem is basically explained in this thread,
> specifically in this message:
> 
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/TWNF2PWM7SONLCT4OLAJLMLXHK3ABPUB/
> 
> It states that warnings are generated for *inactive* clients, not for
> clients that are actually causing the trouble. Worse yet, the
> proposed solution counteracts the problem that MDS client caps recall
> is usually way too slow. I had to increase it to 64K just to get the
> MDS cache balanced, because MDSes don't have a concept of rate-
> limiting clients that go bonkers. The effect is that the MDSes
> punish all others because of a single rogue client instead of rate-
> limiting the bad one.
> 

Thanks for linking to that thread, it's very interesting.

> The first problem is essentially that useful IO rate counters are
> missing, for example, for each client the rates with which it
> acquires and releases caps. What I really would love to see are
> warnings for "clients acquiring caps much faster than releasing"
> (with client ID and PID) and MDS-side rate-balancing essentially
> throttling such aggressive clients. Every client holding more than,
> say, 2*max-caps caps should be throttled so that caps-acquire rate =
> caps-release rate. I also don't understand why the MDS is not going
> after the rich clients first. I get warnings all the time that a
> client with 4000 caps is not releasing fast enough, while some fat
> cats sit on millions and are not flagged as problematic. Why is the
> recall rate not proportional to the amount of caps a client holds?
> 

I don't know the answer, but is it the case that the number of caps by
itself doesn't necessarily indicate a bad client? If I had a long-
running job that slowly trawled through millions of files but didn't
release caps, I might end up holding millions of caps without really
putting any pressure on the MDS?

Versus someone who's got 12 parallel threads linking and unlinking
thousands of the same files?

If that's true, then maybe what's needed is some kind of counter that
tracks the rate of cap acquisition versus the number of metadata
updates required, or something... I don't know.
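
For what it's worth, below is the kind of thing I have in mind: a rough,
untested sketch that diffs two "session ls" snapshots to estimate a
per-client cap acquisition rate. The daemon name is a placeholder and
the JSON field names ("id", "num_caps") are assumptions based on what my
Luminous MDS admin socket returns, so treat it as an illustration rather
than a recipe.

#!/usr/bin/env python3
# Rough, untested sketch: estimate a per-client cap acquisition rate by
# diffing two "session ls" snapshots taken INTERVAL seconds apart.
# Assumptions: the MDS daemon name below is a placeholder, and the JSON
# fields "id" and "num_caps" are what my Luminous admin socket returns;
# they may differ on other releases.
import json
import subprocess
import time

MDS = "mds.ceph-node-01"   # placeholder daemon name, adjust to your MDS
INTERVAL = 60              # seconds between the two snapshots


def snapshot():
    """Return {client_id: num_caps} as reported by the MDS admin socket."""
    out = subprocess.check_output(["ceph", "daemon", MDS, "session", "ls"])
    return {s["id"]: s.get("num_caps", 0) for s in json.loads(out)}


before = snapshot()
time.sleep(INTERVAL)
after = snapshot()

# A large positive rate means the client acquired caps much faster than
# it released them during the interval.
rates = {
    cid: (caps - before[cid]) / float(INTERVAL)
    for cid, caps in after.items()
    if cid in before
}

for cid, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print("client.%s  %+.1f caps/s  (now holding %d caps)"
          % (cid, rate, after[cid]))

Run from cron and graphed, something like that might at least show which
clients are grabbing caps faster than they release them.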


> Another counter that is missing is an actual IO rate counter. MDS
> requests are in no way indicative of a client's IO activity. Once it
> has the caps for a file it talks to OSDs directly. This communication
> is not reflected in any counter I'm aware of. To return to my case
> above, I had clients with more than 50K average load requests, but
> these were completely harmless (probably served from local cache).
> The MDS did not show any unusual behaviour like growing cache and the
> like. Everything looked normal except for OSD server load which sky-
> rocketed to unprecedented levels due to some client's IO requests.
> 

Oh yeah, I think we're thinking along similar lines: num_caps by itself
doesn't necessarily indicate a problematic client... Do you know what
request_load_avg actually measures? It sounds like it's not really
about performance load, but maybe just the volume of requests? I don't
know what that metric really is...
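
In the meantime, the best I've come up with is simply sorting the
sessions by that field. Another rough, untested sketch, assuming
"ceph tell mds.0 client ls" emits JSON whose entries carry "id",
"num_caps", "request_load_avg" and a "client_metadata" dict containing
"hostname" (roughly what I see here, but it may differ between
releases):

#!/usr/bin/env python3
# Rough, untested sketch: list the sessions with the highest
# request_load_avg. Assumptions: "ceph tell mds.0 client ls" emits a
# JSON array whose entries carry "id", "num_caps", "request_load_avg"
# and a "client_metadata" dict with "hostname"; field names may differ
# between releases, so adjust to whatever your MDS actually reports.
import json
import subprocess

out = subprocess.check_output(["ceph", "tell", "mds.0", "client", "ls"])
clients = json.loads(out)
clients.sort(key=lambda c: c.get("request_load_avg", 0), reverse=True)

for c in clients[:10]:
    meta = c.get("client_metadata", {})
    print("client.%s  request_load_avg=%s  num_caps=%s  host=%s"
          % (c["id"], c.get("request_load_avg", "?"),
             c.get("num_caps", "?"), meta.get("hostname", "?")))

At least that would show which sessions the MDS itself thinks are
busiest, even if I still don't know what the number actually measures.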


> It must have been small random IO, and currently the only way to
> identify such clients is by looking at network packet traffic.
> Unfortunately, our
> network monitoring system has a few blind spots and I was not able to
> find out which client was bombarding the OSDs with a packet storm.
> Proper IO rate counters down to PID level and appropriate warnings
> about aggressive clients would really help and are sorely missed.
> 

Yeah, I see... that would be really useful. I'm not sure if my
situation is the same or not; I feel like my MDS is just not able to
keep up and that the OSDs are actually OK... but I don't know for sure.

Thanks, I appreciate all the information! I'm hopeful that with some
help I might be able to identify problematic clients, maybe using some
combination of num_caps, ops, load, etc... I still think that would be
useful to know, even once the bottlenecks in my cluster are discovered
and remedied...
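
As a starting point for that combination idea, here is one more rough,
untested sketch that counts in-flight MDS ops per client, so they can
be cross-referenced against num_caps and request_load_avg. It assumes
dump_ops_in_flight returns an "ops" list whose "description" strings
start with "client_request(client.<id>:...", which is what I see on my
cluster, and the daemon name is again a placeholder.

#!/usr/bin/env python3
# Rough, untested sketch: count the MDS's in-flight ops per client, to
# cross-reference against num_caps and request_load_avg. Assumptions:
# the daemon name is a placeholder, and dump_ops_in_flight returns an
# "ops" list whose "description" strings start with
# "client_request(client.<id>:...", which is what I see on my cluster.
import collections
import json
import re
import subprocess

MDS = "mds.ceph-node-01"   # placeholder daemon name

out = subprocess.check_output(["ceph", "daemon", MDS, "dump_ops_in_flight"])
ops = json.loads(out).get("ops", [])

per_client = collections.Counter()
for op in ops:
    match = re.search(r"client\.(\d+)", op.get("description", ""))
    if match:
        per_client[match.group(1)] += 1

for cid, count in per_client.most_common(10):
    print("client.%s  %d in-flight ops" % (cid, count))

Cross-referencing the clients that top all three lists seems like the
best I can do until proper per-client rate counters exist.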

Cheers,
-c

> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: Chris Smart <distroguy@xxxxxxxxx>
> Sent: 14 August 2022 05:47:12
> To: ceph-users@xxxxxxx
> Subject: What is client request_load_avg? Troubleshooting MDS issues
> on Luminous
> 
> Hi all,
> 
> I have recently inherited a 10 node Ceph cluster running Luminous
> (12.2.12) which is used specifically for CephFS (and I don't know much
> about MDS), with only one active MDS server (two standby).
> It's not a great cluster IMO: the cephfs_data pool is on high density
> nodes with high capacity SATA drives, but at least the cephfs_metadata
> pool is on nvme drives.
> 
> Access to the cluster regularly goes slow for clients and I'm seeing
> lots of warnings like this:
> 
> MDSs behind on trimming (MDS_TRIM)
> MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
> MDSs report slow requests (MDS_SLOW_REQUEST)
> MDSs have many clients failing to respond to capability release
> (MDS_CLIENT_LATE_RELEASE_MANY)
> 
> If there is only one client that's failing to respond to capability
> release, I can see the client id in the output, work out what user
> that is, and get their job stopped. Performance then usually improves
> a bit.
> 
> However, if there is more than one, the output only shows a summary
> of the number of clients, and I don't know who the clients are to get
> their jobs cancelled.
> Is there a way I can work out what clients these are? I'm guessing
> some kind of combination of in_flight_ops, blocked_ops and total
> num_caps?
> 
> However, I also feel like just having a large number of caps isn't
> _necessarily_ an indicator of a problem; sometimes restarting the MDS
> and forcing clients to drop unused caps helps, sometimes it doesn't.
> 
> I'm curious if there's a better way to determine any clients that
> might be causing issues in the cluster?
> To that end, I've noticed there is a metric called "request_load_avg"
> in the output of ceph mds client ls, but I can't quite find any
> information about it. It _seems_ like it could indicate a client
> that's doing lots and lots of requests, and therefore a useful metric
> to see what client might be smashing the cluster, but does anyone know
> for sure?
> 
> Many thanks,
> Chris
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



