Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

On Mon, 2022-08-15 at 08:33 +0000, Eugen Block wrote:
> Hi,
> 
> do you see high disk utilization on the OSD nodes? 

Hi Eugen, thanks for the reply, much appreciated.

> How is the load on  
> the active MDS?

Yesterday I rebooted the three MDS nodes one at a time (which obviously
included a failover to a freshly booted node) and since then the
performance has improved. It could be a total coincidence though and
I'd really like to try and understand more of what's really going on.

The load seems to stay pretty low on the active MDS server (currently
1.56, 1.62, 1.57) and it has plenty of free RAM (60G used, 195G free).

The MDS servers spend almost no CPU time waiting on I/O (occasionally
~0.2 wa), so there doesn't seem to be a disk or network bottleneck.

However, the ceph-mds process is pretty much constantly over 100% CPU
and often over 200%. It's a single process, right? That makes me think
some operations are too slow, or some task is pegging a core at 100%.

Perhaps profiling the MDS somehow might tell me what kind of thing
it's stuck on?
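
I'm assuming plain Linux perf on the MDS host, plus the counters on
the admin socket, would be a reasonable first pass at that - something
like the following (treat it as a sketch, I haven't checked how much
detail 12.2.12 exposes or whether we have debug symbols installed):

  # sample where ceph-mds is spending its CPU time
  sudo perf top -p $(pidof ceph-mds)

  # dump the MDS's internal performance counters via the admin socket
  sudo ceph daemon mds.$(hostname) perf dump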

> How much RAM is configured for the MDS  
> (mds_cache_memory_limit)?

Currently set to 51539607552, which is 48 GiB (so roughly 50G).

We do often see usage go over that limit and, as far as I understand,
this triggers the MDS to ask clients to release unused caps (and we do
get clients that don't respond).

I think restarting the MDS causes the clients to drop all of their
unused caps, but hold the used ones for when the new MDS comes online
(so as not to overwhelm it)?

I'm not sure whether increasing the cache size helps (because it can
store more caps and put less pressure on the system when it tries to
drop them), or whether that actually increases pressure (because it has
more to track and more things to do).

We do have RAM free on the node though so we could increase it if you
think it might help?
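
If raising it is worth trying, my understanding (please correct me if
this is wrong for Luminous) is that it would look something like the
below, with 68719476736 being 64G as an example value; a permanent
change would still need to go into ceph.conf:

  # check the current value on the active MDS via the admin socket
  sudo ceph daemon mds.$(hostname) config get mds_cache_memory_limit

  # apply a new value at runtime on this daemon (example: 64G)
  sudo ceph daemon mds.$(hostname) config set mds_cache_memory_limit 68719476736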

> You can list all MDS sessions with 'ceph daemon mds.<MDS> session
> ls'  
> to identify all your clients

Thanks, yeah, there is a lot of nice info in there, although I'm not
quite sure which elements are useful. That's where I saw
"request_load_avg", which I'm not quite sure how to interpret.

We do have ~5000 active clients (and that number is pretty consistent).

The top 5 clients have over a million caps each, with the top client
having over 5 million itself.
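
In case it's useful to anyone following along, this is roughly how I'm
pulling those out (assuming jq is available and that the field names
match what my 12.2.12 MDS emits - num_caps and request_load_avg are
what I see in the session ls output here):

  # top 5 sessions by cap count, with request_load_avg and hostname
  sudo ceph daemon mds.$(hostname) session ls |
    jq -r 'sort_by(-.num_caps) | .[:5][] |
           [.id, .num_caps, .request_load_avg, .client_metadata.hostname] | @tsv'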

> and 'ceph daemon mds.<MDS>  
> dump_blocked_ops' to show blocked requests.

There are no blocked ops at the moment, according to 'ceph daemon
mds.$(hostname) dump_blocked_ops', but I can try again once the system
performance degrades.
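
Next time it degrades I'll also grab the in-flight ops and the MDS's
own outstanding RADOS requests, which I believe are exposed on the
same admin socket:

  # requests the MDS is currently processing
  sudo ceph daemon mds.$(hostname) dump_ops_in_flight

  # RADOS operations the MDS itself is waiting on (its objecter queue)
  sudo ceph daemon mds.$(hostname) objecter_requests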

I feel like I need to get some of these metrics out into Prometheus or
something, so that I can look for historical trends (and add alerts).
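
If I remember right, the ceph-mgr prometheus module is available on
Luminous, so the plan would be roughly this (assuming the defaults,
which I think expose metrics from the active mgr on TCP port 9283 at
/metrics):

  # enable the mgr prometheus exporter, then point Prometheus at it
  sudo ceph mgr module enable prometheus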

> But simply killing  
> sessions isn't a solution, so first you need to find out where the  
> bottleneck is.

Yeah, I totally agree with finding the real bottleneck, thanks for your
help.

My thinking could be totally wrong, but the reason I was looking into
identifying and killing problematic clients is that we get bursts
where some clients are making harsh requests (like multiple jobs
trying to read/link/unlink millions of tiny files at once). If I can
identify them, I could try to 1) stop them to restore cluster
performance for everyone else, and 2) get them to find a better way to
do that task so we can avoid the issue altogether...

To your point about finding the source of the bottleneck, though, I'd
much rather the Ceph cluster were able to handle anything thrown at
it... :-) My feeling is that the MDS is easily overwhelmed; hopefully
profiling can help shine a light there.

> Do you see hung requests or something? Anything in  
> 'dmesg' on the client side?

I don't see anything useful in dmesg on the client side,
unfortunately, just lots of clients talking to mons successfully. The
clients use the kernel CephFS driver and mount with relatime (which
could explain the large number of caps, even on a read-only mount) and
acl (which I assume adds extra load/checks on the MDS).

At a guess, we could probably optimise the client mounts with noatime
instead, and maybe drop acl if we're not actually using ACLs - I'm not
sure of the impact on workloads though, so I haven't tried.
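
For reference, the sort of fstab entry I have in mind is something
like this (the monitor names, client name and secret file path are
just placeholders for our real ones):

  # kernel CephFS mount with noatime and without the acl option
  mon1,mon2,mon3:/ /cephfs ceph name=cephfsclient,secretfile=/etc/ceph/cephfsclient.secret,noatime,_netdev 0 0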

I'm also not quite sure of the relationship between MDS operations and
OSD operations. The metadata pool is on NVMe and clients access file
data directly on the OSD nodes, but do MDS operations also need to
wait for OSDs to perform operations? I think it makes sense that they
do (for example, to unlink a file the MDS needs to check whether there
are any other hard links to it, and if not, the data can be deleted
from the OSDs and the metadata updated to remove the file)?

So to that end, would slow-performing OSDs also impact MDS
performance? Maybe it's stuck waiting for the OSDs to do their thing
and they aren't fast enough... but then wouldn't I see much more %wa?
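
I'll try keeping an eye on OSD latency and the per-pool rates while
it's slow too; as far as I know these are standard commands:

  # per-OSD commit/apply latency as seen by the cluster
  sudo ceph osd perf

  # client I/O rates for the metadata pool
  sudo ceph osd pool stats cephfs_metadata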

One thing I noticed yesterday is that when the cluster is under
pressure, the MDS's I/O and throughput to the metadata pool get very
spiky (the data pool did not go spiky).

I also saw that client requests were spiky and often dropped to zero.
That makes me think it's either an intermittent network issue, or the
MDS is getting so overloaded that it can't actually accept new client
requests; then, when it's less busy, a flood comes in and it gets too
busy again.
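
To catch that in the act I was thinking of watching the per-second
counters on the active MDS, something like this (if I have the command
right for Luminous):

  # live, top-style view of the MDS performance counters
  sudo ceph daemonperf mds.$(hostname)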

Failing over the MDSs didn't help much but rebooting them seems to have
stopped that behaviour (but again, could just be coincidence)...

There does seem to be a correlation between the number of caps and the
performance issues (which I guess makes sense, as the MDS is simply
busier), but reducing the total number of caps doesn't fix it on its
own.

I really just don't understand the system well enough yet to know where
to dig, so many thanks for your assistance!

-c


> 
> 
> Quoting Chris Smart <distroguy@xxxxxxxxx>:
> 
> > Hi all,
> > 
> > I have recently inherited a 10 node Ceph cluster running Luminous
> > (12.2.12)
> > which is running specifically for CephFS (and I don't know much
> > about MDS)
> > with only one active MDS server (two standby).
> > It's not a great cluster IMO, the cephfs_data pool is on high
> > density nodes
> > with high capacity SATA drives but at least the cephfs_metadata
> > pool is on
> > nvme drives.
> > 
> > Access to the cluster regularly goes slow for clients and I'm
> > seeing lots
> > of warnings like this:
> > 
> > MDSs behind on trimming (MDS_TRIM)
> > MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
> > MDSs report slow requests (MDS_SLOW_REQUEST)
> > MDSs have many clients failing to respond to capability release
> > (MDS_CLIENT_LATE_RELEASE_MANY)
> > 
> > If there is only one client that's failing to respond to capability
> > release
> > I can see the client id in the output and work out what user that
> > is and
> > get their job stopped. Performance then usually improves a bit.
> > 
> > However, if there is more than one, the output only shows a summary
> > of the
> > number of clients and I don't know who the clients are to get their
> > jobs
> > cancelled.
> > Is there a way I can work out what clients these are? I'm guessing
> > some
> > kind of combination of in_flight_ops, blocked_ops and total
> > num_caps?
> > 
> > However, I also feel like just having a large number of caps isn't
> > _necessarily_ an indicator of a problem, sometimes restarting MDS
> > and
> > forcing clients to drop unused caps helps, sometimes it doesn't.
> > 
> > I'm curious if there's a better way to determine any clients that
> > might be
> > causing issues in the cluster?
> > To that end, I've noticed there is a metric called
> > "request_load_avg" in
> > the output of ceph mds client ls but I can't quite find any
> > information
> > about it. It _seems_ like it could indicate a client that's doing
> > lots and
> > lots of requests and therefore a useful metric to see what client
> > might be
> > smashing the cluster, but does anyone know for sure?
> > 
> > Many thanks,
> > Chris

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



