Hi,
do you see high disk utilization on the OSD nodes? How is the load on
the active MDS? How much RAM is configured for the MDS
(mds_cache_memory_limit)?
You can list all MDS sessions with 'ceph daemon mds.<MDS> session ls'
to identify all your clients and 'ceph daemon mds.<MDS>
dump_blocked_ops' to show blocked requests. But simply killing
sessions isn't a solution; first you need to find out where the
bottleneck is. Do you see hung requests? Anything in 'dmesg' on the
client side?
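To correlate the two, something like this could rank sessions by the
number of caps they hold. The sample JSON below only illustrates the
shape of 'ceph daemon mds.<MDS> session ls --format json' output; the
field names (id, num_caps, client_metadata) match typical Luminous
output, but verify against your own cluster:

```python
import json

# Sample data in the shape of 'ceph daemon mds.<MDS> session ls --format json'.
# Field names are assumptions based on typical Luminous output.
sample = json.loads("""
[
  {"id": 104123, "num_caps": 850000,
   "client_metadata": {"hostname": "compute-01", "mount_point": "/mnt/cephfs"}},
  {"id": 104999, "num_caps": 1200,
   "client_metadata": {"hostname": "login-02", "mount_point": "/mnt/cephfs"}}
]
""")

# Rank sessions by cap count so the heaviest clients surface first.
for s in sorted(sample, key=lambda s: s["num_caps"], reverse=True):
    host = s.get("client_metadata", {}).get("hostname", "unknown")
    print(f"client.{s['id']}  caps={s['num_caps']}  host={host}")
```

In practice you would pipe the real admin-socket output into a script
like this instead of embedding sample data.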
Quoting Chris Smart <distroguy@xxxxxxxxx>:
Hi all,
I have recently inherited a 10 node Ceph cluster running Luminous (12.2.12)
which is running specifically for CephFS (and I don't know much about MDS)
with only one active MDS server (two standby).
It's not a great cluster, IMO: the cephfs_data pool is on high-density
nodes with high-capacity SATA drives, but at least the cephfs_metadata
pool is on NVMe drives.
Access to the cluster regularly goes slow for clients and I'm seeing lots
of warnings like this:
MDSs behind on trimming (MDS_TRIM)
MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
MDSs report slow requests (MDS_SLOW_REQUEST)
MDSs have many clients failing to respond to capability release
(MDS_CLIENT_LATE_RELEASE_MANY)
If there is only one client that's failing to respond to capability release
I can see the client id in the output and work out what user that is and
get their job stopped. Performance then usually improves a bit.
However, if there is more than one, the output only shows a summary of the
number of clients and I don't know who the clients are to get their jobs
cancelled.
Is there a way I can work out what clients these are? I'm guessing some
kind of combination of in_flight_ops, blocked_ops and total num_caps?
However, I also feel like just having a large number of caps isn't
_necessarily_ an indicator of a problem; sometimes restarting the MDS
and forcing clients to drop unused caps helps, sometimes it doesn't.
I'm curious if there's a better way to determine any clients that might be
causing issues in the cluster?
To that end, I've noticed there is a metric called "request_load_avg" in
the output of ceph mds client ls, but I can't quite find any information
about it. It _seems_ like it could indicate a client that's making lots
and lots of requests, and would therefore be a useful metric for spotting
which client might be hammering the cluster, but does anyone know for sure?
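If that reading is right, sorting the client list by that field would
surface the busiest clients. The field name comes from the output
described above, but the JSON shape here is purely an assumption for
illustration:

```python
import json

# Sample data guessed at the shape of 'ceph mds client ls' JSON output;
# "request_load_avg" is the field reported in the email, the rest is assumed.
clients = json.loads("""
[
  {"id": 104123, "request_load_avg": 5231,
   "client_metadata": {"hostname": "compute-01"}},
  {"id": 104999, "request_load_avg": 12,
   "client_metadata": {"hostname": "login-02"}}
]
""")

# Highest request load first.
for c in sorted(clients, key=lambda c: c["request_load_avg"], reverse=True):
    print(f"client.{c['id']}  load={c['request_load_avg']}  "
          f"host={c['client_metadata']['hostname']}")
```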
Many thanks,
Chris
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx