Hi all,

I've recently inherited a 10-node Ceph cluster running Luminous (12.2.12) that exists purely to serve CephFS, with one active MDS and two standbys (and I don't know much about MDS). It's not a great cluster, in my opinion: the cephfs_data pool sits on high-density nodes with high-capacity SATA drives, although at least the cephfs_metadata pool is on NVMe drives. Client access to the cluster regularly slows down, and I'm seeing lots of warnings like these:

  MDSs behind on trimming (MDS_TRIM)
  MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
  MDSs report slow requests (MDS_SLOW_REQUEST)
  MDSs have many clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE_MANY)

If only one client is failing to respond to capability release, I can see its client id in the output, work out which user it belongs to, and get their job stopped; performance then usually improves a bit. However, if more than one client is involved, the output only shows a summary count, and I can't tell who the clients are in order to get their jobs cancelled. Is there a way to work out which clients these are? I'm guessing it's some combination of in_flight_ops, blocked_ops and total num_caps (I've put a rough sketch of what I've been trying in the P.S. below). That said, a large number of caps isn't _necessarily_ an indicator of a problem on its own: sometimes restarting the MDS and forcing clients to drop unused caps helps, sometimes it doesn't. Is there a better way to pin down which clients might be causing trouble in the cluster?

To that end, I've noticed a metric called "request_load_avg" in the output of "ceph mds client ls", but I can't find any documentation about it. It _seems_ like it could flag a client that's issuing lots and lots of requests, and therefore be a useful way to spot whoever is hammering the cluster, but does anyone know for sure?

Many thanks,
Chris
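P.S. For what it's worth, below is the rough approach I've been experimenting with to tie the warnings back to specific clients. It's only a sketch: it assumes you can reach the admin socket on the host running the active MDS, that jq is installed, and that mds.<name> stands in for the actual daemon name; the JSON field names are just what I see in our Luminous output, so adjust as needed.

  # Top 10 sessions by cap count -- though as I said, a big cap
  # count alone doesn't necessarily mean the client is a problem.
  ceph daemon mds.<name> session ls | \
    jq -r 'sort_by(.num_caps) | reverse | .[:10][] |
           "\(.id)  caps=\(.num_caps)  host=\(.client_metadata.hostname // "?")"'

  # Count how often each client id shows up in the in-flight and
  # blocked ops. The op descriptions embed "client.<id>", so this
  # is crude string matching rather than anything official.
  for dump in dump_ops_in_flight dump_blocked_ops; do
    ceph daemon mds.<name> $dump | \
      jq -r '.ops[].description' | \
      grep -o 'client\.[0-9]*' | sort | uniq -c | sort -rn
  done

The idea being that a client id that sits near the top of the caps list *and* keeps appearing in the blocked ops is probably the one whose job needs chasing, but I'd welcome corrections if there's a proper way to do this.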