Hi,
do you see high disk utilization on the OSD nodes? How is the load on
the active MDS? How much RAM is configured for the MDS
(mds_cache_memory_limit)?
You can list all MDS sessions with 'ceph daemon mds.<MDS> session ls'
to identify all your clients and 'ceph daemon mds.<MDS>
dump_blocked_ops' to show blocked requests. But simply killing
sessions isn't a solution; first you need to find out where the
bottleneck is. Do you see hung requests? Anything in 'dmesg' on the
client side?
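To correlate the two, something like this could rank sessions by the
number of caps they hold. The sample JSON below only illustrates the
shape of 'ceph daemon mds.<MDS> session ls --format json' output; the
field names (id, num_caps, client_metadata) match typical Luminous
output, but verify against your own cluster:

```python
import json

# Sample data in the shape of 'ceph daemon mds.<MDS> session ls --format json'.
# Field names are assumptions based on typical Luminous output.
sample = json.loads("""
[
  {"id": 104123, "num_caps": 850000,
   "client_metadata": {"hostname": "compute-01", "mount_point": "/mnt/cephfs"}},
  {"id": 104999, "num_caps": 1200,
   "client_metadata": {"hostname": "login-02", "mount_point": "/mnt/cephfs"}}
]
""")

# Rank sessions by cap count so the heaviest clients surface first.
for s in sorted(sample, key=lambda s: s["num_caps"], reverse=True):
    host = s.get("client_metadata", {}).get("hostname", "unknown")
    print(f"client.{s['id']}  caps={s['num_caps']}  host={host}")
```

In practice you would pipe the real admin-socket output into a script
like this instead of embedding sample data.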
Quoting Chris Smart <distroguy@xxxxxxxxx>:
Hi all,
I have recently inherited a 10 node Ceph cluster running Luminous (12.2.12)
which is running specifically for CephFS (and I don't know much about MDS)
with only one active MDS server (two standby).
It's not a great cluster, IMO: the cephfs_data pool is on high-density
nodes with high-capacity SATA drives, but at least the cephfs_metadata
pool is on NVMe drives.
Access to the cluster regularly goes slow for clients and I'm seeing lots
of warnings like this:
MDSs behind on trimming (MDS_TRIM)
MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
MDSs report slow requests (MDS_SLOW_REQUEST)
MDSs have many clients failing to respond to capability release
(MDS_CLIENT_LATE_RELEASE_MANY)
If there is only one client that's failing to respond to capability release
I can see the client id in the output and work out what user that is and
get their job stopped. Performance then usually improves a bit.
However, if there is more than one, the output only shows a summary of the
number of clients and I don't know who the clients are to get their jobs
cancelled.
Is there a way I can work out what clients these are? I'm guessing some
kind of combination of in_flight_ops, blocked_ops and total num_caps?
However, I also feel like just having a large number of caps isn't
_necessarily_ an indicator of a problem; sometimes restarting the MDS
and forcing clients to drop unused caps helps, sometimes it doesn't.
I'm curious if there's a better way to determine any clients that might be
causing issues in the cluster?
To that end, I've noticed there is a metric called "request_load_avg" in
the output of ceph mds client ls, but I can't quite find any information
about it. It _seems_ like it could indicate a client that's making lots
and lots of requests, and would therefore be a useful metric for spotting
which client might be hammering the cluster, but does anyone know for sure?
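If that reading is right, sorting the client list by that field would
surface the busiest clients. The field name comes from the output
described above, but the JSON shape here is purely an assumption for
illustration:

```python
import json

# Sample data guessed at the shape of 'ceph mds client ls' JSON output;
# "request_load_avg" is the field reported in the email, the rest is assumed.
clients = json.loads("""
[
  {"id": 104123, "request_load_avg": 5231,
   "client_metadata": {"hostname": "compute-01"}},
  {"id": 104999, "request_load_avg": 12,
   "client_metadata": {"hostname": "login-02"}}
]
""")

# Highest request load first.
for c in sorted(clients, key=lambda c: c["request_load_avg"], reverse=True):
    print(f"client.{c['id']}  load={c['request_load_avg']}  "
          f"host={c['client_metadata']['hostname']}")
```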
Many thanks,
Chris
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx