Hello Frank,

On Tue, Aug 22, 2023 at 11:42 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi all,
>
> I have this warning the whole day already (octopus latest cluster):
>
> HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time
> [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release
>     mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301
>     mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877
>     mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887
>     mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695
>
> If I look at the session info from mds.1 for these clients, I see this:
>
> # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 -e 189511877 -e 189511887 -e 231250695
> {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
> {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
> {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
> {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
>
> We have mds_min_caps_per_client=4096, so it looks like the limit is well satisfied. Also, the file system is pretty idle at the moment.
>
> Why and what exactly is the MDS complaining about here?

These days, you'll generally see this warning because the client is "quiet" and the MDS is opportunistically recalling caps to reduce future work for when it later needs to shrink its cache. That situation is indicated by:

* The MDS is not complaining about an oversized cache.
* The session listing shows the session is quiet (its "session_cache_liveness" is near 0).

However, the MDS should respect mds_min_caps_per_client by (a) not recalling a client's caps below mds_min_caps_per_client and (b) not complaining that a quiet client holds fewer than mds_min_caps_per_client caps. So, you may have found a bug.

The next time this happens, the output of `ceph tell mds.X config diff`, `ceph tell mds.X perf dump`, and the relevant entries from `ceph tell mds.X session ls` will help debug this, I think.

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
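
For what it's worth, a sketch of how to check the "quiet session" theory along the lines of the jq pipeline quoted above. It assumes mds.1 is still the rank reporting the warning, uses the four client IDs from the warning, and assumes the session entries carry "session_cache_liveness" and "recall_caps" fields (present on recent releases; exact field layout may vary by version):

  # Show cap count, cache liveness and recall counters for the flagged clients only.
  # Liveness values near 0 while the warning is active point at the quiet-client case.
  ceph tell mds.1 session ls | jq -c '.[]
    | select(.id == 145698301 or .id == 189511877 or .id == 189511887 or .id == 231250695)
    | {id: .id, caps: .num_caps, liveness: .session_cache_liveness, recall: .recall_caps}'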
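
And a sketch of the kind of capture meant above for the next time the warning fires; the output file names are just placeholders, and mds.1 should be replaced with whichever rank is named in the warning:

  # Capture MDS state while MDS_CLIENT_LATE_RELEASE is being reported.
  ceph tell mds.1 config diff > mds1-config-diff.txt
  ceph tell mds.1 perf dump   > mds1-perf-dump.json
  ceph tell mds.1 session ls  > mds1-session-ls.json

Having those three outputs from the same moment (or at least the session entries for the clients named in the warning) makes it much easier to see whether the recall is actually respecting mds_min_caps_per_client.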