Re: Client failing to respond to capability release

Patrick Donnelly <pdonnell@xxxxxxxxxx> · Fri, 1 Sep 2023 16:03:45 -0400

Hello Frank,

On Tue, Aug 22, 2023 at 11:42 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi all,
>
> I have this warning the whole day already (octopus latest cluster):
>
> HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time
> [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release
>     mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301
>     mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877
>     mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887
>     mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695
>
> If I look at the session info from mds.1 for these clients I see this:
>
> # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 -e 189511877 -e 189511887 -e 231250695
> {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
> {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
> {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
> {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
>
> We have mds_min_caps_per_client=4096, so it looks like the limit is well satisfied. Also, the file system is pretty idle at the moment.
>
> Why and what exactly is the MDS complaining about here?

These days, you'll generally see this because the client is "quiet"
and the MDS is opportunistically recalling caps, so that there is less
work to do later if it needs to shrink its cache. This would be
indicated by:

* The MDS is not complaining about an oversized cache.
* The session listing shows the session is quiet (the
"session_cache_liveness" is near 0).

However, the MDS should respect mds_min_caps_per_client by (a) not
recalling a client's caps below mds_min_caps_per_client and (b) not
complaining about a quiet client that already holds fewer caps than
mds_min_caps_per_client.
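
To spot sessions that match this pattern in a saved `session ls`
listing, something like the rough Python sketch below might help. It
is only an illustration: the function names, the 4096 default, and the
liveness threshold of 1.0 are my own placeholders, and I am assuming
"session_cache_liveness" is dumped either as a plain number or as an
object with a "value" field, so check that against your actual output.

```python
def liveness_value(session):
    """Return session_cache_liveness as a float, or None if it is absent.

    Assumption: the field is either a bare number or an object that
    carries the number under "value".
    """
    raw = session.get("session_cache_liveness")
    if isinstance(raw, dict):
        raw = raw.get("value")
    return float(raw) if raw is not None else None


def suspect_sessions(sessions, min_caps=4096, liveness_threshold=1.0):
    """List quiet sessions that hold fewer caps than min_caps.

    min_caps should match your mds_min_caps_per_client setting;
    liveness_threshold is an arbitrary cut-off for "near 0".
    """
    suspects = []
    for session in sessions:
        live = liveness_value(session)
        caps = session.get("num_caps", 0)
        # Skip sessions without a liveness value rather than guess.
        if live is None:
            continue
        if caps < min_caps and live < liveness_threshold:
            suspects.append(
                {"id": session.get("id"), "caps": caps, "liveness": live}
            )
    # Fewest caps first, like the sort_by(.caps) in the jq filter above.
    return sorted(suspects, key=lambda s: s["caps"])
```

Load the JSON written by `ceph tell mds.X session ls` with json.load()
and pass the resulting list to suspect_sessions(). Any client that
shows up there and is also named in the MDS_CLIENT_LATE_RELEASE
warning is a candidate for the behaviour described above.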

So, you may have found a bug. The next time this happens, the output
of `ceph tell mds.X config diff` and `ceph tell mds.X perf dump`, plus
a selection of the relevant entries from `ceph tell mds.X session ls`,
will help debug this, I think.
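
If it is easier to grab all three in one go while the warning is
active, a small wrapper along these lines could do it. This is just a
sketch: the helper names and the output file naming are made up for
illustration, and it assumes the `ceph` CLI is on the PATH with a
keyring that is allowed to run `ceph tell` against the MDS.

```python
import subprocess
from pathlib import Path

# The three diagnostics mentioned above, keyed by a label used in file names.
SUBCOMMANDS = {
    "config_diff": ["config", "diff"],
    "perf_dump": ["perf", "dump"],
    "session_ls": ["session", "ls"],
}


def build_commands(mds):
    """Return the argv list for each diagnostic, e.g. for mds='1' or 'ceph-24'."""
    return {
        label: ["ceph", "tell", f"mds.{mds}", *args]
        for label, args in SUBCOMMANDS.items()
    }


def collect(mds, outdir="."):
    """Run each command and save its stdout to <outdir>/mds.<mds>.<label>.json."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for label, cmd in build_commands(mds).items():
        # check=True raises CalledProcessError if the ceph command fails.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        path = out / f"mds.{mds}.{label}.json"
        path.write_text(result.stdout)
        written.append(path)
    return written
```

Calling collect("1", "mds-debug") would then leave three files in
mds-debug/ that can be trimmed down and attached to a reply or a
tracker ticket.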

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D