I see, that was also SUSE's recommendation [2], but without a real
explanation, just some assumptions about a possible network disconnect.
[2] https://www.suse.com/support/kb/doc/?id=000019628
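If one wants to clear such a stuck session from the MDS side rather than
rebooting the client, eviction by client id should do it (just a sketch,
rank and id are placeholders taken from the health output quoted below;
note that eviction blocklists the client by default):

# ceph tell mds.1 client evict id=145698301

The client usually needs a remount or reboot afterwards before it can
use the file system again.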
Quoting Frank Schilder <frans@xxxxxx>:
Hi Eugen, thanks for that :D
This time it was something different, possibly a bug in the kclient.
On these nodes I found sync commands stuck in D-state. I guess a
file/dir could not be synced, or there was some kind of corruption of
buffered data. We had to reboot the servers to clear that out.
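In case someone wants to check for the same symptom: processes stuck in
uninterruptible sleep show up with state D, for example (plain Linux
tooling, nothing Ceph-specific, <pid> is a placeholder):

# ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# cat /proc/<pid>/stack

The second command needs root and shows where in the kernel the process
is hanging.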
On first inspection these clients looked OK. Only some deeper
debugging revealed that something was off.
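For reference, this kind of deeper debugging can be done on the client
via debugfs, assuming the kernel client is used and debugfs is mounted
(the exact path depends on fsid and client id, hence the wildcard):

# cat /sys/kernel/debug/ceph/*/mdsc
# cat /sys/kernel/debug/ceph/*/caps

Requests that never drain from mdsc, or cap counts that don't match what
the MDS reports, are a good hint that the client is wedged.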
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Wednesday, August 23, 2023 8:55 AM
To: ceph-users@xxxxxxx
Subject: Re: Client failing to respond to capability release
Hi,
pointing you to your own thread [1] ;-)
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/HFILR5NMUCEZH7TJSGSACPI4P23XTULI/
Quoting Frank Schilder <frans@xxxxxx>:
Hi all,
I have been seeing this warning all day already (cluster on latest Octopus):
HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time
[WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release
    mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301
    mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877
    mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887
    mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695
If I look at the session info from mds.1 for these clients I see this:
# ceph tell mds.1 session ls | jq -c '[.[] |
    {id: .id, h: .client_metadata.hostname, addr: .inst,
     fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}] |
    sort_by(.caps) | .[]' |
  grep -e 145698301 -e 189511877 -e 189511887 -e 231250695
{"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
{"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
{"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
{"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
We have mds_min_caps_per_client=4096, so these clients are well within
that limit. Also, the file system is pretty idle at the moment.
Why, and about what exactly, is the MDS complaining here?
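In case it helps with the diagnosis, the cap/recall related settings can
be read back like this (the grep pattern is just an example; the daemon
name is taken from the health output above and the second command has to
run on the host of that MDS):

# ceph config get mds mds_min_caps_per_client
# ceph daemon mds.ceph-24 config show | grep -e caps_per_client -e recall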
Thanks and best regards.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx