Re: Clients failing to respond to capability release

Hi,


On 19-09-2023 13:35, Tim Bishop wrote:
Hi,

I've seen this issue mentioned in the past, but with older releases. So
I'm wondering if anybody has any pointers.

The Ceph cluster is running Pacific 16.2.13 on Ubuntu 20.04. Almost all
clients are working fine, with the exception of our backup server. This
is using the kernel CephFS client on Ubuntu 22.04 with kernel 6.2.0 [1]
(so I suspect a newer Ceph version?).

The backup server has multiple (12) CephFS mount points. One of them,
the busiest, regularly causes this error on the cluster:

HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
     mds.mds-server(mds.0): Client backupserver:cephfs-backupserver failing to respond to capability release client_id: 521306112

And occasionally, which may be unrelated, but occurs at the same time:

[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
     mds.mds-server(mds.0): 1 slow requests are blocked > 30 secs

The second one clears itself, but the first sticks until I can unmount
the filesystem on the client after the backup completes.

You are not alone. We also have a backup server running 22.04 and kernel 6.2, and we occasionally hit this issue, mainly with 5.12.19 clients and the 6.2 backup server. We're on 16.2.11.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sidenote:

For those of you who are wondering why you would want to use the latest (greatest?) Linux kernel for CephFS ... this is why: to try to get rid of 1) slow requests because of some deadlock / locking issue and 2) clients failing to respond to capability release, and to pick up 3) bug fixes / improvements (thx devs!).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Questions:

Do you have the filesystem mounted read-only, and have you given the backup server's CephFS client read-only caps on the MDS?

Are you running a multiple active MDS setup?
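
In case it helps, this is roughly how we would check both (the filesystem name "cephfs", client name "client.backup", mount point and secretfile path are placeholders, adjust to your setup):

  # Caps the backup client was authorized with (a read-only client shows "r" only)
  ceph auth get client.backup

  # Granting read-only caps on the whole tree would look like this
  ceph fs authorize cephfs client.backup / r

  # Mounting read-only on the backup server
  mount -t ceph :/ /mnt/backup -o name=backup,secretfile=/etc/ceph/backup.secret,ro

  # Check whether more than one MDS rank is active
  ceph fs status
  ceph fs get cephfs | grep max_mds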


It appears that whilst it's in this stuck state there may be one or more
directory trees that are inaccessible to all clients. The backup server
is walking the whole tree but never gets stuck itself, so either the
inaccessible directory entry is caused after it has gone past, or it's
not affected. Maybe the backup server is holding a directory when it
shouldn't?

We have seen both cases, but most of the time the backup server is unable to make progress and gets stuck on a file.


It may be that an upgrade to Quincy resolves this, since it's more
likely to be inline with the kernel client version wise, but I don't
want to knee-jerk upgrade just to try and fix this problem.

We are testing with 6.5 kernel clients (see other recent threads about this). We have not seen this issue there, but time will tell; it does not happen *that* often, and we have hit other issues.

The MDS itself is indeed older than the newer kernel clients, and that might certainly be a factor. It raises the question of what interoperability / compatibility testing (if any) is done between CephFS (kernel) client and MDS server versions. This might be a good "focus topic" for a Ceph User + Dev meeting ...


Thanks for any advice.

You might want to try a 6.5.x kernel on the clients, though you might run into other issues. Not sure about that; those might only be relevant to one of our workloads. Only one way to find out ...

Enable debug logging on the MDS to gather logs that might shine some light on what is happening with that request.
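
Something along these lines (the daemon name is taken from your health warning, and 20 / 1 are just the verbosity levels we would start with; drop them back afterwards, the logs grow quickly):

  # Raise MDS debug logging at runtime
  ceph config set mds.mds-server debug_mds 20
  ceph config set mds.mds-server debug_ms 1

  # ... reproduce the stuck capability release ...

  # Revert to the defaults afterwards
  ceph config rm mds.mds-server debug_mds
  ceph config rm mds.mds-server debug_ms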

Running 'ceph daemon mds.<name> dump_ops_in_flight' might help here to get the client id and the stuck request.
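
For completeness, as we would run it on the host of the active MDS (again, the daemon name comes from your warning); 'session ls' then maps the client id back to a hostname / mount:

  # List requests currently in flight / stuck on the MDS
  ceph daemon mds.mds-server dump_ops_in_flight

  # Map the client id from the warning (521306112) back to a host and mount point
  ceph daemon mds.mds-server session ls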

Another thing you might do is dump the cache on the MDS to gather more info. This is however highly dependent on the amount of RAM the MDS is using: in the past dumping the cache would kill the MDS for us (unresponsive, replaced by standby-replay). Improvements to prevent that have been made ... but we have not tried it since. See this thread [1]. What MDS memory target have you set? Make sure you have enough disk space to store the dump file.

To actually make sense of that dump file / debug logging you need to understand _exactly_ how the caps mechanism works, see if it is violated somewhere ... and then look in the code to see why. Short of that knowledge, the CephFS developers might help out.
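
A rough sketch of that (assuming the memory target you set is the mds_cache_memory_limit option; the dump path is just an example, the file is written on the MDS host and can get very large):

  # Current cache memory limit of the MDS
  ceph config get mds.mds-server mds_cache_memory_limit

  # Dump the MDS cache to a file on the MDS host (make sure there is enough disk space)
  ceph daemon mds.mds-server dump cache /var/log/ceph/mds-cache.dump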

If you are running a multiple active MDS setup, you might set an export pin on the problematic path to export it to a dedicated MDS. That one might be easier to troubleshoot (it isolates the problem).
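
The pin is set via an extended attribute from any client that has the directory mounted; the rank (1) and path below are just examples:

  # Pin the problematic subtree to MDS rank 1
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/path/to/problem/dir

  # Remove the pin again later (-1 means inherit from the parent again)
  setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/path/to/problem/dir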


Gr. Stefan

[1]: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/LZ25PAD4YFLUUYLX2HDVZYLJKZWHC3QB/


