Re: Newer Linux kernel CephFS clients are more trouble?

On 5/13/22 09:38, Xiubo Li wrote:

On 5/12/22 12:06 AM, Stefan Kooman wrote:
Hi List,

We have quite a few Linux kernel clients for CephFS. One of our customers has been running mainline kernels (CentOS 7 elrepo) for the past two years. They started out with 3.x kernels (the CentOS 7 default), but upgraded to mainline when those kernels would frequently generate MDS warnings like "failing to respond to capability release". That worked fine until the 5.14 kernel: 5.14 and up would use a lot of CPU and *way* more bandwidth on CephFS than older kernels (an order of magnitude).

After the MDS was upgraded from Nautilus to Octopus that behavior is gone (comparable CPU / bandwidth usage to older kernels). However, the newer kernels are now the ones that give "failing to respond to capability release", and worse, clients get evicted (unresponsive as far as the MDS is concerned). Even the latest 5.17 kernels have that. No difference is observed between using messenger v1 or v2. The MDS version is 15.2.16.

Surprisingly, the latest stable kernels from CentOS 7 now work flawlessly. Although that is good news, newer operating systems come with newer kernels.

Does anyone else observe the same behavior with newish kernel clients?
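
For reference, one way to pin down which clients are behind a "failing to respond to capability release" warning is the MDS session list, which shows each client's kernel version and how many caps it holds. A rough sketch only (MDS rank 0 and the daemon name below are placeholders; adjust for your cluster):

    # health detail names the client ids the MDS is complaining about
    ceph health detail

    # dump sessions on MDS rank 0: each entry shows num_caps, the client
    # address and client_metadata (including kernel_version)
    ceph tell mds.0 session ls

    # alternatively, via the admin socket on the MDS host
    ceph daemon mds.<daemon-name> session ls

Comparing num_caps and kernel_version across sessions should show whether only the newer kernels are piling up capabilities.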

It was a bit more subtle than that. Not all 5.16 kernels were the same:

5.16.14-1.el7.elrepo.x86_64
5.16.8-1.el7.elrepo.x86_64

The latter seems to be the one that introduced issues.

Regarding the 5.17.4-1.el7.elrepo.x86_64 ... it's the only one in that web cluster. The rest are on 3.10.0-1160.59.1.el7.x86_64. Might that be an issue?


There are some known bugs, which have been fixed or are being fixed recently, even in mainline, and I'm not sure whether they are related, such as [1][2][3][4]. For more detail, please see the ceph-client repo testing branch [5].

I've checked all the trackers, but none of the issues described there were applicable to the running kernels.



I have never seen the "failing to respond to capability release" issue yet. If you have the MDS logs (debug_mds = 25 and debug_ms = 1) and kernel debug logs, that would help to debug it further; otherwise, please provide the steps to reproduce it.

debug_mds=20 gives 2 GB of logging output in less than 10 seconds on our cluster. Not sure what debug_mds=25 would give. Will try to gather output when the issue appears again.
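
In case it helps, a rough sketch of how the logging could be scoped to a single MDS daemon at runtime and bumped only while reproducing, plus the kernel-side dynamic debug switch on the client (the daemon name is a placeholder, and the client side assumes dynamic debug is available and debugfs is mounted):

    # raise logging on one MDS daemon only, at runtime (very verbose,
    # revert as soon as the issue has been captured)
    ceph tell mds.<daemon-name> config set debug_mds 25
    ceph tell mds.<daemon-name> config set debug_ms 1

    # revert to (roughly) the defaults afterwards
    ceph tell mds.<daemon-name> config set debug_mds 1/5
    ceph tell mds.<daemon-name> config set debug_ms 0/5

    # on the kernel client (as root): enable dynamic debug for the cephfs
    # and messenger modules; output ends up in dmesg / the kernel log
    echo 'module ceph +p'    > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control

    # and switch it off again
    echo 'module ceph -p'    > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control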

Gr. Stefan

[1] https://tracker.ceph.com/issues/55332
[2] https://tracker.ceph.com/issues/55421
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2063929
[4] https://tracker.ceph.com/issues/55377
[5] https://github.com/ceph/ceph-client/commits/testing


