Re: Clients failing to respond to capability release with LTS kernel 6.6

Hi Alex,

Thanks for your reply! We download the official vanilla kernel.org sources with git, merge RSBAC and a few small changes on top, and compile our own packages. No distro patches are involved. The Ceph part has only very small RSBAC-related changes; I am sure they are not the problem.

Unfortunately, the bug only shows up under load in customer clusters, and few of these customers welcome experiments on their production clusters - they lose internet access if the cluster breaks.

That said, I will ask our support people to save any log snippets they can get their hands on and pass them on to you.

Regards,

Amon.

On 18.12.24 at 16:05, Alex Markuze wrote:
Hi Amon,
We are already investigating similar issues; if possible, two things
might be of help to us:
1. A test scenario to reproduce the issue
2. A dmesg log, if it contains any errors or warnings.
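As a side note, a minimal sketch of how such a dmesg snippet could be filtered on an affected client before handing it over (the file path and the match pattern here are assumptions for illustration, not part of the original report):

```shell
#!/bin/sh
# Hypothetical helper: extract Ceph-client-related lines from a saved dmesg dump.
# The pattern and paths are assumptions; adjust for the affected host.
collect_ceph_logs() {
    grep -iE 'ceph|libceph|mds' "$1"
}

# Demonstration on a small sample file instead of a live dmesg:
printf 'libceph: mon0 session established\nusb 1-1: new high-speed USB device\n' \
    > /tmp/sample_dmesg.txt
collect_ceph_logs /tmp/sample_dmesg.txt
```

On a real client, the input would be the output of `dmesg` saved to a file around the time the MDS_CLIENT_LATE_RELEASE warning appeared.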

LTS kernels: I assume Ubuntu? Knowing the exact kernel version would
help with bisecting and finding what caused the degradation.

On Wed, Dec 18, 2024 at 11:39 AM Amon Ott <a.ott@xxxxxxxxxxxx> wrote:

Hi!

After changing client systems from kernel 5.10 to 6.6 about a year ago,
we got many of these messages:

Health check failed: 1 clients failing to respond to capability
release (MDS_CLIENT_LATE_RELEASE)

Recent MDS changes provide a workaround that at least avoids going
read-only, but it can still lead to hanging Ceph requests.

Kernel 6.6.55 brought fixes that seemed to help a bit; this one might be
relevant:
      ceph: fix cap ref leak via netfs init_request
      commit ccda9910d8490f4fb067131598e4b2e986faa5a0 upstream.

However, with 6.6.58 we still got some of these messages and hanging
requests. There seem to have been no relevant Ceph fixes after that, so
we have not dared to test since.

As these clusters are in production use, we switched back to kernel 5.10,
which has been working with Ceph without problems for some years.
All our tests show that this problem is related only to the kernel
client version; it happens with various Ceph server versions from 10 to 19.

We would appreciate it if someone with deeper knowledge of the Ceph kernel
client could look into this problem again.

In January (after the Christmas break) we could test any proposed fixes on
affected customer systems. The new LTS kernel 6.12 would be fine
for us, too, although it does not seem to have any relevant Ceph changes
either.

Thanks for your work!


Amon Ott
--
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Werner-Voß-Damm 62       Fax: +49 30 99296856
12101 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649

