Hi Alex,
thanks for your reply! We download official vanilla kernel.org sources
with git, merge RSBAC and some small things on top and compile our own
packages. No distro patches involved. The Ceph part has only very small
RSBAC-related changes; I am sure they are not the problem.
Unfortunately, the bug only shows under load in customer clusters. Few
of these customers like experiments on production clusters: if the
cluster breaks, they have no internet access.
That said, I will ask our support people to save any log snippets they
can get their hands on and pass them over to you.
Regards,
Amon.
On 18.12.24 at 16:05, Alex Markuze wrote:
Hi Amon,
We are already investigating similar issues. If possible, two things
would help us:
1. A test scenario that reproduces the issue
2. A dmesg log if it contains any errors or warnings.
For the LTS kernels, I assume Ubuntu? Knowing the exact kernel version
would help with bisecting and finding what caused the degradation.
On Wed, Dec 18, 2024 at 11:39 AM Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
Hi!
After changing client systems from kernel 5.10 to 6.6 about a year ago,
we got many of these messages:
Health check failed: 1 clients failing to respond to capability
release (MDS_CLIENT_LATE_RELEASE)
Recent MDS changes provide a workaround that at least avoids going
read-only, but it can still lead to hanging Ceph requests.
Kernel 6.6.55 brought fixes that seemed to help a bit; this one might be
relevant:
ceph: fix cap ref leak via netfs init_request
commit ccda9910d8490f4fb067131598e4b2e986faa5a0 upstream.
However, with 6.6.58 we still got some of these messages and hanging
requests. There seem to have been no relevant Ceph fixes after that, so
we have not dared testing since.
As these clusters are in production use, we switched back to kernel 5.10
again, which has been working with Ceph without problems for some years.
All our tests show that this problem depends only on the kernel client
version; it happens with various Ceph server versions from 10 to 19.
We would appreciate it if someone with deeper knowledge of the Ceph kernel
client could look into this problem again.
In January (after the Xmas break) we could test on affected customer
systems with any proposed fixes. The new LTS kernel 6.12 would be fine
for us, too. It does not seem to have any relevant Ceph changes either,
though.
Thanks for your work!
Amon Ott
--
Dr. Amon Ott
m-privacy GmbH Tel: +49 30 24342334
Werner-Voß-Damm 62 Fax: +49 30 99296856
12101 Berlin http://www.m-privacy.de
Amtsgericht Charlottenburg, HRB 84946
Geschäftsführer:
Dipl.-Kfm. Holger Maczkowsky,
Roman Maczkowsky
GnuPG-Key-ID: 0x2DD3A649