Hi!
After changing client systems from kernel 5.10 to 6.6 about a year ago,
we got many of these messages:
Health check failed: 1 clients failing to respond to capability
release (MDS_CLIENT_LATE_RELEASE)
Recent MDS changes provide a workaround that at least avoids going
read-only, but it can still lead to hanging Ceph requests.
Kernel 6.6.55 brought fixes that seemed to help a bit. This one might be
relevant:
ceph: fix cap ref leak via netfs init_request
commit ccda9910d8490f4fb067131598e4b2e986faa5a0 upstream.
However, with 6.6.58 we still got some of these messages and hanging
requests. There seem to have been no relevant Ceph fixes after that, so
we have not dared to test newer kernels since.
As these clusters are in production use, we switched back to kernel 5.10
again, which has been working with Ceph without problems for some years.
All our tests show that this problem depends only on the kernel client
version; it happens with various Ceph server versions from 10 to 19.
We would appreciate it if someone with deeper knowledge of the Ceph kernel
client could look into this problem again.
In January (after the Xmas break) we could test any proposed fixes on
affected customer systems. The new LTS kernel 6.12 would be fine
for us, too. It does not seem to have any relevant Ceph changes either,
though.
Thanks for your work!
Amon Ott
--
Dr. Amon Ott
m-privacy GmbH Tel: +49 30 24342334
Werner-Voß-Damm 62 Fax: +49 30 99296856
12101 Berlin http://www.m-privacy.de
Amtsgericht Charlottenburg, HRB 84946
Geschäftsführer:
Dipl.-Kfm. Holger Maczkowsky,
Roman Maczkowsky
GnuPG-Key-ID: 0x2DD3A649