Hello all,
Just to bring this knowledge to a wider audience:
Under some circumstances OSDs/clusters might report (and even suffer
from) spurious disk read errors. The comment re-posted below sheds
light on the root cause. Many thanks to the folks at Canonical for that.
Originally posted at: https://tracker.ceph.com/issues/22464#note-72
"
At Canonical we tracked down and solved the cause of this bug. Credit to
my colleague Mauricio Faria de Oliveira for identifying and fixing the
issue. We fixed this a little while ago but this bug never got updated
with the details, so adding them for future travellers.
The true cause is a bug in the Linux MADV_FREE implementation, which was
first introduced in Linux v4.5. It's a race condition between MADV_FREE
and Direct I/O that is triggered under memory pressure.
The upstream kernel fix, with a very detailed analysis in the commit
message, is here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6c8e2a256915a223f6289f651d6b926cd7135c9e
MADV_FREE is not used by Ceph directly but by tcmalloc (gperftools),
which enabled it based on compile-time detection. In 2016 gperftools
disabled the use of MADV_FREE on Linux because it was untested; that
change was released in v2.5.90 and v2.6+.
Hence, to hit this issue you needed a tcmalloc that was compiled on
Linux v4.5+, was running on Linux v4.5+, and predated the release in
which MADV_FREE support was intentionally disabled. See this issue for
details on disabling MADV_FREE:
https://github.com/gperftools/gperftools/issues/780
This was the case in Ubuntu Bionic 18.04, which shipped gperftools v2.5.
It seems many have moved on since, but if you do experience this,
upgrade to a kernel that includes the above fix:
mm: fix race between MADV_FREE reclaim and blkdev direct IO read
It typically manifests in two different ways. Sometimes the checksum
fails at the bluefs layer, in which case newer Ceph versions retry the
read, which often works around it since you are unlikely to hit the
race twice. But you can also hit it in rocksdb, which crashes the OSD.
"
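The version conditions from the quoted comment can be turned into a rough
check. Below is a minimal sketch in Python, not an official tool. The
function names are made up for illustration, and the gperftools version has
to be supplied by hand (e.g. from your package manager). It cannot tell
whether tcmalloc was compiled on Linux v4.5+, nor whether a distro kernel
already carries a backport of the fix, so treat a True result only as
"worth a closer look".

```python
import platform
import re

# Thresholds taken from the quoted comment above.
MADV_FREE_INTRODUCED = (4, 5)             # Linux version that introduced MADV_FREE
TCMALLOC_MADV_FREE_DISABLED = (2, 5, 90)  # first gperftools release with it disabled


def parse_version(text):
    """Return the leading dotted numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def possibly_affected(kernel_release, gperftools_version):
    """Hypothetical helper: True if both versions fall into the range described above.

    Assumption: version numbers alone are used. Kernel backports of the fix
    and the build environment of tcmalloc are NOT taken into account.
    """
    kernel = parse_version(kernel_release)
    tcmalloc = parse_version(gperftools_version)
    return kernel >= MADV_FREE_INTRODUCED and tcmalloc < TCMALLOC_MADV_FREE_DISABLED


if __name__ == "__main__":
    # "2.5" is a placeholder; replace it with the gperftools version on your host.
    print(possibly_affected(platform.release(), "2.5"))
```

If this reports True, the reliable next step is still to check whether the
running kernel includes the commit referenced in the comment above.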
--
Igor Fedotov
Ceph Lead Developer