Why OSD could report spurious read errors.

Igor Fedotov <igor.fedotov@xxxxxxxx> · Fri, 23 Sep 2022 15:20:52 +0300

Hello All!

Just to bring this knowledge to a wider audience...

Under some circumstances, OSDs/clusters might report (and even suffer
from) spurious disk read errors. The comment re-posted below sheds
light on the root cause. Many thanks to the folks at Canonical for that.

Originally posted at: https://tracker.ceph.com/issues/22464#note-72

"

At Canonical we tracked down and solved the cause of this bug. Credit to 
my colleague Mauricio Faria de Oliveira for identifying and fixing the 
issue. We fixed this a little while ago but this bug never got updated 
with the details, so adding them for future travellers.

The true cause is a bug in the Linux MADV_FREE implementation, which was 
first introduced in Linux v4.5. It's a race condition between MADV_FREE 
and Direct I/O that is triggered under memory pressure.

Upstream kernel fix with very detailed analysis in the commit message is 
here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6c8e2a256915a223f6289f651d6b926cd7135c9e

Ceph does not use MADV_FREE directly; tcmalloc (gperftools) does.
tcmalloc enabled MADV_FREE based on a compile-time detection. In 2016
the gperftools developers disabled the use of MADV_FREE on Linux because
it was untested; that change was released in v2.5.90 and v2.6+.

Hence, to hit this issue you needed a tcmalloc build from before
MADV_FREE support was intentionally disabled, compiled on Linux v4.5+
and running on Linux v4.5+. See this issue for details on disabling
MADV_FREE:
https://github.com/gperftools/gperftools/issues/780

This was the case in Ubuntu Bionic 18.04, which shipped gperftools v2.5.

Many seem to have moved on since, but if you do experience this, upgrade
to a kernel that includes the above fix:
mm: fix race between MADV_FREE reclaim and blkdev direct IO read

It typically manifests in one of two ways. Sometimes the checksum fails
at the bluefs layer, in which case newer Ceph versions added a retry on
the read, which often works around it since you are unlikely to hit the
race twice. But you can also hit it in rocksdb, which crashes the OSD.
"

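For anyone who wants to check a host against the version conditions in
the quoted comment, here is a minimal Python sketch. The function names
are hypothetical and chosen only for illustration; nothing here is
provided by Ceph or gperftools. It checks just two things: a kernel of
v4.5 or later, and a gperftools version older than v2.5.90. It cannot
tell whether tcmalloc was actually built with MADV_FREE detected, nor
whether the running kernel already carries the fix, so treat a positive
result as "worth a closer look" rather than as a diagnosis.

```python
import re


def parse_version(text):
    """Return the leading dotted numeric part of a version string as a tuple of ints."""
    match = re.match(r"(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"no numeric version in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def kernel_has_madv_free(kernel_release):
    """MADV_FREE was first introduced in Linux v4.5 (per the quoted comment)."""
    return parse_version(kernel_release)[:2] >= (4, 5)


def tcmalloc_may_use_madv_free(gperftools_version):
    """gperftools disabled MADV_FREE on Linux in v2.5.90 and v2.6+.

    Older versions used it only if it was detected at compile time,
    which this check cannot see; hence "may".
    """
    return parse_version(gperftools_version) < (2, 5, 90)


def possibly_exposed(kernel_release, gperftools_version):
    """True if both version conditions from the quoted comment hold.

    Assumption: a kernel with the fix backported is NOT detected here,
    so this can report True on an already fixed system.
    """
    return (kernel_has_madv_free(kernel_release)
            and tcmalloc_may_use_madv_free(gperftools_version))
```

The kernel release string can be taken from `platform.release()` (or
`uname -r`); the gperftools version has to come from your distribution's
package manager, e.g. `possibly_exposed(platform.release(), "2.5")`.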
--
Igor Fedotov
Ceph Lead Developer

