Why OSD could report spurious read errors.

Hello All!

just to bring this knowledge to a wider audience...

Under some circumstances, OSDs (and whole clusters) might report, and even suffer from, spurious disk read errors. The comment re-posted below sheds light on the root cause. Many thanks to the folks at Canonical for that.

Originally posted at: https://tracker.ceph.com/issues/22464#note-72

"

At Canonical we tracked down and solved the cause of this bug. Credit to my colleague Mauricio Faria de Oliveira for identifying and fixing the issue. We fixed this a little while ago but this bug never got updated with the details, so adding them for future travellers.

The true cause is a bug in the Linux MADV_FREE implementation, which was first introduced in Linux v4.5. It's a race condition between MADV_FREE and Direct I/O that is triggered under memory pressure.

The upstream kernel fix, with a very detailed analysis in the commit message, is here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6c8e2a256915a223f6289f651d6b926cd7135c9e

MADV_FREE is used not by Ceph directly but by tcmalloc (gperftools), which enabled it based on compile-time detection. In 2016 the gperftools developers disabled the use of MADV_FREE on Linux because it was untested; that change was released in v2.5.90 and v2.6+.

Hence, to hit this issue you needed a tcmalloc that was compiled on Linux v4.5+, was running on Linux v4.5+, and predated the intentional disabling of MADV_FREE support. See this issue for details on disabling MADV_FREE:
https://github.com/gperftools/gperftools/issues/780

This was the case in Ubuntu Bionic 18.04, which shipped gperftools v2.5.

It seems many have moved on since, but if you do experience this, upgrade to a kernel that includes the above fix:
mm: fix race between MADV_FREE reclaim and blkdev direct IO read

It typically manifests in one of two ways. Sometimes the checksum fails at the bluefs layer, in which case newer Ceph versions added a retry on the read, which often works around the problem since you are unlikely to hit the race twice. But you can also hit it in rocksdb, which crashes the OSD.
"
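
The conditions in the quoted comment can be turned into a rough exposure check. The sketch below is only an illustration: the function names `parse_version` and `may_be_exposed` are hypothetical, and the check encodes just the two thresholds named above (kernel v4.5+ and gperftools older than v2.5.90). It cannot tell whether a distribution kernel already carries a backport of the fix, nor which kernel tcmalloc was compiled on, so a positive result is a reason to check the kernel changelog, not a diagnosis.

```python
import platform
import re


def parse_version(text):
    """Return the leading dotted numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def may_be_exposed(kernel_release, gperftools_version):
    """Rough check based only on the version thresholds from the quoted comment.

    Assumption: a kernel of v4.5 or later WITHOUT the fix is treated as affected;
    this function has no way to detect a backported fix.
    """
    kernel = parse_version(kernel_release)
    tcmalloc = parse_version(gperftools_version)
    # MADV_FREE exists from Linux v4.5; gperftools stopped using it in v2.5.90.
    return kernel >= (4, 5) and tcmalloc < (2, 5, 90)


if __name__ == "__main__":
    # The gperftools version is supplied by hand here; "2.5" is the Bionic example.
    print(may_be_exposed(platform.release(), "2.5"))
```

The gperftools version has to come from the package manager of the host in question; the script does not try to discover it.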

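The retry workaround mentioned for the checksum-failure case can be sketched generically. This is not Ceph's code: `read_with_retry`, `read_block`, `ChecksumError` and the retry count are hypothetical names chosen for illustration, and `zlib.crc32` stands in for whatever checksum the storage layer actually uses. The idea is only that a mismatch caused by a transient race tends to disappear when the same read is issued again.

```python
import zlib


class ChecksumError(Exception):
    """Raised when every read attempt returned data with a bad checksum."""


def read_with_retry(read_block, expected_crc, max_attempts=3):
    """Call read_block() until its data matches expected_crc.

    read_block is any zero-argument callable returning bytes (hypothetical
    stand-in for a real device read). Returns (data, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        data = read_block()
        if zlib.crc32(data) == expected_crc:
            return data, attempt
    raise ChecksumError(f"checksum mismatch after {max_attempts} attempts")


if __name__ == "__main__":
    good = b"payload"
    # Simulate one corrupted read followed by a clean one.
    reads = iter([b"\x00" * len(good), good])
    data, attempts = read_with_retry(lambda: next(reads), zlib.crc32(good))
    print(data, attempts)
```

A retry of this kind only hides the symptom on the read path that performs it, which is why the quoted advice is still to upgrade the kernel.
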
--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
