Hi Jay,
this alert was indeed introduced in Pacific. That's probably why you
haven't seen it before.
And it definitely implies read retries; the following output mentions
that explicitly:
HEALTH_WARN 1 OSD(s) have spurious read errors [WRN] BLUESTORE_SPURIOUS_READ_ERRORS: 1 OSD(s) have spurious read errors
osd.117 reads with retries: 1
"reads with retries" is actually a replica of
"bluestore_reads_with_retries" perf counter at the corresponding OSD
hence one can monitor it directly with "ceph daemon osd.N perf dump"
command.
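For example, a monitoring script could poll that counter roughly like
this (just a sketch, assuming the ceph CLI is available on the OSD host
and that the counter lives in the "bluestore" section of the perf dump;
osd.117 is used only as an example id):

import json
import subprocess

# Hypothetical helper: query the OSD's admin socket via the ceph CLI.
# This has to run on the host where the OSD daemon lives.
def reads_with_retries(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.{}".format(osd_id), "perf", "dump"])
    perf = json.loads(out)
    # Assumes the counter sits under the "bluestore" section of the dump.
    return perf["bluestore"]["bluestore_reads_with_retries"]

print("osd.117 reads with retries:", reads_with_retries(117))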
Additionally, one can increase the "debug bluestore" log level to 5 to get
the relevant logging output in the OSD log; here is the code line that
prints it:
dout(5) << __func__ << " read at 0x" << std::hex << offset << "~" << length
        << " failed " << std::dec << retry_count
        << " times before succeeding" << dendl;
Thanks,
Igor
On 6/22/2021 2:10 AM, Jay Sullivan wrote:
In the week since upgrading one of our clusters from Nautilus 14.2.21 to Pacific 16.2.4, I've seen four spurious read errors that always have the same bad checksum of 0x6706be76. I've never seen this in any of our clusters before. Here's an example of what I'm seeing in the logs:
ceph-osd.132.log:2021-06-20T22:53:20.584-0400 7fde2e4fc700 -1 bluestore(/var/lib/ceph/osd/ceph-132) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0xee74a56a, device location [0x18c81b40000~1000], logical extent 0x200000~1000, object #29:2d8210bf:::rbd_data.94f4232ae8944a.0000000000026c57:head#
I'm not seeing any indication of inconsistent PGs, only the spurious read error. I don't see an explicit indication of a retry in the logs following the above message. Bluestore code to retry three times was introduced in 2018 following a similar issue with the same checksum: https://tracker.ceph.com/issues/22464
Here's an example of what my health detail looks like:
HEALTH_WARN 1 OSD(s) have spurious read errors [WRN] BLUESTORE_SPURIOUS_READ_ERRORS: 1 OSD(s) have spurious read errors
osd.117 reads with retries: 1
I followed this (unresolved) thread, too: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/DRBVFQLZ5ZYMNPKLAWS5AR4Z2MJQYLLC/
I do have swap enabled, but I don't think memory pressure is an issue with 30GB available out of 96GB (and no sign I've been close to summoning the OOMkiller). The OSDs that have thrown the cluster into HEALTH_WARN with the spurious read errors are busy 12TB rotational HDDs and I _think_ it's only happening during a deep scrub. We're on Ubuntu 18.04; uname: 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux.
Does Pacific retry three times on a spurious read error? Would I see an indication of a retry in the logs?
Thanks!
~Jay
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx