Hi David,
you might want to try disabling swap on your nodes. It looks like there is
some correlation between these read errors and enabled swapping.
I'm also wondering whether you can observe non-zero values for the
"bluestore_reads_with_retries" performance counter on your OSDs. How
widespread are these cases, and how high does the counter get?
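Something like the rough sketch below (untested; the OSD ids are
placeholders, and with a containerized deployment you may need to run the
perf dump part from inside the OSD container) is what I have in mind for
pulling that counter, plus a quick check of whether swap is active on the
host:

#!/usr/bin/env python3
# Sketch: report whether swap is enabled on this host and dump the
# "bluestore_reads_with_retries" counter for a few OSDs.
import json
import subprocess

osd_ids = [25, 117, 123]  # placeholders: the OSDs you want to inspect

# Any line beyond the header in /proc/swaps means swap is enabled.
with open("/proc/swaps") as f:
    swap_devices = f.readlines()[1:]
print("swap enabled:", bool(swap_devices))

for osd in osd_ids:
    out = subprocess.run(
        ["ceph", "daemon", f"osd.{osd}", "perf", "dump"],
        capture_output=True, text=True, check=True,
    ).stdout
    perf = json.loads(out)
    # The counter should sit in the "bluestore" section; scan all sections
    # in case the layout differs between releases.
    value = None
    for section in perf.values():
        if isinstance(section, dict):
            for name, v in section.items():
                if name.endswith("reads_with_retries"):
                    value = v
    print(f"osd.{osd} reads_with_retries = {value}")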
Thanks,
Igor
On 9/9/2020 4:59 PM, David Orman wrote:
Right, you can see the previously referenced ticket/bug in the link I had
provided. It's definitely not an unknown situation.
We have another one today:
debug 2020-09-09T06:49:36.595+0000 7f570871d700 -1
bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device
location [0x2f387d70000~1000], logical extent 0xe0000~1000, object
0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
debug 2020-09-09T06:49:36.611+0000 7f570871d700 -1
bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device
location [0x2f387d70000~1000], logical extent 0xe0000~1000, object
0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
debug 2020-09-09T06:49:36.611+0000 7f570871d700 -1
bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device
location [0x2f387d70000~1000], logical extent 0xe0000~1000, object
0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
debug 2020-09-09T06:49:36.611+0000 7f570871d700 -1
bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device
location [0x2f387d70000~1000], logical extent 0xe0000~1000, object
0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
debug 2020-09-09T06:49:37.315+0000 7f570871d700 -1 log_channel(cluster) log
[ERR] : 2.3fe shard 123(0) soid
2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head : candidate had
a read error
debug 2020-09-09T06:57:08.930+0000 7f570871d700 -1 log_channel(cluster) log
[ERR] : 2.3fes0 deep-scrub 0 missing, 1 inconsistent objects
debug 2020-09-09T06:57:08.930+0000 7f570871d700 -1 log_channel(cluster) log
[ERR] : 2.3fe deep-scrub 1 errors
This happens across the entire cluster, not just one server, so we don't
think it's faulty hardware.
On Wed, Sep 9, 2020 at 12:51 AM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
I googled "got 0x6706be76, expected" and found a number of hits regarding
Ceph, so whatever it is, you are not the first to see it, and that number has
some internal meaning.
A Red Hat solution for a similar issue says that this checksum is what you
get when reading all zeroes, and hints at a bad write cache on the
controller, or something else that ends up clearing data instead of writing
the correct information on shutdown.
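If you want to sanity-check the "all zeroes" interpretation, the rough Python
sketch below should do it. I'm assuming (not verified) that BlueStore
computes crc32c with a 0xffffffff seed and no final XOR, so it prints both
that raw form and the standard CRC-32C of a 4 KiB zero block; if the Red Hat
note is right, one of the two should come out as 0x6706be76.

# Sketch: CRC-32C (Castagnoli) of a 4 KiB block of zeroes, printed both as
# the raw register value with a 0xffffffff seed and with the standard final
# XOR applied.
POLY = 0x82F63B78  # reflected Castagnoli polynomial

TABLE = []
for i in range(256):
    crc = i
    for _ in range(8):
        crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    TABLE.append(crc)

def crc32c_raw(data, crc=0xFFFFFFFF):
    # byte-at-a-time reflected CRC-32C, no final XOR applied
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]
    return crc

zeros = bytes(4096)  # one checksum chunk (0x1000) of all zeroes
raw = crc32c_raw(zeros)
print(hex(raw))               # raw form, 0xffffffff seed, no final XOR
print(hex(raw ^ 0xFFFFFFFF))  # standard CRC-32C convention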
On Tue, Sep 8, 2020 at 11:21 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:
We're seeing repeated inconsistent PG warnings, generally on the order of
3-10 per week.
pg 2.b9 is active+clean+inconsistent, acting [25,117,128,95,151,15]
Every time we look at them, we see the same checksum (0x6706be76):
debug 2020-08-13T18:39:01.731+0000 7fbc037a7700 -1
bluestore(/var/lib/ceph/osd/ceph-25) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x0, got 0x6706be76, expected 0x61f2021c, device
location [0x12b403c0000~1000], logical extent 0x0~1000, object
2#2:0f1a338f:::rbd_data.3.20d195d612942.0000000001db869b:head#
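For reference, the per-shard detail behind these can be pulled out of "rados
list-inconsistent-obj"; a rough, untested sketch (the pg id is taken from the
example above, and it only has data once the scrub that flagged the PG has
completed):

#!/usr/bin/env python3
# Sketch: print which shards of a flagged PG reported errors, based on the
# report from "rados list-inconsistent-obj".
import json
import subprocess

pgid = "2.b9"  # the PG flagged inconsistent above
out = subprocess.run(
    ["rados", "list-inconsistent-obj", pgid, "--format=json"],
    capture_output=True, text=True, check=True,
).stdout
report = json.loads(out)

for obj in report.get("inconsistents", []):
    name = obj["object"]["name"]
    for shard in obj.get("shards", []):
        if shard.get("errors"):
            print(f"{name}: osd.{shard['osd']} errors={shard['errors']}")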
This looks a lot like: https://tracker.ceph.com/issues/22464
That said, we've got the following versions in play (cluster was created
with 15.2.3):
ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus
(stable)
--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx