Hi Igor,

We'll take a look at disabling swap on the nodes and see if that improves the situation.

Having checked across all OSDs, we're not seeing bluestore_reads_with_retries at anything other than a zero value. We see the error anywhere from 3 to 10 times a week, but it's usually only one or two PGs that are inconsistent at any one time.
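
In case it's useful, here's a rough sketch of that kind of sweep (untested as written). It assumes "ceph tell osd.<id> perf dump" is accepted, as on Octopus (otherwise "ceph daemon osd.<id> perf dump" on each OSD host would do), and it searches the dump for the counter name rather than assuming where it is nested:

#!/usr/bin/env python3
# Sweep all OSDs and report any with a non-zero bluestore_reads_with_retries.
# Assumes all OSDs are up and that the caller has a suitable admin keyring.
import json
import subprocess

def find_counter(node, name):
    # Recursively search the perf dump JSON for a counter by name.
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                return value
            found = find_counter(value, name)
            if found is not None:
                return found
    return None

osd_ids = json.loads(subprocess.check_output(["ceph", "osd", "ls", "--format", "json"]))
for osd in osd_ids:
    dump = json.loads(subprocess.check_output(["ceph", "tell", f"osd.{osd}", "perf", "dump"]))
    retries = find_counter(dump, "bluestore_reads_with_retries")
    if retries:  # only report OSDs where the counter is non-zero
        print(f"osd.{osd}: bluestore_reads_with_retries = {retries}")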
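
Also, Janne's point below that this checksum corresponds to reading back all zeroes should be easy to double-check locally with something like the snippet below. It computes crc32c over a zeroed 4 KiB block (the errors all say crc32c/0x1000) and needs the third-party "crc32c" Python package; since, as far as I can tell, BlueStore seeds its crc32c with -1 and skips the final bit-inversion that the standard CRC-32C applies, it prints both forms so you can compare against the "got" value in the logs.

# Does the "got" value correspond to a zero-filled csum block?
# Requires the third-party crc32c package: pip install crc32c
import crc32c

block = bytes(0x1000)       # a zeroed 4 KiB block, matching crc32c/0x1000 in the errors
std = crc32c.crc32c(block)  # standard CRC-32C: register seeded with -1, final xor with 0xffffffff
raw = std ^ 0xFFFFFFFF      # the same calculation without the final xor
print(f"standard crc32c of zeros: {std:#010x}")
print(f"without the final xor:    {raw:#010x}")  # compare against 0x6706be76 from the logs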
Thanks,
Welby

On Mon, Sep 14, 2020 at 12:17 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

> Hi David,
>
> you might want to try disabling swap on your nodes. It looks like there is
> some implicit correlation between such read errors and enabled swapping.
>
> Also, I'm wondering whether you can observe non-zero values for the
> "bluestore_reads_with_retries" performance counter on your OSDs. How
> widespread are these cases? How high does this counter get?
>
>
> Thanks,
>
> Igor
>
>
> On 9/9/2020 4:59 PM, David Orman wrote:
> > Right, you can see the previously referenced ticket/bug in the link I had
> > provided. It's definitely not an unknown situation.
> >
> > We have another one today:
> >
> > debug 2020-09-09T06:49:36.595+0000 7f570871d700 -1 bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device location [0x2f387d70000~1000], logical extent 0xe0000~1000, object 0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
> >
> > debug 2020-09-09T06:49:36.611+0000 7f570871d700 -1 bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device location [0x2f387d70000~1000], logical extent 0xe0000~1000, object 0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
> >
> > debug 2020-09-09T06:49:36.611+0000 7f570871d700 -1 bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device location [0x2f387d70000~1000], logical extent 0xe0000~1000, object 0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
> >
> > debug 2020-09-09T06:49:36.611+0000 7f570871d700 -1 bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device location [0x2f387d70000~1000], logical extent 0xe0000~1000, object 0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#
> >
> > debug 2020-09-09T06:49:37.315+0000 7f570871d700 -1 log_channel(cluster) log [ERR] : 2.3fe shard 123(0) soid 2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head : candidate had a read error
> >
> > debug 2020-09-09T06:57:08.930+0000 7f570871d700 -1 log_channel(cluster) log [ERR] : 2.3fes0 deep-scrub 0 missing, 1 inconsistent objects
> >
> > debug 2020-09-09T06:57:08.930+0000 7f570871d700 -1 log_channel(cluster) log [ERR] : 2.3fe deep-scrub 1 errors
> >
> > This happens across the entire cluster, not just one server, so we don't
> > think it's faulty hardware.
> >
> > On Wed, Sep 9, 2020 at 12:51 AM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
> >
> >> I googled "got 0x6706be76, expected" and found some hits regarding Ceph,
> >> so whatever it is, you are not the first, and that number has some
> >> internal meaning.
> >> A Red Hat solution for a similar issue says that this checksum corresponds
> >> to seeing all zeroes, and hints at a bad write cache on the controller or
> >> something that ends up clearing data instead of writing the correct
> >> information on shutdowns.
> >>
> >>
> >> On Tue, Sep 8, 2020 at 23:21, David Orman <ormandj@xxxxxxxxxxxx> wrote:
> >>
> >>> We're seeing repeated inconsistent PG warnings, generally on the order
> >>> of 3-10 per week.
> >>>
> >>>     pg 2.b9 is active+clean+inconsistent, acting [25,117,128,95,151,15]
> >>>
> >>> Every time we look at them, we see the same checksum (0x6706be76):
> >>>
> >>> debug 2020-08-13T18:39:01.731+0000 7fbc037a7700 -1 bluestore(/var/lib/ceph/osd/ceph-25) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0x61f2021c, device location [0x12b403c0000~1000], logical extent 0x0~1000, object 2#2:0f1a338f:::rbd_data.3.20d195d612942.0000000001db869b:head#
> >>>
> >>> This looks a lot like: https://tracker.ceph.com/issues/22464
> >>> That said, we've got the following versions in play (cluster was created
> >>> with 15.2.3):
> >>> ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
> >>>
> >>
> >> --
> >> May the most significant bit of your life be positive.
> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx