Re: ceph pgs inconsistent, always the same checksum

Right, you can see the previously referenced ticket/bug in the link I had
provided. It's definitely not an unknown situation.

We have another one today:

debug 2020-09-09T06:49:36.595+0000 7f570871d700 -1
bluestore(/var/lib/ceph/osd/ceph-123) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x60000, got 0x6706be76, expected 0x929a618, device
location [0x2f387d70000~1000], logical extent 0xe0000~1000, object
0#2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head#

[the same _verify_csum error is then repeated three more times at
06:49:36.611, identical in every field]

debug 2020-09-09T06:49:37.315+0000 7f570871d700 -1 log_channel(cluster) log
[ERR] : 2.3fe shard 123(0) soid
2:7ff493bc:::rbd_data.3.20d195d612942.0000000004228a96:head : candidate had
a read error

debug 2020-09-09T06:57:08.930+0000 7f570871d700 -1 log_channel(cluster) log
[ERR] : 2.3fes0 deep-scrub 0 missing, 1 inconsistent objects

debug 2020-09-09T06:57:08.930+0000 7f570871d700 -1 log_channel(cluster) log
[ERR] : 2.3fe deep-scrub 1 errors
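
For anyone else chasing these, this is roughly how we dig into one of them
(2.3fe being the PG from the scrub errors above):

    ceph health detail                                      # list inconsistent PGs
    rados list-inconsistent-obj 2.3fe --format=json-pretty  # per-shard error detail
    ceph pg repair 2.3fe                                    # repair once the bad shard is known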

This happens across the entire cluster, not just one server, so we don't
think it's faulty hardware.
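
The recurring 0x6706be76 lines up with what Janne describes below: it
appears to be the crc32c of a 4 KiB block of zeros, i.e. the bad reads are
coming back empty. A quick pure-Python sketch to check this, assuming I
have BlueStore's convention right (crc32c seeded with -1 and no final XOR):

    POLY = 0x82F63B78  # CRC-32C (Castagnoli) polynomial, reflected form

    def ceph_crc32c(data: bytes, crc: int = 0xFFFFFFFF) -> int:
        """Bitwise CRC-32C, seeded with -1 and without the final XOR."""
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
        return crc

    # BlueStore's csum block here is 0x1000 (4 KiB); if the zero-block
    # theory holds, a zeroed block should reproduce the checksum seen in
    # every one of these errors.
    print(hex(ceph_crc32c(bytes(0x1000))))  # expect 0x6706be76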

On Wed, Sep 9, 2020 at 12:51 AM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:

> I googled "got 0x6706be76, expected" and found some hits regarding ceph,
> so whatever it is, you are not the first, and that number has some internal
> meaning.
> A Red Hat solution for a similar issue says that checksum corresponds to
> all zeroes, and hints at a bad write cache on the controller, or something
> that ends up clearing data instead of writing the correct information on
> shutdown.
>
>
> Den tis 8 sep. 2020 kl 23:21 skrev David Orman <ormandj@xxxxxxxxxxxx>:
>
>>
>> We're seeing repeated inconsistent PG warnings, generally on the order of
>> 3-10 per week.
>>
>>     pg 2.b9 is active+clean+inconsistent, acting [25,117,128,95,151,15]
>>
>> Every time we look at them, we see the same checksum (0x6706be76):
>>
>> debug 2020-08-13T18:39:01.731+0000 7fbc037a7700 -1
>> bluestore(/var/lib/ceph/osd/ceph-25) _verify_csum bad crc32c/0x1000
>> checksum at blob offset 0x0, got 0x6706be76, expected 0x61f2021c, device
>> location [0x12b403c0000~1000], logical extent 0x0~1000, object
>> 2#2:0f1a338f:::rbd_data.3.20d195d612942.0000000001db869b:head#
>>
>> This looks a lot like: https://tracker.ceph.com/issues/22464
>> That said, we've got the following versions in play (cluster was created
>> with 15.2.3):
>> ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus
>> (stable)
>
> --
> May the most significant bit of your life be positive.
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


