Hi Xuehan,
Are there any smartctl warnings for the drives those OSDs are associated
with? The first step is probably to verify that the HW and OS are still
functioning/sane. These days I don't see as many problems, but in the
past we've run into (consumer grade) HW that lied about durability to
gain performance. We also occasionally ran into situations in the
filestore days where, inexplicably, -o nobarrier was set on the mount
and no one knew how. :P
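Something like the following (just a rough sketch, not Ceph tooling; it
assumes smartmontools and hdparm are installed and that you substitute
your actual OSD data devices) can batch the basic checks across a node:

#!/usr/bin/env python3
# Rough helper (hypothetical, not part of Ceph): run basic durability
# sanity checks on the block devices backing the OSDs. Assumes
# smartmontools and hdparm are installed; replace DEVICES with your own.
import subprocess

DEVICES = ["/dev/sdb", "/dev/sdc"]  # placeholder device names

def run(cmd):
    """Run a command and return its stdout (non-zero exit is ignored)."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

for dev in DEVICES:
    # SMART overall health self-assessment; anything other than PASSED
    # is worth chasing before suspecting Ceph.
    health = run(["smartctl", "-H", dev])
    # Volatile write cache state; an enabled cache on a drive without
    # power-loss protection is one way consumer HW "lies" about durability.
    wcache = run(["hdparm", "-W", dev])
    print(dev)
    print(health.strip())
    print(wcache.strip())
    print("-" * 40)

Anything other than PASSED from smartctl, or an enabled volatile write
cache on a drive without power-loss protection, would be worth chasing
before blaming Ceph.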
It certainly could be a bug in Ceph, but my first course of action in
these kinds of situations is to look at the whole system and any
potential points of corruption in the entire write path and start
minimizing the problem space by ruling things out. If you think the HW
and OS are all functioning correctly, the next step would be to
investigate the state of that object to see if you can determine what
was being done to it that left it in the corrupted state. Not easy
work, but it might help shed light on what's going on.
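One way to poke at that object: the per-4 KiB checksums BlueStore uses
can be recomputed outside the OSD. The sketch below is only an
illustration (not Ceph tooling); it assumes you've exported the affected
object from each replica to a local file (e.g. with ceph-objectstore-tool
while the OSD is stopped), that the data is stored uncompressed, and that
the third-party crc32c Python package is installed (pip install crc32c):

#!/usr/bin/env python3
# Hypothetical helper: recompute crc32c per 4 KiB block of an exported
# object copy so it can be compared against the values in the OSD log
# and against copies exported from other replicas.
import sys
import crc32c  # third-party package: pip install crc32c

BLOCK = 0x1000  # the log reports crc32c/0x1000, i.e. 4 KiB blocks

def block_crcs(path):
    """Yield (byte offset, crc32c) for each 4 KiB block of the file."""
    with open(path, "rb") as f:
        off = 0
        while True:
            buf = f.read(BLOCK)
            if not buf:
                break
            yield off, crc32c.crc32c(buf)
            off += len(buf)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"== {path}")
        for off, crc in block_crcs(path):
            print(f"  offset {off:#x}: crc32c {crc:#010x}")

Lining those values up against the got/expected values in the log, and
across the copies from each replica, should at least localize which 4 KiB
block diverged and whether all replicas agree on it.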
The good news is that bluestore actually detected the crc mismatch.
Mark
On 4/27/21 3:54 AM, Xuehan Xu wrote:
Hi, everyone.
Recently, one of our online clusters experienced a whole-cluster power
outage, and after power was restored, many OSDs started to log the
following error:
2021-04-27 15:38:05.503 2b372b957700 -1
bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x36000, got 0x41fe1397, expected 0x8d7f5975,
device location [0xa7e76000~1000], logical extent 0x1b6000~1000,
object #9:45a4e02a:::rbd_data.3b35df93038d.0000000000000095:head#
2021-04-27 15:38:05.504 2b372b957700 -1
bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x36000, got 0x41fe1397, expected 0x8d7f5975,
device location [0xa7e76000~1000], logical extent 0x1b6000~1000,
object #9:45a4e02a:::rbd_data.3b35df93038d.0000000000000095:head#
2021-04-27 15:38:05.505 2b372b957700 -1
bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x36000, got 0x41fe1397, expected 0x8d7f5975,
device location [0xa7e76000~1000], logical extent 0x1b6000~1000,
object #9:45a4e02a:::rbd_data.3b35df93038d.0000000000000095:head#
2021-04-27 15:38:05.506 2b372b957700 -1
bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x36000, got 0x41fe1397, expected 0x8d7f5975,
device location [0xa7e76000~1000], logical extent 0x1b6000~1000,
object #9:45a4e02a:::rbd_data.3b35df93038d.0000000000000095:head#
2021-04-27 15:38:28.379 2b372c158700 -1
bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x40000, got 0xce935e16, expected 0x9b502da7,
device location [0xa9a80000~1000], logical extent 0x80000~1000, object
#9:c2a6d9ae:::rbd_data.3b35df93038d.0000000000000696:head#
We are using Nautilus 14.2.10, with RocksDB on SSDs and the BlueStore
data on SATA disks. It seems that BlueStore didn't survive the power
outage. Is it supposed to behave this way? Is there any way to prevent
it?
Thanks:-)
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx