FWIW, we've had similar reports in the past:
https://tracker.ceph.com/issues/37282
https://tracker.ceph.com/issues/48002
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2GBK5NJFOSQGMN25GQ3CZNX4W2ZGQV5U/?sort=date
https://www.spinics.net/lists/ceph-users/msg59466.html
https://www.bountysource.com/issues/49313514-block-checksum-mismatch
...but we aren't the only ones:
https://github.com/facebook/rocksdb/issues/5251
https://github.com/facebook/rocksdb/issues/7033
https://jira.mariadb.org/browse/MDEV-20456
https://lists.launchpad.net/maria-discuss/msg05614.html
https://githubmemory.com/repo/openethereum/openethereum/issues/416
https://githubmemory.com/repo/FISCO-BCOS/FISCO-BCOS/issues/1895
https://groups.google.com/g/rocksdb/c/gUD4kCGTw-0/m/uLpFwkO5AgAJ
At least in one case on our side, the user was running consumer-grade SSDs
without power loss protection. I don't think we ever fully diagnosed
whether that was the cause, though. Another case was potentially related
to high memory usage on the node. Hardware errors are a legitimate concern
here, so checking dmesg/smartctl/etc. is probably warranted. ECC memory
obviously helps too (or rather, its absence makes diagnosis more
difficult).
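For anyone wanting to rule out hardware first, something like the
following is a reasonable starting point (a sketch; /dev/sdX is a
placeholder for the OSD's backing device):

  # look for I/O, ATA/NVMe, or memory errors in the kernel log
  dmesg -T | grep -iE 'error|fault|ata|nvme'

  # SMART health summary and error counters for the backing device
  smartctl -a /dev/sdX

  # kick off a long self-test; results show up later in: smartctl -l selftest /dev/sdX
  smartctl -t long /dev/sdX

On ECC hosts, the EDAC counters (e.g. via rasdaemon) are worth checking
for correctable/uncorrectable memory errors as well.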
For folks who have experienced this, any info you can give about the HW
involved would be helpful. We (and other projects) have seen similar
things over the years, but this is a notoriously difficult issue to track
down, given that it could be any one of many different things and it may
or may not be our code.
Mark
On 9/20/21 10:09 AM, Neha Ojha wrote:
Can we please create a bluestore tracker issue for this
(if one does not exist already), where we can start capturing all the
relevant information needed to debug this? Given that this has been
encountered in previous 16.2.* versions, it doesn't sound like a
regression in 16.2.6 to me, but rather an issue in pacific generally. In
any case, we'll prioritize fixing it.
Thanks,
Neha
On Mon, Sep 20, 2021 at 8:03 AM Andrej Filipcic <andrej.filipcic@xxxxxx> wrote:
On 20/09/2021 16:02, David Orman wrote:
Same question here, for clarity: was this on upgrading to 16.2.6 from
16.2.5, or upgrading from some other release?
From 16.2.5, but the OSD services were never restarted after the upgrade
to .5, so it could be a leftover from a previous issue.
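For what it's worth, a quick way to confirm which version each daemon is
actually running (as opposed to what's installed on disk) is:

  ceph versions            # summary of running versions per daemon type
  ceph tell osd.* version  # running version of each individual OSD

If those still report the old binaries after a package upgrade, the
daemons haven't been restarted yet.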
Cheers,
Andrej
On Mon, Sep 20, 2021 at 8:57 AM Sean <sean@xxxxxxxxx> wrote:
I also ran into this with v16. In my case, trying to run a repair totally
exhausted the RAM on the box, and the repair was unable to complete.
After removing/recreating the OSD, I did notice that it had a drastically
smaller OMAP size than the other OSDs. I don't know if that actually means
anything, but I wanted to mention it in case it does.
ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
14  hdd    10.91409  1.00000   11 TiB  3.3 TiB  3.2 TiB  4.6 MiB  5.4 GiB  7.7 TiB  29.81  1.02   34  up      osd.14
16  hdd    10.91409  1.00000   11 TiB  3.3 TiB  3.3 TiB   20 KiB  9.4 GiB  7.6 TiB  30.03  1.03   35  up      osd.16
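(For reference, the columns above are standard output of:

  ceph osd df        # per-OSD utilization, including OMAP and META
  ceph osd df tree   # the same, grouped by host for easier comparison

so the OMAP comparison is straightforward to repeat on any cluster.)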
~ Sean
On Sep 20, 2021 at 8:27:39 AM, Paul Mezzanini <pfmeec@xxxxxxx> wrote:
I got the exact same error on one of my OSDs when upgrading to 16. I
used it as an exercise in trying to fix a corrupt RocksDB. I spent a few
days of poking with no success; I mostly got tool crashes like the ones
you are seeing, with no forward progress.
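For anyone else attempting this, the usual offline tools are along these
lines (a sketch, assuming the OSD daemon is stopped and its data lives at
/var/lib/ceph/osd/ceph-<id>, where <id> is a placeholder):

  # offline consistency check of the OSD's BlueStore and embedded RocksDB
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>

  # attempt an offline repair
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>

  # inspect the embedded RocksDB directly
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> stats

These are the sorts of invocations that were crashing rather than making
progress.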
I eventually just gave up, purged the OSD, did a SMART long test on the
drive to be sure, and then threw it back into the mix. It's been
HEALTH_OK for a week now since it finished refilling the drive.
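For anyone following the same route, the replace-and-refill sequence
looks roughly like this (a sketch; <id> and /dev/sdX are placeholders,
and the last step assumes a cephadm-managed cluster):

  ceph osd out <id>                            # mark it out so data migrates off
  systemctl stop ceph-osd@<id>                 # stop the daemon (non-cephadm hosts)
  ceph osd purge <id> --yes-i-really-mean-it   # remove the OSD from the cluster
  smartctl -t long /dev/sdX                    # long self-test before reuse
  ceph orch daemon add osd <host>:/dev/sdX     # redeploy once the test passes

Then wait for backfill to finish and HEALTH_OK to return.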
On 9/19/21 10:47 AM, Andrej Filipcic wrote:
2021-09-19T15:47:13.610+0200 7f8bc1f0e700  2 rocksdb: [db_impl/db_impl_compaction_flush.cc:2344] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2427092066, got 4051549320 in db/251935.sst offset 18414386 size 4032, Accumulated background error counts: 1
2021-09-19T15:47:13.636+0200 7f8bbacf1700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2427092066, got 4051549320 in db/251935.sst offset 18414386 size 4032 code = 2
Rocksdb transaction:
--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: Andrej.Filipcic@xxxxxx
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-425-7074
-------------------------------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx