Re: rocksdb corruption with 16.2.6

FWIW, we've had similar reports in the past:


https://tracker.ceph.com/issues/37282

https://tracker.ceph.com/issues/48002

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2GBK5NJFOSQGMN25GQ3CZNX4W2ZGQV5U/?sort=date

https://www.spinics.net/lists/ceph-users/msg59466.html

https://www.bountysource.com/issues/49313514-block-checksum-mismatch


...but we aren't the only ones:

https://github.com/facebook/rocksdb/issues/5251

https://github.com/facebook/rocksdb/issues/7033

https://jira.mariadb.org/browse/MDEV-20456

https://lists.launchpad.net/maria-discuss/msg05614.html

https://githubmemory.com/repo/openethereum/openethereum/issues/416

https://githubmemory.com/repo/FISCO-BCOS/FISCO-BCOS/issues/1895

https://groups.google.com/g/rocksdb/c/gUD4kCGTw-0/m/uLpFwkO5AgAJ


In at least one of our cases, the user was running consumer-grade SSDs without power-loss protection, though I don't think we ever fully established whether that was the cause.  Another case was potentially related to high memory usage on the node.  Hardware errors are a legitimate concern here, so checking dmesg, smartctl, etc. is probably warranted.  ECC memory obviously helps too (or rather, the lack of it makes this more difficult to diagnose).
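
As a starting point for the dmesg check, here is a minimal Python sketch that filters kernel log text for lines hinting at storage or memory faults.  The pattern list and the names SUSPECT_PATTERNS and suspect_lines are illustrative assumptions, not anything Ceph ships, and the list is far from exhaustive; a clean result does not rule out a hardware problem.

```python
import re

# Assumed, non-exhaustive fragments that often appear in kernel messages
# about disk or memory trouble. Adjust for your kernel and hardware.
SUSPECT_PATTERNS = re.compile(
    r"I/O error|medium error|hardware error|EDAC|machine check",
    re.IGNORECASE,
)


def suspect_lines(log_text):
    """Return the lines of log_text that match any suspect pattern."""
    return [
        line
        for line in log_text.splitlines()
        if SUSPECT_PATTERNS.search(line)
    ]
```

Feed it saved output from dmesg (or journalctl -k), for example suspect_lines(open("dmesg.txt").read()), and follow up on any hits with smartctl on the affected device.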


For folks who have experienced this, any information you can give about the hardware involved would be helpful.  We (and other projects) have seen similar things over the years, but this is a notoriously difficult issue to track down: it could be any one of many different things, and it may or may not be our code.


Mark


On 9/20/21 10:09 AM, Neha Ojha wrote:
Can we please create a bluestore tracker issue for this
(if one does not exist already), where we can start capturing all the
relevant information needed to debug this? Given that this has been
encountered in previous 16.2.* versions, it doesn't sound like a
regression in 16.2.6 to me, rather an issue in pacific. In any case,
we'll prioritize fixing it.

Thanks,
Neha

On Mon, Sep 20, 2021 at 8:03 AM Andrej Filipcic <andrej.filipcic@xxxxxx> wrote:
On 20/09/2021 16:02, David Orman wrote:
Same question here. For clarity, was this on upgrading to 16.2.6 from
16.2.5, or upgrading from some other release?

From 16.2.5, but the OSD services were never restarted after the upgrade
to .5, so it could be a leftover from previous issues.

Cheers,
Andrej
On Mon, Sep 20, 2021 at 8:57 AM Sean <sean@xxxxxxxxx> wrote:
I also ran into this with v16. In my case, trying to run a repair totally
exhausted the RAM on the box, and the repair was unable to complete.

After removing/recreating the OSD, I did notice that it has a drastically
smaller OMAP size than the other OSDs. I don’t know if that actually means
anything, but I wanted to mention it in case it does.

ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
14  hdd    10.91409  1.00000   11 TiB  3.3 TiB  3.2 TiB  4.6 MiB  5.4 GiB  7.7 TiB  29.81  1.02  34   up      osd.14
16  hdd    10.91409  1.00000   11 TiB  3.3 TiB  3.3 TiB  20 KiB   9.4 GiB  7.6 TiB  30.03  1.03  35   up      osd.16

~ Sean


On Sep 20, 2021 at 8:27:39 AM, Paul Mezzanini <pfmeec@xxxxxxx> wrote:

I got the exact same error on one of my OSDs when upgrading to 16.  I
used it as an exercise in trying to fix a corrupt rocksdb. I spent a few
days poking at it with no success.  I got mostly tool crashes like the
ones you are seeing, with no forward progress.

I eventually just gave up, purged the OSD, ran a SMART long test on the
drive to be sure, and then threw it back in the mix.  It has been
HEALTH_OK for a week now since it finished refilling the drive.


On 9/19/21 10:47 AM, Andrej Filipcic wrote:

2021-09-19T15:47:13.610+0200 7f8bc1f0e700  2 rocksdb: [db_impl/db_impl_compaction_flush.cc:2344] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2427092066, got 4051549320  in db/251935.sst offset 18414386 size 4032, Accumulated background error counts: 1
2021-09-19T15:47:13.636+0200 7f8bbacf1700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2427092066, got 4051549320  in db/251935.sst offset 18414386 size 4032 code = 2

Rocksdb transaction:
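
For anyone collecting details for a tracker issue, the affected file, offset and checksums can be pulled out of messages like the ones above with a short script.  This is a minimal Python sketch; the regular expression is derived only from the two log lines quoted in this thread, so other RocksDB versions may word the message differently, and the function name parse_checksum_mismatch is hypothetical.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption;
# not taken from the RocksDB source).
MISMATCH_RE = re.compile(
    r"block checksum mismatch: expected (\d+), got (\d+) "
    r"in (\S+) offset (\d+) size (\d+)"
)


def parse_checksum_mismatch(text):
    """Return the mismatch details as a dict, or None if none is found."""
    # Collapse line wraps and repeated spaces so wrapped mail text matches.
    flat = " ".join(text.split())
    match = MISMATCH_RE.search(flat)
    if match is None:
        return None
    expected, got, sst_file, offset, size = match.groups()
    return {
        "expected": int(expected),
        "got": int(got),
        "file": sst_file,
        "offset": int(offset),
        "size": int(size),
    }
```

Running it over an OSD log and grouping the results by file shows whether the errors keep pointing at the same .sst file and offset or move around, which is worth noting alongside the hardware details.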

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


--
_____________________________________________________________
     prof. dr. Andrej Filipcic,   E-mail: Andrej.Filipcic@xxxxxx
     Department of Experimental High Energy Physics - F9
     Jozef Stefan Institute, Jamova 39, P.o.Box 3000
     SI-1001 Ljubljana, Slovenia
     Tel.: +386-1-477-3674    Fax: +386-1-425-7074
-------------------------------------------------------------




