Re: osd crashing and rocksdb corruption

Here is the output of ceph-bluestore-tool bluefs-bdev-sizes
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17
1 : device size 0x746c0000000 : own 0x[37e1eb00000~4a82900000] = 0x4a82900000 : using 0x5bc780000(23 GiB)
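The hexadecimal fields in that line are easier to compare once converted to bytes and GiB. The following is a minimal sketch, not part of ceph-bluestore-tool: the helper name `parse_bdev_sizes` and the regular expression are illustrative assumptions based only on the single line shown above. For this line, the device is roughly 7.28 TiB, the extent owned by BlueFS is roughly 298 GiB, and about 23 GiB of it is in use.

```python
import re

GIB = 1 << 30

def parse_bdev_sizes(line):
    """Parse one device line of bluefs-bdev-sizes output into byte counts.

    Hypothetical helper: the pattern is inferred from the one line quoted
    in this message and may not cover other output variants.
    """
    pattern = (r"device size (0x[0-9a-f]+) : own 0x\[([0-9a-f]+)~([0-9a-f]+)\]"
               r" = (0x[0-9a-f]+) : using (0x[0-9a-f]+)")
    m = re.search(pattern, line)
    if m is None:
        raise ValueError("unrecognized line format")
    return {
        "device_size": int(m.group(1), 16),
        "own_offset": int(m.group(2), 16),   # start of the owned extent
        "own_length": int(m.group(3), 16),   # length of the owned extent
        "own_total": int(m.group(4), 16),    # sum of all owned extents
        "using": int(m.group(5), 16),        # space in use
    }

line = ("1 : device size 0x746c0000000 : own 0x[37e1eb00000~4a82900000] "
        "= 0x4a82900000 : using 0x5bc780000(23 GiB)")
sizes = parse_bdev_sizes(line)
for name, value in sizes.items():
    print(f"{name:12s} {value:>16d} bytes  {value / GIB:10.2f} GiB")
```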


The osd.5 startup log, with debug-bluestore and debug-bluefs set to 20,
is available at the following address (28 MB).

https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f20200428141355/5da2ebf0d33750a5fde85bf662cf0e6d20200428141415/55849f?utm_campaign=WT_email_tracking&utm_content=general&utm_medium=download_button&utm_source=notify_recipient_email

Thanks for your help.
F.

On 28/04/2020 at 13:33, Igor Fedotov wrote:
Hi Francois,


Could you please share OSD startup log with debug-bluestore (and debug-bluefs) set to 20.

Also please run ceph-bluestore-tool's bluefs-bdev-sizes command and share the output.

Thanks,

Igor


On 4/28/2020 12:55 AM, Francois Legrand wrote:
Hi all,

*** Short version ***
Is there a way to repair a RocksDB database that fails with the errors "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db"?


*** Long version ***
We operate a Nautilus Ceph cluster (100 disks of 8 TB in 6 servers, plus 4 mons/mgrs and 3 MDS). We recently (Monday 20) upgraded from 14.2.7 to 14.2.8, which triggered a rebalancing of some data. Two days later (Wednesday 22) we had a very short power outage. Only one of the OSD servers went down (and unfortunately died). This triggered a reconstruction of the lost OSDs. Operations went fine until Saturday 25, when some OSDs on the 5 remaining servers started to crash for no apparent reason. We tried to restart them, but they crashed again. We ended up with 18 OSDs down (plus 16 in the dead server, so 34 OSDs down out of 100).
Looking at the logs, we found the following for all the crashed OSDs:

-237> 2020-04-25 16:32:51.835 7f1f45527a80  3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2729370997  in db/181355.sst offset 18446744073709551615 size 18446744073709551615

and

2020-04-25 16:05:47.251 7fcbd1e46a80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db:
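One detail in the first log line is worth checking: the reported offset and size are both 18446744073709551615, which is the largest unsigned 64-bit value (2^64 - 1). The sketch below only verifies that arithmetic on the quoted line; the regular expression is an assumption fitted to this one message. Reading these values as unset placeholders rather than real file positions is a guess, not a diagnosis.

```python
import re

UINT64_MAX = 2**64 - 1

# The rocksdb message quoted above, without the timestamp and thread id.
log_line = ("rocksdb: [table/block_based_table_reader.cc:1117] Encountered error "
            "while reading data from compression dictionary block Corruption: "
            "block checksum mismatch: expected 0, got 2729370997  in db/181355.sst "
            "offset 18446744073709551615 size 18446744073709551615")

# Pull out the SST file name and the reported offset and size.
m = re.search(r"in (\S+) offset (\d+) size (\d+)", log_line)
sst, offset, size = m.group(1), int(m.group(2)), int(m.group(3))

print(f"file: {sst}")
print(f"offset is 2^64 - 1: {offset == UINT64_MAX}")
print(f"size is 2^64 - 1:   {size == UINT64_MAX}")
```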

We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" message was already present a few days before the crash.
Some OSDs that show this error are still up.

We tried to repair with:
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair
But without success (it ends with "_open_db erroring opening db").

Does anybody have an idea how to fix this, or at least know whether it is possible to repair the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors? Thanks for your help (we are desperate because we are going to lose data and are fighting to save what we can)!
F.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx