Re: 16.2.7 pacific rocksdb Corruption: CURRENT


 



Hi Andrej,

first of all I'd like to mention that this issue is not new in 16.2.7. There is a ticket, https://tracker.ceph.com/issues/47330, which mentions a similar case for mimic. The ticket is erroneously tagged as resolved, but the proposed fix just introduces a bluefs file import option to ceph-bluestore-tool, which permits manual recovery. Please be aware that this import is present in master only and has a bug of its own (still open): https://github.com/ceph/ceph/pull/44317


So some more questions which might help in troubleshooting:

1) Did the error pop up immediately after the upgrade, or did some successful starts happen on 16.2.7 before the failure?

2) Could you please share an OSD log from before the shutdown which triggered the corruption? And the one for the first start afterwards.

3) Please set debug-bluefs to 20, retry the OSD start and share the log.

4) Please share the content of the broken CURRENT file
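
For item 4, a raw byte dump is more useful than a text paste, since the corrupted content may be unprintable. Below is a minimal Python sketch, assuming the file has already been exported from bluefs (e.g. with ceph-bluestore-tool bluefs-export). It relies on RocksDB expecting CURRENT to be a single line that names the active MANIFEST file and ends with a newline. The name describe_current is a hypothetical helper, not part of any Ceph tool.

```python
import re

# RocksDB expects CURRENT to hold one line: the active manifest's
# file name followed by a newline, e.g. b"MANIFEST-000123\n".
MANIFEST_LINE = re.compile(rb"MANIFEST-\d+\n")


def describe_current(data: bytes) -> dict:
    """Summarise the raw bytes of a CURRENT file for a bug report."""
    return {
        "size": len(data),
        "ends_with_newline": data.endswith(b"\n"),
        "looks_valid": MANIFEST_LINE.fullmatch(data) is not None,
        # Space-separated hex so unprintable bytes survive copy and paste.
        "hex": data.hex(" "),
    }


# Example use (the path is an assumption about where the export landed):
#   from pathlib import Path
#   print(describe_current(Path("/tmp/osd-611-export/db/CURRENT").read_bytes()))
```

Sharing the size and hex output alongside the file itself makes it easier to tell a truncated or zero-filled file from one that only lost its trailing newline.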


Thanks,

Igor


On 12/20/2021 11:17 AM, Andrej Filipcic wrote:

Hi,

When upgrading to 16.2.7 from 16.2.6, 8 out of ~1600 OSDs failed to start. The first 16.2.7 startup crashes here:

2021-12-19T09:52:34.128+0100 7ff7104c0080  1 bluefs mount
2021-12-19T09:52:34.129+0100 7ff7104c0080  1 bluefs _init_alloc shared, id 1, capacity 0xe8d7fc00000, block size 0x10000
2021-12-19T09:52:34.238+0100 7ff7104c0080  1 bluefs mount shared_bdev_used = 0
2021-12-19T09:52:34.238+0100 7ff7104c0080  1 bluestore(/var/lib/ceph/osd/ceph-611) _prepare_db_environment set db_paths to db,15200851643596 db.slow,15200851643596
2021-12-19T09:52:34.257+0100 7ff7104c0080 -1 rocksdb: verify_sharding unable to list column families: Corruption: CURRENT file does not end with newline
2021-12-19T09:52:34.257+0100 7ff7104c0080 -1 bluestore(/var/lib/ceph/osd/ceph-611) _open_db erroring opening db:
2021-12-19T09:52:34.257+0100 7ff7104c0080  1 bluefs umount

I could export the rocksdb, and the contents of the CURRENT file are corrupted; as I understand it, the file should contain the MANIFEST-* info.

I have attached the full OSD log of one failure; the other failed OSDs all fail for the same reason.
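
One way to confirm that several OSD logs share the same failure is to collect the distinct RocksDB corruption messages from each. This is a minimal Python sketch, assuming log lines in the format shown above; the function name corruption_messages is hypothetical, and the timestamp lookahead is only there to cope with entries that got joined onto one line.

```python
import re

# Capture the text after "Corruption: " up to the end of the line, or up to
# the next ISO-style timestamp if several log entries were joined together.
CORRUPTION = re.compile(
    r"Corruption: (.*?)(?=\s+\d{4}-\d{2}-\d{2}T|$)", re.MULTILINE
)


def corruption_messages(log_text: str) -> set[str]:
    """Return the distinct 'Corruption: ...' messages found in an OSD log."""
    return {match.strip() for match in CORRUPTION.findall(log_text)}
```

If every failed OSD's log yields the same single-element set, the failures most likely share one cause.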

Any hints? For now, I am keeping those OSDs off in case they can be debugged further.

(resending with shortened log)

Best regards,
Andrej

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx




