Hi Andrej,
first of all I'd like to mention that this issue is not really new to
16.2.7. There is a ticket, https://tracker.ceph.com/issues/47330, which
mentions a similar case for mimic. That ticket is erroneously tagged as
resolved - the proposed fix just introduces a bluefs file import option
to ceph-bluestore-tool, which permits manual recovery. Please be aware
that this import is present in master only and has a bug of its own
(still open): https://github.com/ceph/ceph/pull/44317
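For reference, a rough sketch of what the manual path looks like (the OSD
id and paths are taken from your log below, the scratch directory is an
arbitrary choice, and the import step exists in master only, so treat this
as an outline rather than a 16.2.7 recipe):

    # export the BlueFS contents (including the RocksDB files) to a
    # scratch dir; the OSD must not be running
    ceph-bluestore-tool bluefs-export \
        --path /var/lib/ceph/osd/ceph-611 \
        --out-dir /root/osd-611-bluefs

    # inspect the exported CURRENT file; a healthy one is a single line
    # "MANIFEST-<number>" terminated by a newline (cat -A shows the line
    # end as $)
    cat -A /root/osd-611-bluefs/db/CURRENT

    # master additionally has a bluefs file import command in
    # ceph-bluestore-tool that can write a repaired file back (see the
    # PR above); it is not available in 16.2.7, hence the questions
    # below before deciding on a recovery path.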
So here are some more questions which might help with troubleshooting:
1) Did the error pop up immediately after the upgrade, or were there some
successful starts on 16.2.7 before the failure?
2) Could you please share an OSD log from before the shutdown which
triggered the corruption, and the one from the first start afterwards?
3) Please set debug-bluefs to 20, retry the OSD start and share the log
(see the sketch after this list for one way to do that).
4) Please share the content of the broken CURRENT file (the sketch below
covers that as well).
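Something along these lines for 3) and 4), assuming the OSD id 611 from
your log and a cluster running the OSDs via systemd units (adjust to your
deployment):

    # 3) bump bluefs logging for the failing OSD and capture the start
    # attempt
    ceph config set osd.611 debug_bluefs 20
    systemctl restart ceph-osd@611
    # then grab /var/log/ceph/ceph-osd.611.log (or the journalctl output)
    # for that start attempt

    # 4) the broken CURRENT file can be taken from the bluefs-export
    # output shown above; a hexdump is the most useful form
    hexdump -C /root/osd-611-bluefs/db/CURRENT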
Thanks,
Igor
On 12/20/2021 11:17 AM, Andrej Filipcic wrote:
Hi,
When upgrading to 16.2.7 from 16.2.6, 8 out of ~1600 OSDs failed to
start. The first 16.2.7 startup crashes here:
2021-12-19T09:52:34.128+0100 7ff7104c0080 1 bluefs mount
2021-12-19T09:52:34.129+0100 7ff7104c0080 1 bluefs _init_alloc
shared, id 1, capacity 0xe8d7fc00000, block size 0x10000
2021-12-19T09:52:34.238+0100 7ff7104c0080 1 bluefs mount
shared_bdev_used = 0
2021-12-19T09:52:34.238+0100 7ff7104c0080 1
bluestore(/var/lib/ceph/osd/ceph-611) _prepare_db_environment set
db_paths to db,15200851643596 db.slow,15200851643596
2021-12-19T09:52:34.257+0100 7ff7104c0080 -1 rocksdb: verify_sharding
unable to list column families: Corruption: CURRENT file does not end
with newline
2021-12-19T09:52:34.257+0100 7ff7104c0080 -1
bluestore(/var/lib/ceph/osd/ceph-611) _open_db erroring opening db:
2021-12-19T09:52:34.257+0100 7ff7104c0080 1 bluefs umount
I could export the rocksdb, and the contents of the CURRENT file are
corrupted; I understand it should contain the MANIFEST-* info.
I have attached the full OSD log of one failure; the other failed OSDs
all fail for the same reason.
Any hint? For now, I am keeping those OSDs off in case they can be further debugged.
(resending with shortened log)
Best regards,
Andrej
--
Igor Fedotov
Ceph Lead Developer
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx