Re: 16.2.7 pacific rocksdb Corruption: CURRENT

On 12/20/21 10:09, Igor Fedotov wrote:
Hi Andrej,

First of all, I'd like to mention that this issue is not new in 16.2.7. There is a ticket, https://tracker.ceph.com/issues/47330, which mentions a similar case for mimic. The ticket is erroneously tagged as resolved, but the proposed fix just introduces a bluefs file import option to ceph-bluestore-tool, which permits manual recovery. Please be aware that this import is present in master only and has a bug of its own (still open): https://github.com/ceph/ceph/pull/44317


So some more questions which might help in troubleshooting:

1) Did the error pop up immediately after the upgrade, or did some successful starts happen on 16.2.7 before the failure?
no successful restarts

2) Could you please share an OSD log from before the shutdown which triggered the corruption? And the one for the first start afterwards.
here
http://www-f9.ijs.si/~andrej/ceph-osd.611.log-20211220-short.gz

3) Please set debug-bluefs to 20, retry the OSD start and share the log.
http://www-f9.ijs.si/~andrej/ceph-osd.611.log-20211220-short.gz

4) Please share the content of the broken CURRENT file

[root@lcst0032 db]# hexdump CURRENT
0000000 909f d59e f778 4f50 acb0 b1ea 59a2 9e90
0000010
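
For anyone reading the dump: plain `hexdump` prints 16-bit words in host byte order, so on a little-endian machine the two bytes of each word appear swapped relative to their order on disk. Below is a minimal Python sketch, assuming the dump was taken on a little-endian host (the helper name `hexdump_words_to_bytes` is made up for illustration), that rebuilds the raw bytes and checks the two properties a CURRENT file is expected to have.

```python
def hexdump_words_to_bytes(dump: str) -> bytes:
    """Rebuild raw bytes from default `hexdump` output (16-bit hex words).

    Assumptions: the dump comes from a little-endian host, so a printed
    word "abcd" is the byte pair 0xcd 0xab on disk, and the file has an
    even length (no zero-padded trailing byte).
    """
    out = bytearray()
    for line in dump.splitlines():
        fields = line.split()
        # The first field is the offset; a line holding only an offset
        # is the end-of-file marker and contributes no data.
        for word in fields[1:]:
            out += int(word, 16).to_bytes(2, "little")
    return bytes(out)


dump = """\
0000000 909f d59e f778 4f50 acb0 b1ea 59a2 9e90
0000010
"""

raw = hexdump_words_to_bytes(dump)
print("length:", len(raw))
print("bytes:", raw.hex(" "))
print("starts with MANIFEST-:", raw.startswith(b"MANIFEST-"))
print("ends with newline:", raw.endswith(b"\n"))
```

Under these assumptions the file is 16 bytes long, neither starts with `MANIFEST-` nor ends with a newline, which is consistent with the RocksDB error in the log. Running `hexdump -C CURRENT` instead would show the bytes in on-disk order directly.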


Thanks,
Andrej


Thanks,

Igor


On 12/20/2021 11:17 AM, Andrej Filipcic wrote:

Hi,

When upgrading to 16.2.7 from 16.2.6, 8 out of ~1600 OSDs failed to start. The first 16.2.7 startup crashes here:

2021-12-19T09:52:34.128+0100 7ff7104c0080  1 bluefs mount
2021-12-19T09:52:34.129+0100 7ff7104c0080  1 bluefs _init_alloc shared, id 1, capacity 0xe8d7fc00000, block size 0x10000
2021-12-19T09:52:34.238+0100 7ff7104c0080  1 bluefs mount shared_bdev_used = 0
2021-12-19T09:52:34.238+0100 7ff7104c0080  1 bluestore(/var/lib/ceph/osd/ceph-611) _prepare_db_environment set db_paths to db,15200851643596 db.slow,15200851643596
2021-12-19T09:52:34.257+0100 7ff7104c0080 -1 rocksdb: verify_sharding unable to list column families: Corruption: CURRENT file does not end with newline
2021-12-19T09:52:34.257+0100 7ff7104c0080 -1 bluestore(/var/lib/ceph/osd/ceph-611) _open_db erroring opening db:
2021-12-19T09:52:34.257+0100 7ff7104c0080  1 bluefs umount

I was able to export the RocksDB, and the contents of the CURRENT file are corrupted; as I understand it, the file should contain the MANIFEST-* info.
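
To check the other failed OSDs for the same symptom, something like the following could be run against each exported CURRENT file. This is only a sketch: it assumes a healthy CURRENT is a single line of the form `MANIFEST-<number>` followed by a newline, the function name `check_current` is hypothetical, and the file path is whatever the export produced.

```python
import re
import sys
from typing import Optional

# Assumption: a healthy CURRENT file is exactly one line naming the
# active manifest, e.g. b"MANIFEST-000123\n".
_MANIFEST_LINE = re.compile(rb"MANIFEST-\d+\n")


def check_current(data: bytes) -> Optional[str]:
    """Return None if `data` looks like a valid CURRENT file, else a reason."""
    if not data:
        return "CURRENT file is empty"
    if not data.endswith(b"\n"):
        # Same condition that the RocksDB message in the OSD log reports.
        return "CURRENT file does not end with newline"
    if not _MANIFEST_LINE.fullmatch(data):
        return "CURRENT does not name a MANIFEST-<number> file"
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_current.py /path/to/exported/db/CURRENT
    with open(sys.argv[1], "rb") as fh:
        problem = check_current(fh.read())
    print("OK" if problem is None else f"BAD: {problem}")
```

The check only inspects the file's shape; it does not verify that the named MANIFEST file exists or is itself intact.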

I have attached the full OSD log of one failure; the other failed OSDs all fail for the same reason.

Any hints? For now, I am keeping those OSDs off in case they can be debugged further.

(resending with shortened log)

Best regards,
Andrej



--
_____________________________________________________________
   prof. dr. Andrej Filipcic,   E-mail:Andrej.Filipcic@xxxxxx
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674    Fax: +386-1-477-3166
-------------------------------------------------------------