Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

Hi Igor,

I have now fixed my incorrect OSD debug config to:
[osd.7] 
        debug bluefs = 20
        debug bdev = 20

You can download the resulting debug log from: https://we.tl/t-3e4do1PQGj
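
For reference, the same levels can apparently also be raised at runtime through the monitors' central config store instead of ceph.conf (just a sketch, assuming the mons are up and reachable):
        ceph config set osd.7 debug_bluefs 20/20
        ceph config set osd.7 debug_bdev 20/20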


Thanks,
Sebastian



> On 21.12.2021, at 19:44, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
> 
> Hi Sebastian,
> 
> first of all, I'm not sure this issue has the same root cause as Francois' one. Most likely it's just another BlueFS/RocksDB data corruption that manifests in the same way.
> 
> In this respect I would rather point to this one, reported just yesterday: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/M2ZRZD4725SRPFE5MMZPI7JBNO23FNU6/
> 
> So similarly I'd like to ask some questions/collect more data. Please find the list below:
> 
> 1) Is this a bare metal or containerized deployment?
> 
> 2) What's the output of "hdparm -W <dev>" for the devices in question? Is write caching enabled at the disk controller?
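> 
> For example (a sketch - replace /dev/sdX with the OSD's actual data device):
> 
> 	hdparm -W /dev/sdX      # show the current write-caching state
> 	hdparm -W 0 /dev/sdX    # disable the volatile write cache if it turns out to be on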
> 
> 3) Could you please share the broken OSD startup log with debug-bluefs set to 20?
> 
> 4) Could you please export the bluefs files via ceph-bluestore-tool (this might need some extra space to hold all the bluefs data on the target filesystem) and share the content of the db/002182.sst file? The first 4 MB are generally sufficient if it's huge.
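> 
> Something along these lines should do it (the output directory is just an example - make sure it has enough free space):
> 
> 	ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-7 --out-dir /mnt/bluefs-export
> 	dd if=/mnt/bluefs-export/db/002182.sst of=/tmp/002182.sst.head bs=1M count=4   # first 4 MB only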
> 
> 5) Have you seen RocksDB data corruption on this cluster before?
> 
> 6) What's the disk hardware for these OSDs - disk drives and controllers?
> 
> 7) Did you reboot the nodes or just restart the OSDs? Did all the issues happen on the same node or on different nodes? How many OSDs were restarted in total?
> 
> 8) Is it correct that this is an HDD-only setup, i.e. there is no standalone SSD/NVMe for WAL/DB?
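> 
> (A quick way to double-check is the OSD metadata, e.g.:
> 
> 	ceph osd metadata 7 | grep -E 'bluefs_dedicated|rotational'
> 
> which should show whether a separate DB/WAL device is configured for that OSD.)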
> 
> 9) Would you be able to run some long-lasting (and potentially data-corrupting) experiments on this cluster in an attempt to pinpoint the issue? I'm thinking of periodically shutting an OSD down under load, with a raised debug level for that specific OSD, to catch the corrupting event. The major problem with debugging this bug is that we can see its consequences, but we have no clue about what was happening when the actual corruption occurred. Hence we need to reproduce it somehow. So please let me know if we can use your cluster/your help for that...
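> 
> As a rough sketch of what I have in mind (assuming a bare-metal, systemd-managed OSD; the interval is arbitrary):
> 
> 	while true; do
> 	    sleep 600                      # let osd.7 run under client load for a while
> 	    systemctl stop ceph-osd@7      # shut the OSD down while I/O is in flight
> 	    sleep 30
> 	    systemctl start ceph-osd@7     # restart and watch the log for the corruption
> 	done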
> 
> 
> Thanks in advance,
> 
> Igor
> 
> On 12/21/2021 7:47 PM, Sebastian Mazza wrote:
>> Hi all,
>> 
>> after a reboot of the cluster, 3 OSDs cannot be started. The OSDs exit with the following error message:
>> 	2021-12-21T01:01:02.209+0100 7fd368cebf00  4 rocksdb: [db_impl/db_impl.cc:396] Shutdown: canceling all background work
>> 	2021-12-21T01:01:02.209+0100 7fd368cebf00  4 rocksdb: [db_impl/db_impl.cc:573] Shutdown complete
>> 	2021-12-21T01:01:02.209+0100 7fd368cebf00 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002182.sst
>> 	2021-12-21T01:01:02.213+0100 7fd368cebf00 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
>> 	2021-12-21T01:01:02.213+0100 7fd368cebf00  1 bluefs umount
>> 	2021-12-21T01:01:02.213+0100 7fd368cebf00  1 bdev(0x559bbe0ea800 /var/lib/ceph/osd/ceph-7/block) close
>> 	2021-12-21T01:01:02.293+0100 7fd368cebf00  1 bdev(0x559bbe0ea400 /var/lib/ceph/osd/ceph-7/block) close
>> 	2021-12-21T01:01:02.537+0100 7fd368cebf00 -1 osd.7 0 OSD:init: unable to mount object store
>> 	2021-12-21T01:01:02.537+0100 7fd368cebf00 -1  ** ERROR: osd init failed: (5) Input/output error
>> 
>> 
>> I found a similar problem in this Mailing list: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/MJLVS7UPJ5AZKOYN3K2VQW7WIOEQGC5V/#MABLFA4FHG6SX7YN4S6BGSCP6DOAX6UE
>> 
>> In this thread, Francois was able to successfully repair his OSD data with `ceph-bluestore-tool fsck`. I tried to run:
>> `ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-7 -l /var/log/ceph/bluestore-tool-fsck-osd-7.log --log-level 20  > /var/log/ceph/bluestore-tool-fsck-osd-7.out  2>&1`
>> But that results in:
>> 	2021-12-21T16:44:18.455+0100 7fc54ef7a240 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002182.sst
>> 	2021-12-21T16:44:18.455+0100 7fc54ef7a240 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
>> 	fsck failed: (5) Input/output error
>> 
>> I also tried to run `ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-7 repair`. But that also fails with:
>> 	2021-12-21T17:34:06.780+0100 7f35765f7240  0 bluestore(/var/lib/ceph/osd/ceph-7) _open_db_and_around read-only:0 repair:0
>> 	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1a800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>> 	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1a800 /var/lib/ceph/osd/ceph-7/block) open size 12000134430720 (0xae9ffc00000, 11 TiB)
>> 		block_size 4096 (4 KiB) rotational discard not supported
>> 	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluestore(/var/lib/ceph/osd/ceph-7) _set_cache_sizes cache_size 1073741824 meta 0.45 kv 0.45 data 0.06
>> 	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1ac00 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>> 	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1ac00 /var/lib/ceph/osd/ceph-7/block) open size 12000134430720 (0xae9ffc00000, 11 TiB)
>> 		block_size 4096 (4 KiB) rotational discard not supported
>> 	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-7/block size 11 TiB
>> 	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluefs mount
>> 	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluefs _init_alloc shared, id 1, capacity 0xae9ffc00000, block size 0x10000
>> 	2021-12-21T17:34:06.904+0100 7f35765f7240  1 bluefs mount shared_bdev_used = 0
>> 	2021-12-21T17:34:06.904+0100 7f35765f7240  1 bluestore(/var/lib/ceph/osd/ceph-7) _prepare_db_environment set db_paths to db,11400127709184 db.slow,11400127709184
>> 	2021-12-21T17:34:06.908+0100 7f35765f7240 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002182.sst
>> 	2021-12-21T17:34:06.908+0100 7f35765f7240 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
>> 	2021-12-21T17:34:06.908+0100 7f35765f7240  1 bluefs umount
>> 	2021-12-21T17:34:06.908+0100 7f35765f7240  1 bdev(0x55fce5a1ac00 /var/lib/ceph/osd/ceph-7/block) close
>> 	2021-12-21T17:34:07.072+0100 7f35765f7240  1 bdev(0x55fce5a1a800 /var/lib/ceph/osd/ceph-7/block) close
>> 
>> 
>> The cluster is not in production; therefore, I can remove all corrupted pools and delete the OSDs. However, I would like to understand what went wrong in order to avoid such a situation in the future.
>> 
>> I am providing the OSD logs from around the time of the server reboot at the following link: https://we.tl/t-fArHXTmSM7
>> 
>> Ceph version: 16.2.6
>> 
>> 
>> Thanks,
>> Sebastian
>> 
> 
> -- 
> Igor Fedotov
> Ceph Lead Developer
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


