Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

Hi Sebastian,

first of all, I'm not sure this issue has the same root cause as Francois's. Most likely it's just another BlueFS/RocksDB data corruption that happens to manifest in the same way.

In this respect I would rather point to this one, reported just yesterday: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/M2ZRZD4725SRPFE5MMZPI7JBNO23FNU6/

So, similarly, I'd like to ask some questions and collect more data. Please find the list below:

1) Is this a bare metal or containerized deployment?

2) What's the output of "hdparm -W <dev>" for the devices in question? Is write caching enabled at the disk controller?

3) Could you please share the broken OSD startup log with debug-bluefs set to 20?

4) Could you please export the BlueFS files via ceph-bluestore-tool (this might need some extra space at the target filesystem to hold all the BlueFS data) and share the content of the db/002182.sst file? The first 4 MB is generally sufficient if it's huge. (A rough command sketch for items 2-4 follows below.)

5) Have you seen RocksDB data corruptions at this cluster before?

6) What's the disk hardware for these OSDs - disk drives and controllers?

7) Did you reboot the nodes or just restart the OSDs? Did all the issues happen at the same node or at different nodes? How many OSDs were restarted in total?

8) Is it correct that this is an HDD-only setup, i.e. there is no standalone SSD/NVMe for WAL/DB?

9) Would you be able to run some long-lasting (and potentially data-corrupting) experiments on this cluster in an attempt to pinpoint the issue? I'm thinking about periodically shutting down an OSD under load to catch the corrupting event, with a raised debug level for that specific OSD. The major problem with debugging this bug is that we only see its consequences; we have no clue what was happening when the actual corruption occurred, hence we need to reproduce it somehow. So please let me know if we can use your cluster/help for that... (A sketch of such an experiment also follows below.)
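For items 2) to 4), a rough sketch of the commands I have in mind is below. This is only a sketch: the device name (/dev/sdX), the OSD id (7) and the export directory are placeholders, so please adjust them to your environment.

	# 2) check the on-disk write cache state of the OSD's data device
	hdparm -W /dev/sdX

	# 3) start the broken OSD in the foreground with BlueFS debugging raised;
	#    the startup attempt is then written to /var/log/ceph/ceph-osd.7.log
	#    (setting "debug bluefs = 20" in ceph.conf and restarting the service works too)
	ceph-osd -f --cluster ceph --id 7 --setuser ceph --setgroup ceph --debug-bluefs 20

	# 4) export the BlueFS files and grab the first 4 MB of the corrupted SST
	ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-7 --out-dir /mnt/bluefs-export
	dd if=/mnt/bluefs-export/db/002182.sst of=/tmp/002182.sst.head bs=1M count=4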
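And for item 9), the kind of experiment I'm thinking of would look roughly like the loop below. Again, just a sketch assuming a systemd-managed OSD; the pool name and the timings are arbitrary. The idea is simply to catch a shutdown that leaves an .sst file in this broken state while a verbose log of that shutdown is available.

	# generate steady write load in one terminal (pool name is just an example)
	rados bench -p testpool 3600 write --no-cleanup

	# in another terminal, periodically stop/start the OSD under test,
	# which runs with "debug bluefs = 20" so the corrupting event gets logged
	while true; do
	    systemctl stop ceph-osd@7
	    sleep 30
	    systemctl start ceph-osd@7
	    sleep 300
	done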


Thanks in advance,

Igor

On 12/21/2021 7:47 PM, Sebastian Mazza wrote:
Hi all,

after a reboot of the cluster, 3 OSDs cannot be started. The OSDs exit with the following error message:
	2021-12-21T01:01:02.209+0100 7fd368cebf00  4 rocksdb: [db_impl/db_impl.cc:396] Shutdown: canceling all background work
	2021-12-21T01:01:02.209+0100 7fd368cebf00  4 rocksdb: [db_impl/db_impl.cc:573] Shutdown complete
	2021-12-21T01:01:02.209+0100 7fd368cebf00 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002182.sst
	2021-12-21T01:01:02.213+0100 7fd368cebf00 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
	2021-12-21T01:01:02.213+0100 7fd368cebf00  1 bluefs umount
	2021-12-21T01:01:02.213+0100 7fd368cebf00  1 bdev(0x559bbe0ea800 /var/lib/ceph/osd/ceph-7/block) close
	2021-12-21T01:01:02.293+0100 7fd368cebf00  1 bdev(0x559bbe0ea400 /var/lib/ceph/osd/ceph-7/block) close
	2021-12-21T01:01:02.537+0100 7fd368cebf00 -1 osd.7 0 OSD:init: unable to mount object store
	2021-12-21T01:01:02.537+0100 7fd368cebf00 -1  ** ERROR: osd init failed: (5) Input/output error


I found a similar problem in this Mailing list: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/MJLVS7UPJ5AZKOYN3K2VQW7WIOEQGC5V/#MABLFA4FHG6SX7YN4S6BGSCP6DOAX6UE

In this thread, Francois was able to successfully repair his OSD data with `ceph-bluestore-tool fsck`. I tried to run:
`ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-7 -l /var/log/ceph/bluestore-tool-fsck-osd-7.log --log-level 20  > /var/log/ceph/bluestore-tool-fsck-osd-7.out  2>&1`
But that results in:
	2021-12-21T16:44:18.455+0100 7fc54ef7a240 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002182.sst
	2021-12-21T16:44:18.455+0100 7fc54ef7a240 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
	fsck failed: (5) Input/output error

I also tried to run `ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-7 repair`. But that also fails with:
	2021-12-21T17:34:06.780+0100 7f35765f7240  0 bluestore(/var/lib/ceph/osd/ceph-7) _open_db_and_around read-only:0 repair:0
	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1a800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1a800 /var/lib/ceph/osd/ceph-7/block) open size 12000134430720 (0xae9ffc00000, 11 TiB)
		block_size 4096 (4 KiB) rotational discard not supported
	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluestore(/var/lib/ceph/osd/ceph-7) _set_cache_sizes cache_size 1073741824 meta 0.45 kv 0.45 data 0.06
	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1ac00 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1ac00 /var/lib/ceph/osd/ceph-7/block) open size 12000134430720 (0xae9ffc00000, 11 TiB)
		block_size 4096 (4 KiB) rotational discard not supported
	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-7/block size 11 TiB
	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluefs mount
	2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluefs _init_alloc shared, id 1, capacity 0xae9ffc00000, block size 0x10000
	2021-12-21T17:34:06.904+0100 7f35765f7240  1 bluefs mount shared_bdev_used = 0
	2021-12-21T17:34:06.904+0100 7f35765f7240  1 bluestore(/var/lib/ceph/osd/ceph-7) _prepare_db_environment set db_paths to db,11400127709184 db.slow,11400127709184
	2021-12-21T17:34:06.908+0100 7f35765f7240 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002182.sst
	2021-12-21T17:34:06.908+0100 7f35765f7240 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
	2021-12-21T17:34:06.908+0100 7f35765f7240  1 bluefs umount
	2021-12-21T17:34:06.908+0100 7f35765f7240  1 bdev(0x55fce5a1ac00 /var/lib/ceph/osd/ceph-7/block) close
	2021-12-21T17:34:07.072+0100 7f35765f7240  1 bdev(0x55fce5a1a800 /var/lib/ceph/osd/ceph-7/block) close


The cluster is not in production; therefore, I can remove all corrupt pools and delete the OSDs. However, I would like to understand what was going on, in order to avoid such a situation in the future.

I will provide the OSD logs from the time around the server reboot at the following link: https://we.tl/t-fArHXTmSM7

Ceph version: 16.2.6


Thanks,
Sebastian


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



