Hi all,

we are running a pretty small Ceph instance (v13.2.6) with 1 host and 8 OSDs and are planning to expand to a more conventional setup with 3 hosts and more OSDs. However, tonight one of our redundant PSUs died. The failover itself worked, but it looks like the event has corrupted 3 out of 8 OSDs. The pools all have a replication level of 2. All OSDs are BlueStore with rocksdb and have no external journal or WAL.

Two of them report a missing rocksdb:

Jun 30 01:33:32 tecoceph systemd[1]: Starting Ceph object storage daemon osd.3...
Jun 30 01:33:32 tecoceph systemd[1]: Started Ceph object storage daemon osd.3.
Jun 30 01:33:32 tecoceph ceph-osd[11431]: 2019-06-30 01:33:32.242 7f2666a75d80 -1 Public network was set, but cluster network was not set
Jun 30 01:33:32 tecoceph ceph-osd[11431]: 2019-06-30 01:33:32.242 7f2666a75d80 -1 Using public network also for cluster network
Jun 30 01:33:32 tecoceph ceph-osd[11431]: starting osd.3 at - osd_data /var/lib/ceph/osd/ceph-3 /var/lib/ceph/osd/ceph-3/journal
Jun 30 01:33:32 tecoceph ceph-osd[11431]: 2019-06-30 01:33:32.898 7f2666a75d80 -1 rocksdb: NotFound:
Jun 30 01:33:32 tecoceph ceph-osd[11431]: 2019-06-30 01:33:32.898 7f2666a75d80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db:
Jun 30 01:33:33 tecoceph ceph-osd[11431]: 2019-06-30 01:33:33.267 7f2666a75d80 -1 osd.3 0 OSD:init: unable to mount object store
Jun 30 01:33:33 tecoceph ceph-osd[11431]: 2019-06-30 01:33:33.267 7f2666a75d80 -1 ** ERROR: osd init failed: (5) Input/output error
Jun 30 01:33:33 tecoceph systemd[1]: ceph-osd@3.service: main process exited, code=exited, status=1/FAILURE

So I tried working with the bluestore-tool:

[root@tecoceph osd]# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-3/
inferring bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-3//block": {
        "osd_uuid": "c28c092c-00aa-4db0-9925-642bf99f0662",
        "size": 8001561821184,
        "btime": "2018-05-28 22:44:58.712336",
        "description": "main",
        "bluefs": "1",
        "ceph_fsid": "a9493143-3e4e-450e-b3b8-28508d48d412",
        "kv_backend": "rocksdb",
        "magic": "ceph osd volume v026",
        "mkfs_done": "yes",
        "osd_key": "AQBG************************",
        "ready": "ready",
        "whoami": "3"
    }
}

[root@tecoceph osd]# ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-3/
2019-06-30 14:38:35.998 7f9947432940 -1 rocksdb: NotFound:
2019-06-30 14:38:35.998 7f9947432940 -1 bluestore(/var/lib/ceph/osd/ceph-3/) _open_db erroring opening db:
error from fsck: (5) Input/output error

Trying to access the rocksdb with the kvstore-tool fails as well:

[root@tecoceph osd]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-3 list
2019-06-30 14:39:36.021 7faa747e8a80  1 rocksdb: do_open column families: []
failed to open type
2019-06-30 14:39:36.022 7faa747e8a80 -1 rocksdb: Invalid argument: /var/lib/ceph/osd/ceph-3: does not exist (create_if_missing is false)
rocksdb path /var/lib/ceph/osd/ceph-3: (22) Invalid argument
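(If I read the man page correctly, the plain rocksdb backend expects a rocksdb directory on a regular filesystem, but on a BlueStore OSD the DB lives inside BlueFS, so the bluestore-kv backend is presumably the one to use here; I would expect it to fail with the same _open_db error, though:)

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 list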
Repairing it with the kvstore-tool results in a segmentation fault…

[root@tecoceph osd]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-3 repair
*** Caught signal (Segmentation fault) **
 in thread 7ff8fde03a80 thread_name:ceph-kvstore-to
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0xf5d0) [0x7ff8f23925d0]
 2: (main()+0x2c4) [0x55ae6dadb4e4]
 3: (__libc_start_main()+0xf5) [0x7ff8f0d673d5]
 4: (()+0x21dde0) [0x55ae6dbafde0]
2019-06-30 14:39:15.785 7ff8fde03a80 -1 *** Caught signal (Segmentation fault) **
 in thread 7ff8fde03a80 thread_name:ceph-kvstore-to
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0xf5d0) [0x7ff8f23925d0]
 2: (main()+0x2c4) [0x55ae6dadb4e4]
 3: (__libc_start_main()+0xf5) [0x7ff8f0d673d5]
 4: (()+0x21dde0) [0x55ae6dbafde0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The third OSD aborts because of a wrong table magic number, and every tool I point at it crashes the same way:

Jun 30 01:32:29 tecoceph ceph-osd[8661]: -324> 2019-06-30 01:32:27.805 7fa9bd453d80 -1 Public network was set, but cluster network was not set
Jun 30 01:32:29 tecoceph ceph-osd[8661]: -324> 2019-06-30 01:32:27.805 7fa9bd453d80 -1 Using public network also for cluster network
Jun 30 01:32:29 tecoceph ceph-osd[8661]: -324> 2019-06-30 01:32:29.771 7fa9bd453d80 -1 abort: Corruption: Bad table magic number: expected 9863518390377041911, found 15656361161312523986 in db/002923.sst
Jun 30 01:32:29 tecoceph ceph-osd[8661]: -324> 2019-06-30 01:32:29.831 7fa9bd453d80 -1 *** Caught signal (Aborted) **

Is there any way to recover any of these OSDs?
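For context, my current fallback plan, assuming the OSDs really are unrecoverable (corrections very welcome), would be to salvage whatever BlueFS still exports for offline inspection and then rebuild the OSDs so the surviving replicas can backfill. Roughly, per OSD (paths are just examples):

# dump the BlueFS contents (i.e. the rocksdb files) for offline inspection
ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-3 --out-dir /root/osd3-bluefs

# if nothing can be salvaged: declare the OSD dead and recreate it
ceph osd out 3
ceph osd lost 3 --yes-i-really-mean-it
ceph osd purge 3 --yes-i-really-mean-it

With size=2 that of course only helps for PGs that still have a surviving copy on one of the five healthy OSDs, so I would much rather get at least one of the three back.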
Karlsruhe Institute of Technology (KIT)
Pervasive Computing Systems – TECO
Prof. Dr. Michael Beigl
IT Christian Wahl
Vincenz-Prießnitz-Str. 1
Building 07.07., 2nd floor
76131 Karlsruhe, Germany