Hi Malcolm,
You might want to try ceph-objectstore-tool's export command to save the
PG to a file and then import it into another OSD.
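Something along these lines, as a rough sketch — the PG id, OSD ids, and
export file path below are placeholders you'd substitute for your own,
and both OSDs must be stopped while the tool runs:

```shell
# Stop the source OSD first; ceph-objectstore-tool needs exclusive
# access to the offline object store.
systemctl stop ceph-osd@11

# Export the down PG (1.2f is a placeholder pgid) from the failed OSD.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
    --pgid 1.2f --op export --file /tmp/pg.1.2f.export

# Import it into a healthy (also stopped) OSD, then restart that OSD
# so the cluster can peer the PG again.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --op import --file /tmp/pg.1.2f.export
systemctl start ceph-osd@12
```

Whether the export succeeds will depend on whether the tool trips over
the same RocksDB corruption, but it reads much less of the store than a
full repair does, so it's worth a try.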
Thanks,
Igor
On 18/12/2023 02:59, Malcolm Haak wrote:
Hello all,
I had an OSD go offline due to UWE. When I restarted the OSD service,
hoping to at least drain the undamaged data off it cleanly, the
ceph-osd process would crash.
I then attempted to repair it using ceph-bluestore-tool. fsck completes
without issue; however, repair crashes in exactly the same way that
ceph-osd does. The tail end of the output is attached here:
2023-12-17T20:24:53.320+1000 7fdb7bf17740 -1 rocksdb: submit_common
error: Corruption: block checksum mismatch: stored = 1106056583,
computed = 657190205, type = 1 in db/020524.sst offset 21626321 size
4014 code = Rocksdb transaction:
PutCF( prefix = S key = 'per_pool_omap' value size = 1)
-442> 2023-12-17T20:24:53.386+1000 7fdb7bf17740 -1
/usr/src/debug/ceph/ceph-18.2.0/src/os/bluestore/BlueStore.cc: In
function 'unsigned int BlueStoreRepairer::apply(KeyValueDB*)' thread
7fdb7bf17740 time 2023-12-17T20:24:53.341999+1000
/usr/src/debug/ceph/ceph-18.2.0/src/os/bluestore/BlueStore.cc: 17982:
FAILED ceph_assert(ok)
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x136) [0x7fdb7b6502c9]
2: /usr/lib/ceph/libceph-common.so.2(+0x2504a4) [0x7fdb7b6504a4]
3: (BlueStoreRepairer::apply(KeyValueDB*)+0x5af) [0x559afb98cc7f]
4: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x45fc)
[0x559afba2436c]
5: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x204)
[0x559afba31014]
6: main()
7: /usr/lib/libc.so.6(+0x27cd0) [0x7fdb7ae45cd0]
8: __libc_start_main()
9: _start()
-441> 2023-12-17T20:24:53.390+1000 7fdb7bf17740 -1 *** Caught signal
(Aborted) **
in thread 7fdb7bf17740 thread_name:ceph-bluestore-
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef
(stable)
1: /usr/lib/libc.so.6(+0x3e710) [0x7fdb7ae5c710]
2: /usr/lib/libc.so.6(+0x8e83c) [0x7fdb7aeac83c]
3: raise()
4: abort()
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x191) [0x7fdb7b650324]
6: /usr/lib/ceph/libceph-common.so.2(+0x2504a4) [0x7fdb7b6504a4]
7: (BlueStoreRepairer::apply(KeyValueDB*)+0x5af) [0x559afb98cc7f]
8: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x45fc)
[0x559afba2436c]
9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x204)
[0x559afba31014]
10: main()
11: /usr/lib/libc.so.6(+0x27cd0) [0x7fdb7ae45cd0]
12: __libc_start_main()
13: _start()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
The reason I need to get this OSD functioning is that two other OSDs
failed, leaving a single PG in the down state. The weird thing is, I
got one of those back up without issue (ceph-osd had crashed because
the root filesystem filled and the alert didn't send), but the PG is
still down. So I need to get this remaining OSD back up (or extract
its data) to bring that PG back from down.
Thanks in advance
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx