Hi all,
Since we upgraded our tiny 4-node, 15-OSD cluster from Nautilus to Pacific, we
have been seeing issues with osd.15, which periodically crashes with:
-10> 2021-12-02T11:52:50.716+0100 7f27071bc700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-12-02T11:52:20.721345+0100)
-9> 2021-12-02T11:52:51.548+0100 7f2708efd700 5 prioritycache
tune_memory target: 4294967296 mapped: 4041244672 unmapped: 479731712
heap: 4520976384 old mem: 2845415818 new mem: 2845415818
-8> 2021-12-02T11:52:51.696+0100 7f270b702700 3 rocksdb:
[db_impl/db_impl_compaction_flush.cc:2807] Compaction error: Corruption:
block checksum mismatch: expected 3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044
-7> 2021-12-02T11:52:51.696+0100 7f270b702700 4 rocksdb: (Original
Log Time 2021/12/02-11:52:51.701026) [compaction/compaction_job.cc:743]
[default] compacted to: files[4 1 21 0 0 0 0] max score 0.46, MB/sec: 83.7 rd, 0.0 wr, level 1, files in(4, 1) out(1) MB in(44.2, 35.2)
out(0.0), read-write-amplify(1.8) write-amplify(0.0) Corruption: block
checksum mismatch: expected 3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044, records in: 613843, records dropped: 288377
output_compression: NoCompression
-6> 2021-12-02T11:52:51.696+0100 7f270b702700 4 rocksdb: (Original
Log Time 2021/12/02-11:52:51.701047) EVENT_LOG_v1 {"time_micros":
1638442371701036, "job": 1640, "event": "compaction_finished", "compaction_time_micros": 995261, "compaction_time_cpu_micros": 899466,
"output_level": 1, "num_output_files": 1, "total_output_size": 40875027,
"num_input_records": 613843, "num_output_records": 325466, "num_subcompactio
ns": 1, "output_compression": "NoCompression",
"num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0,
"lsm_state": [4, 1, 21, 0, 0, 0, 0]}
-5> 2021-12-02T11:52:51.696+0100 7f270b702700 2 rocksdb:
[db_impl/db_impl_compaction_flush.cc:2341] Waiting after background
compaction error: Corruption: block checksum mismatch: expected
3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044, Accumulated
background error counts: 1
-4> 2021-12-02T11:52:51.716+0100 7f27071bc700 10 monclient: tick
-3> 2021-12-02T11:52:51.716+0100 7f27071bc700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-12-02T11:52:21.721429+0100)
-2> 2021-12-02T11:52:51.788+0100 7f27046f4700 -1 rocksdb:
submit_common error: Corruption: block checksum mismatch: expected
3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044
code = ^B Rocksdb transaction:
PutCF( prefix = m key =
0x000000000000000700000000000008'^.0000042922.00000000000048814458'
value size = 236)
PutCF( prefix = m key = 0x000000000000000700000000000008'^._fastinfo'
value size = 186)
PutCF( prefix = O key =
0x7F80000000000000074161CBB7'!rbd_data.1cf93c843df86a.000000000000021d!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F0026000078
value size = 535)
PutCF( prefix = O key =
0x7F80000000000000074161CBB7'!rbd_data.1cf93c843df86a.000000000000021d!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F
value size = 420)
PutCF( prefix = L key = 0x0000000000C33283 value size = 4135)
-1> 2021-12-02T11:52:51.800+0100 7f27046f4700 -1
./src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread
7f27046f4700 time 2021-12-02T11:52:51.793784+0100
./src/os/bluestore/BlueStore.cc: 11650: FAILED ceph_assert(r == 0)
ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x124) [0x55fe1a8e992e]
2: /usr/bin/ceph-osd(+0xabaab9) [0x55fe1a8e9ab9]
3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5ff)
[0x55fe1aefd50f]
4: (BlueStore::_kv_sync_thread()+0x1a23) [0x55fe1af3b3d3]
5: (BlueStore::KVSyncThread::entry()+0xd) [0x55fe1af6492d]
6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f2716046ea7]
7: clone()
ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific
(stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f2716052140]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x16e) [0x55fe1a8e9978]
5: /usr/bin/ceph-osd(+0xabaab9) [0x55fe1a8e9ab9]
6: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5ff)
[0x55fe1aefd50f]
7: (BlueStore::_kv_sync_thread()+0x1a23) [0x55fe1af3b3d3]
8: (BlueStore::KVSyncThread::entry()+0xd) [0x55fe1af6492d]
9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f2716046ea7]
10: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
This is a Proxmox VE HC cluster.
The affected node has 3 other OSDs, all filestore on HDD; osd.15 is bluestore
on SSD. All nodes have one SSD/bluestore OSD and 2-3 HDD OSDs (some filestore,
some bluestore).
osd.15 restarts gracefully after each crash and then works fine for days or
even 1-2 weeks.
We suspect some kind of corruption (memory?) or an SSD malfunction on that
node; maybe other data is also being corrupted and we just don't notice it
because the other OSDs are filestore.
The fact that the problem started right after the upgrade is suspicious, but it could be a coincidence...
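For the SSD-malfunction theory, the first thing we will probably check is the
drive's SMART data, something like this (just a sketch; /dev/sdX stands in for
whatever device osd.15's SSD actually is, and it assumes smartmontools is
installed):

# overall health, error counters and media wearout indicators
smartctl -a /dev/sdX
# optionally start a long self-test and read its result later with -a
smartctl -t long /dev/sdX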
Is there any way I could run some kind of "fsck" on osd.15, so that I can know
it is healthy at a given moment? Any other suggestions for troubleshooting the
issue? (Otherwise we'll be swapping RAM modules to see if that helps...)
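What we have in mind is something like the offline check below (this is just my
understanding of ceph-bluestore-tool usage, so please correct the syntax if it
is wrong; the OSD has to be stopped while it runs, and I think there is also a
--deep option that reads and validates the object data as well):

systemctl stop ceph-osd@15
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-15
systemctl start ceph-osd@15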
Thanks a lot
Eneko Lacunza
Technical Director
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/