OSD crashing - Corruption: block checksum mismatch

Hi all,

Since we upgraded our tiny 4-node, 15-OSD cluster from Nautilus to Pacific, we have been seeing issues with osd.15, which periodically crashes with:

   -10> 2021-12-02T11:52:50.716+0100 7f27071bc700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-12-02T11:52:20.721345+0100)
    -9> 2021-12-02T11:52:51.548+0100 7f2708efd700  5 prioritycache tune_memory target: 4294967296 mapped: 4041244672 unmapped: 479731712 heap: 4520976384 old mem: 2845415818 new mem: 2845415818
    -8> 2021-12-02T11:52:51.696+0100 7f270b702700  3 rocksdb: [db_impl/db_impl_compaction_flush.cc:2807] Compaction error: Corruption: block checksum mismatch: expected 3428654824, got 1987789945  in db/511261.sst offset 7219720 size 4044
    -7> 2021-12-02T11:52:51.696+0100 7f270b702700  4 rocksdb: (Original Log Time 2021/12/02-11:52:51.701026) [compaction/compaction_job.cc:743] [default] compacted to: files[4 1 21 0 0 0 0] max score 0.46, MB/sec: 83.7 rd, 0.0 wr, level 1, files in(4, 1) out(1) MB in(44.2, 35.2) out(0.0), read-write-amplify(1.8) write-amplify(0.0) Corruption: block checksum mismatch: expected 3428654824, got 1987789945  in db/511261.sst offset 7219720 size 4044, records in: 613843, records dropped: 288377 output_compression: NoCompression

    -6> 2021-12-02T11:52:51.696+0100 7f270b702700  4 rocksdb: (Original Log Time 2021/12/02-11:52:51.701047) EVENT_LOG_v1 {"time_micros": 1638442371701036, "job": 1640, "event": "compaction_finished", "compaction_time_micros": 995261, "compaction_time_cpu_micros": 899466, "output_level": 1, "num_output_files": 1, "total_output_size": 40875027, "num_input_records": 613843, "num_output_records": 325466, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 1, 21, 0, 0, 0, 0]}
    -5> 2021-12-02T11:52:51.696+0100 7f270b702700  2 rocksdb: [db_impl/db_impl_compaction_flush.cc:2341] Waiting after background compaction error: Corruption: block checksum mismatch: expected 3428654824, got 1987789945  in db/511261.sst offset 7219720 size 4044, Accumulated background error counts: 1
    -4> 2021-12-02T11:52:51.716+0100 7f27071bc700 10 monclient: tick
    -3> 2021-12-02T11:52:51.716+0100 7f27071bc700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-12-02T11:52:21.721429+0100)
    -2> 2021-12-02T11:52:51.788+0100 7f27046f4700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 3428654824, got 1987789945  in db/511261.sst offset 7219720 size 4044 code = ^B Rocksdb transaction:
PutCF( prefix = m key = 0x000000000000000700000000000008'^.0000042922.00000000000048814458' value size = 236)
PutCF( prefix = m key = 0x000000000000000700000000000008'^._fastinfo' value size = 186)
PutCF( prefix = O key = 0x7F80000000000000074161CBB7'!rbd_data.1cf93c843df86a.000000000000021d!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F0026000078 value size = 535)
PutCF( prefix = O key = 0x7F80000000000000074161CBB7'!rbd_data.1cf93c843df86a.000000000000021d!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value size = 420)
PutCF( prefix = L key = 0x0000000000C33283 value size = 4135)
    -1> 2021-12-02T11:52:51.800+0100 7f27046f4700 -1 ./src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7f27046f4700 time 2021-12-02T11:52:51.793784+0100
./src/os/bluestore/BlueStore.cc: 11650: FAILED ceph_assert(r == 0)

 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x55fe1a8e992e]
 2: /usr/bin/ceph-osd(+0xabaab9) [0x55fe1a8e9ab9]
 3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5ff) [0x55fe1aefd50f]
 4: (BlueStore::_kv_sync_thread()+0x1a23) [0x55fe1af3b3d3]
 5: (BlueStore::KVSyncThread::entry()+0xd) [0x55fe1af6492d]
 6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f2716046ea7]
 7: clone()

ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f2716052140]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x55fe1a8e9978]
 5: /usr/bin/ceph-osd(+0xabaab9) [0x55fe1a8e9ab9]
 6: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5ff) [0x55fe1aefd50f]
 7: (BlueStore::_kv_sync_thread()+0x1a23) [0x55fe1af3b3d3]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x55fe1af6492d]
 9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f2716046ea7]
 10: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This is a Proxmox VE hyperconverged (HC) cluster.

The node has 3 other OSDs, all filestore on HDD. osd.15 is SSD and bluestore. All nodes have one SSD/bluestore OSD and 2-3 HDD OSDs (some filestore and some bluestore).

osd.15 restarts gracefully after the crash and continues working OK for days or even 1-2 weeks.

We suspect some kind of (memory?) corruption or SSD malfunction on the node; other data may be getting corrupted too without us noticing, because the other OSDs on the node are filestore.

The problem appearing after the upgrade is suspicious, but it could be a coincidence...

Is there any way I could run some kind of "fsck" on osd.15, so that I can know it is good at a given moment? Any other suggestions for troubleshooting the issue? (Otherwise we'll be changing RAM modules to see if that helps...)
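
A minimal sketch of one check that can be run against the OSD logs in the meantime: tally the RocksDB "block checksum mismatch" messages by .sst file, offset and size, to see whether every crash reports the same block or a different one each time. The regular expression is derived only from the message format in the log excerpt above; the function names and the log path in the usage note are assumptions for illustration, not part of any Ceph tooling. Log lines that were hard-wrapped in the middle of a number or file name will not match.

```python
import re
from collections import Counter

# Pattern taken from the rocksdb messages quoted above, e.g.
# "block checksum mismatch: expected 3428654824, got 1987789945  in db/511261.sst offset 7219720 size 4044"
MISMATCH_RE = re.compile(
    r"block checksum mismatch: expected (\d+), got (\d+)\s+"
    r"in (\S+\.sst) offset (\d+) size (\d+)"
)


def tally_checksum_mismatches(lines):
    """Count mismatch messages per (sst file, offset, size) triple."""
    counts = Counter()
    for line in lines:
        # A single log line may repeat the message, so scan for all matches.
        for match in MISMATCH_RE.finditer(line):
            _expected, _got, sst, offset, size = match.groups()
            counts[(sst, int(offset), int(size))] += 1
    return counts


def tally_log_file(path):
    """Hypothetical helper: run the tally over one log file on disk."""
    with open(path, errors="replace") as handle:
        return tally_checksum_mismatches(handle)
```

Usage would be something like `tally_log_file("/var/log/ceph/ceph-osd.15.log").most_common()`, where the path is an assumption about where the OSD log lives on the node. The same (file, offset) showing up across separate crashes would suggest the bad block is persistent on disk, while a different location each time would fit better with the RAM or controller suspicion, though neither result is conclusive on its own.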

Thanks a lot


Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



