Hi all,
Since we upgraded our tiny 4-node, 15-OSD cluster from Nautilus to Pacific, we
have been seeing issues with osd.15, which periodically crashes with:
-10> 2021-12-02T11:52:50.716+0100 7f27071bc700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-12-02T11:52:20.721345+0100)
-9> 2021-12-02T11:52:51.548+0100 7f2708efd700 5 prioritycache
tune_memory target: 4294967296 mapped: 4041244672 unmapped: 479731712
heap: 4520976384 old mem: 2845415818 new mem: 2845415818
-8> 2021-12-02T11:52:51.696+0100 7f270b702700 3 rocksdb:
[db_impl/db_impl_compaction_flush.cc:2807] Compaction error: Corruption:
block checksum mismatch: expected 3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044
-7> 2021-12-02T11:52:51.696+0100 7f270b702700 4 rocksdb: (Original
Log Time 2021/12/02-11:52:51.701026) [compaction/compaction_job.cc:743]
[default] compacted to: files[4 1 21 0 0 0 0] max score 0.46, MB/sec: 83.7 rd, 0.0 wr, level 1, files in(4, 1) out(1) MB in(44.2, 35.2)
out(0.0), read-write-amplify(1.8) write-amplify(0.0) Corruption: block
checksum mismatch: expected 3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044, records in: 613843, records dropped: 288377
output_compression: NoCompression
-6> 2021-12-02T11:52:51.696+0100 7f270b702700 4 rocksdb: (Original
Log Time 2021/12/02-11:52:51.701047) EVENT_LOG_v1 {"time_micros":
1638442371701036, "job": 1640, "event": "compaction_finished", "compaction_time_micros": 995261, "compaction_time_cpu_micros": 899466,
"output_level": 1, "num_output_files": 1, "total_output_size": 40875027,
"num_input_records": 613843, "num_output_records": 325466, "num_subcompactio
ns": 1, "output_compression": "NoCompression",
"num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0,
"lsm_state": [4, 1, 21, 0, 0, 0, 0]}
-5> 2021-12-02T11:52:51.696+0100 7f270b702700 2 rocksdb:
[db_impl/db_impl_compaction_flush.cc:2341] Waiting after background
compaction error: Corruption: block checksum mismatch: expected
3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044, Accumulated
background error counts: 1
-4> 2021-12-02T11:52:51.716+0100 7f27071bc700 10 monclient: tick
-3> 2021-12-02T11:52:51.716+0100 7f27071bc700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-12-02T11:52:21.721429+0100)
-2> 2021-12-02T11:52:51.788+0100 7f27046f4700 -1 rocksdb:
submit_common error: Corruption: block checksum mismatch: expected
3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044
code = ^B Rocksdb transaction:
PutCF( prefix = m key =
0x000000000000000700000000000008'^.0000042922.00000000000048814458'
value size = 236)
PutCF( prefix = m key = 0x000000000000000700000000000008'^._fastinfo'
value size = 186)
PutCF( prefix = O key =
0x7F80000000000000074161CBB7'!rbd_data.1cf93c843df86a.000000000000021d!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F0026000078
value size = 535)
PutCF( prefix = O key =
0x7F80000000000000074161CBB7'!rbd_data.1cf93c843df86a.000000000000021d!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F
value size = 420)
PutCF( prefix = L key = 0x0000000000C33283 value size = 4135)
-1> 2021-12-02T11:52:51.800+0100 7f27046f4700 -1
./src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread
7f27046f4700 time 2021-12-02T11:52:51.793784+0100
./src/os/bluestore/BlueStore.cc: 11650: FAILED ceph_assert(r == 0)
ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x124) [0x55fe1a8e992e]
2: /usr/bin/ceph-osd(+0xabaab9) [0x55fe1a8e9ab9]
3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5ff)
[0x55fe1aefd50f]
4: (BlueStore::_kv_sync_thread()+0x1a23) [0x55fe1af3b3d3]
5: (BlueStore::KVSyncThread::entry()+0xd) [0x55fe1af6492d]
6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f2716046ea7]
7: clone()
ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific
(stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f2716052140]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x16e) [0x55fe1a8e9978]
5: /usr/bin/ceph-osd(+0xabaab9) [0x55fe1a8e9ab9]
6: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5ff)
[0x55fe1aefd50f]
7: (BlueStore::_kv_sync_thread()+0x1a23) [0x55fe1af3b3d3]
8: (BlueStore::KVSyncThread::entry()+0xd) [0x55fe1af6492d]
9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f2716046ea7]
10: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
This is a Proxmox VE HC cluster.
The affected node has 3 other OSDs, all filestore on HDD; osd.15 is bluestore
on SSD. All nodes have one SSD/bluestore OSD and 2-3 HDD OSDs (some filestore,
some bluestore).
osd.15 restarts gracefully after each crash and then works fine for days or
even 1-2 weeks.
We suspect some kind of corruption (memory?) or an SSD malfunction on that
node; maybe other data is also being corrupted and we just don't notice it
because the other OSDs are filestore.
The fact that the problem started right after the upgrade is suspicious, but it could be a coincidence...
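For the SSD-malfunction theory, the first thing we will probably check is the
drive's SMART data, something like this (just a sketch; /dev/sdX stands in for
whatever device osd.15's SSD actually is, and it assumes smartmontools is
installed):

# overall health, error counters and media wearout indicators
smartctl -a /dev/sdX
# optionally start a long self-test and read its result later with -a
smartctl -t long /dev/sdX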
Is there any way I could run some kind of "fsck" on osd.15, so that I can know
it is healthy at a given moment? Any other suggestions for troubleshooting the
issue? (Otherwise we'll be swapping RAM modules to see if that helps...)
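What we have in mind is something like the offline check below (this is just my
understanding of ceph-bluestore-tool usage, so please correct the syntax if it
is wrong; the OSD has to be stopped while it runs, and I think there is also a
--deep option that reads and validates the object data as well):

systemctl stop ceph-osd@15
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-15
systemctl start ceph-osd@15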
Thanks a lot
Eneko Lacunza
Technical Director
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/