Hi Eneko,
I don't think this is a memory H/W issue. This reminds me of the
following thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/DEOBAUXQBUFL6HNBBNJ3LMQUCQC76HLY/
There was apparently data corruption in RocksDB which popped up only
during DB compaction. There were no issues during regular access, but
sometimes the internal auto-compaction procedure triggered the crash.
In the end they disabled auto-compaction, took the data out of that
OSD, and redeployed it. You might want to try the same approach if you
need the data from this OSD, or simply redeploy it.
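Something along these lines might work (just a sketch, not commands to
copy verbatim: the value you set for bluestore_rocksdb_options must
keep the full existing option string, and /dev/sdX is a placeholder
for the real device):

# disable RocksDB auto-compaction for osd.15 (append to the existing
# options, don't replace them), then restart the OSD:
ceph config get osd.15 bluestore_rocksdb_options
ceph config set osd.15 bluestore_rocksdb_options "<existing options>,disable_auto_compactions=true"
systemctl restart ceph-osd@15

# drain the OSD, then redeploy it once it is safe to remove:
ceph osd out 15
while ! ceph osd safe-to-destroy osd.15; do sleep 60; done
systemctl stop ceph-osd@15
ceph osd purge 15 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --data /dev/sdX   # or use the pveceph equivalent on Proxmox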
Thanks,
Igor
On 12/2/2021 2:21 PM, Eneko Lacunza wrote:
Hi all,
Since we upgraded our tiny 4-node, 15-OSD cluster from Nautilus to
Pacific, we have been seeing issues with osd.15, which periodically
crashes with:
-10> 2021-12-02T11:52:50.716+0100 7f27071bc700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-12-02T11:52:20.721345+0100)
-9> 2021-12-02T11:52:51.548+0100 7f2708efd700 5 prioritycache
tune_memory target: 4294967296 mapped: 4041244672 unmapped: 479731712
heap: 4520976384 old mem: 2845415818 new mem: 2845415818
-8> 2021-12-02T11:52:51.696+0100 7f270b702700 3 rocksdb:
[db_impl/db_impl_compaction_flush.cc:2807] Compaction error:
Corruption: block checksum mismatch: expected 3428654824, got
1987789945 in db/511261.sst offset 7219720 size 4044
-7> 2021-12-02T11:52:51.696+0100 7f270b702700 4 rocksdb:
(Original Log Time 2021/12/02-11:52:51.701026)
[compaction/compaction_job.cc:743] [default] compacted to: files[4 1
21 0 0 0 0] max score 0.46, MB/sec: 83.7 rd, 0.0 wr, level 1, files
in(4, 1) out(1) MB in(44.2, 35.2) out(0.0), read-write-amplify(1.8)
write-amplify(0.0) Corruption: block checksum mismatch: expected
3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044,
records in: 613843, records dropped: 288377 output_compression:
NoCompression
-6> 2021-12-02T11:52:51.696+0100 7f270b702700 4 rocksdb:
(Original Log Time 2021/12/02-11:52:51.701047) EVENT_LOG_v1
{"time_micros": 1638442371701036, "job": 1640, "event":
"compaction_finished", "compactio
n_time_micros": 995261, "compaction_time_cpu_micros": 899466,
"output_level": 1, "num_output_files": 1, "total_output_size":
40875027, "num_input_records": 613843, "num_output_records": 325466,
"num_subcompactio
ns": 1, "output_compression": "NoCompression",
"num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0,
"lsm_state": [4, 1, 21, 0, 0, 0, 0]}
-5> 2021-12-02T11:52:51.696+0100 7f270b702700 2 rocksdb:
[db_impl/db_impl_compaction_flush.cc:2341] Waiting after background
compaction error: Corruption: block checksum mismatch: expected
3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044,
Accumulated
background error counts: 1
-4> 2021-12-02T11:52:51.716+0100 7f27071bc700 10 monclient: tick
-3> 2021-12-02T11:52:51.716+0100 7f27071bc700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-12-02T11:52:21.721429+0100)
-2> 2021-12-02T11:52:51.788+0100 7f27046f4700 -1 rocksdb:
submit_common error: Corruption: block checksum mismatch: expected
3428654824, got 1987789945 in db/511261.sst offset 7219720 size 4044
code = ^B Rocksdb transaction:
PutCF( prefix = m key =
0x000000000000000700000000000008'^.0000042922.00000000000048814458'
value size = 236)
PutCF( prefix = m key = 0x000000000000000700000000000008'^._fastinfo'
value size = 186)
PutCF( prefix = O key =
0x7F80000000000000074161CBB7'!rbd_data.1cf93c843df86a.000000000000021d!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F0026000078
value size = 535)
PutCF( prefix = O key =
0x7F80000000000000074161CBB7'!rbd_data.1cf93c843df86a.000000000000021d!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F
value size = 420)
PutCF( prefix = L key = 0x0000000000C33283 value size = 4135)
-1> 2021-12-02T11:52:51.800+0100 7f27046f4700 -1
./src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread
7f27046f4700 time 2021-12-02T11:52:51.793784+0100
./src/os/bluestore/BlueStore.cc: 11650: FAILED ceph_assert(r == 0)
ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a)
pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x124) [0x55fe1a8e992e]
2: /usr/bin/ceph-osd(+0xabaab9) [0x55fe1a8e9ab9]
3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5ff)
[0x55fe1aefd50f]
4: (BlueStore::_kv_sync_thread()+0x1a23) [0x55fe1af3b3d3]
5: (BlueStore::KVSyncThread::entry()+0xd) [0x55fe1af6492d]
6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f2716046ea7]
7: clone()
ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific
(stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f2716052140]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x16e) [0x55fe1a8e9978]
5: /usr/bin/ceph-osd(+0xabaab9) [0x55fe1a8e9ab9]
6: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5ff)
[0x55fe1aefd50f]
7: (BlueStore::_kv_sync_thread()+0x1a23) [0x55fe1af3b3d3]
8: (BlueStore::KVSyncThread::entry()+0xd) [0x55fe1af6492d]
9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f2716046ea7]
10: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
This is a Proxmox VE HC cluster.
The node has 3 other OSDs, all filestore on HDD. osd.15 is bluestore
on SSD. All nodes have one SSD/bluestore OSD and 2-3 HDD OSDs (some
filestore and some bluestore).
osd.15 restarts gracefully after the crash and continues working OK
for days or even 1-2 weeks.
We suspect some kind of (memory?) corruption or SSD malfunction on
the node; maybe other data is being corrupted too and we just don't
notice because the other OSDs are filestore.
That the problem started after the upgrade is suspicious, but it could
be a coincidence...
Is there any way I could run some kind of "fsck" on osd.15, so that I
can verify it is healthy at a given moment? Any other suggestions for
troubleshooting the issue? (Otherwise we'll be swapping RAM modules to
see if that helps...)
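(I see that ceph-bluestore-tool has an fsck command; would something
like this, run with the OSD stopped and assuming the default data
path, be the right way to check it?)

systemctl stop ceph-osd@15
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-15 --deep true   # --deep also reads and verifies object data
systemctl start ceph-osd@15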
Thanks a lot
Eneko Lacunza
Technical Director
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
--
Igor Fedotov
Ceph Lead Developer
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx