Just to add a hypothesis for why only the mon hosts are affected: higher
memory utilization on these nodes might be what causes the disk read
failures to appear. Perhaps a RAM leak (or simply excessive utilization)
in the MON processes?
On 8/26/2020 4:29 PM, Igor Fedotov wrote:
Hi Denis,
This reminds me of the following ticket:
https://tracker.ceph.com/issues/37282
Please note they mention co-location with a mon in comment #29.
The working hypothesis for that ticket is intermittent disk read failures
which cause RocksDB checksum failures. Earlier we observed such a problem
on the main device. Presumably it's heavy memory pressure that causes the
kernel to fail this way. See my comment #38 there.
So I'd like to see answers/comments to the following questions:
0) What is the backing disk layout for the OSDs in question (main device
type? additional DB/WAL devices?)
1) Please check all existing logs for the OSDs on the "failing" nodes for
other checksum errors (as per my comment #38).
2) Check whether BlueFS spillover is observed for any of the failing OSDs.
3) Check the "bluestore_reads_with_retries" performance counter for all
OSDs on the nodes in question. See comments 38-42 for the details. Any
non-zero values?
4) Start monitoring RAM usage and swapping on these nodes. See comment #39.
(A rough script covering 2)-4) is sketched below.)
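For 2)-4), something like this could be run on each node. It is only a rough
illustration, not a tested tool: it assumes the OSD admin sockets live under
/var/run/ceph/ and that the counters are exposed as "bluestore" /
"bluestore_reads_with_retries" and "bluefs" / "slow_used_bytes" (as in
Octopus); adjust socket paths and counter names to your deployment. A
non-zero slow_used_bytes on an OSD with a dedicated DB device is the
spillover indicator for 2).

#!/usr/bin/env python3
# Rough sketch: report read-retry counters, BlueFS slow-device usage and
# memory/swap state for all OSDs on this host. Assumes default admin socket
# locations under /var/run/ceph/; counter names may differ between releases.
import glob
import json
import re
import subprocess

def perf_dump(socket_path):
    # "ceph --admin-daemon <socket> perf dump" returns the daemon's
    # performance counters as JSON.
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", socket_path, "perf", "dump"])
    return json.loads(out)

for sock in sorted(glob.glob("/var/run/ceph/ceph-osd.*.asok")):
    osd_id = re.search(r"osd\.(\d+)", sock).group(1)
    counters = perf_dump(sock)
    retries = counters.get("bluestore", {}).get("bluestore_reads_with_retries", 0)
    # Non-zero slow_used_bytes on an OSD with a separate DB device means
    # BlueFS has spilled over onto the slow (main) device.
    spillover = counters.get("bluefs", {}).get("slow_used_bytes", 0)
    print(f"osd.{osd_id}: reads_with_retries={retries} "
          f"bluefs_slow_used_bytes={spillover}")

# Coarse view of memory pressure and swapping on the node (for 4).
with open("/proc/meminfo") as f:
    meminfo = dict(line.split(":", 1) for line in f)
for key in ("MemAvailable", "SwapTotal", "SwapFree"):
    print(f"{key}: {meminfo.get(key, 'n/a').strip()}")

Running it periodically (e.g. via cron) would give a rough history to
correlate with the crash times.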
Thanks,
Igor
On 8/26/2020 3:47 PM, Denis Krienbühl wrote:
Hi!
We've recently upgraded all our clusters from Mimic to Octopus (15.2.4).
Since then, our largest cluster is experiencing random crashes on OSDs
attached to the mon hosts.
This is the crash we are seeing (cut for brevity, see links in post
scriptum):
{
    "ceph_version": "15.2.4",
    "utsname_release": "4.15.0-72-generic",
    "assert_condition": "r == 0",
    "assert_func": "void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)",
    "assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc",
    "assert_line": 11430,
    "assert_thread_name": "bstore_kv_sync",
    "assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fc56311a700 time 2020-08-26T08:52:24.917083+0200\n/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: 11430: FAILED ceph_assert(r == 0)\n",
    "backtrace": [
        "(()+0x12890) [0x7fc576875890]",
        "(gsignal()+0xc7) [0x7fc575527e97]",
        "(abort()+0x141) [0x7fc575529801]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x559ef9ae97b5]",
        "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x559ef9ae993f]",
        "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) [0x559efa0245b0]",
        "(BlueStore::_kv_sync_thread()+0xbdd) [0x559efa07745d]",
        "(BlueStore::KVSyncThread::entry()+0xd) [0x559efa09cd3d]",
        "(()+0x76db) [0x7fc57686a6db]",
        "(clone()+0x3f) [0x7fc57560a88f]"
    ]
}
Right before the crash occurs, we see the following message in the
crash log:
-3> 2020-08-26T08:52:24.787+0200 7fc569b2d700  2 rocksdb: [db/db_impl_compaction_flush.cc:2212] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102 in db/815839.sst offset 67107066 size 3808, Accumulated background error counts: 1
-2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102 in db/815839.sst offset 67107066 size 3808 code = 2 Rocksdb transaction:
In short, when this happens we see a RocksDB corruption error after
background compaction.
When an OSD crashes, which happens about 10-15 times a day, it
restarts and
resumes work without any further problems.
We are pretty confident that this is not a hardware issue, due to the
following facts:
* The crashes occur on 5 different hosts over 3 different racks.
* There is no smartctl/dmesg output that could explain it.
* It usually happens to a different OSD, one that has not crashed before.
Still, we checked the following on a few OSDs/hosts:
* We can do a manual compaction, both offline and online.
* We successfully ran "ceph-bluestore-tool fsck --deep yes" on one of
the OSDs (a sketch of the invocations follows below).
* We manually compacted a number of OSDs, one of which crashed hours
later.
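For reference, roughly how these checks can be scripted per OSD. This is
only a sketch: the OSD ids and data paths are placeholders, the offline deep
fsck requires the OSD daemon to be stopped first, and the online compaction
uses "ceph tell osd.N compact" (available in Octopus; the daemon socket's
compact command does the same thing on the host).

#!/usr/bin/env python3
# Rough sketch of the two checks, meant to be run as separate steps:
# deep_fsck() while the OSD daemon is stopped, online_compact() while it
# is running. OSD ids and data paths are placeholders; adjust as needed.
import subprocess

def deep_fsck(osd_id):
    # Offline deep fsck; the OSD daemon must be stopped first.
    subprocess.run(
        ["ceph-bluestore-tool", "fsck", "--deep", "yes",
         "--path", f"/var/lib/ceph/osd/ceph-{osd_id}"],
        check=True)

def online_compact(osd_id):
    # Ask the running OSD to compact its RocksDB.
    subprocess.run(["ceph", "tell", f"osd.{osd_id}", "compact"], check=True)

if __name__ == "__main__":
    for osd_id in (0, 1, 2):  # placeholder OSD ids
        online_compact(osd_id)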
The only thing we have noticed so far: it only happens to OSDs that are
attached to a mon host. *None* of the OSDs on non-mon hosts have crashed!
Does anyone have a hint as to what could be causing this? We currently have
no good theory that explains it, much less a fix or workaround.
Any help would be greatly appreciated.
Denis
Crash:
https://public-resources.objects.lpg.cloudscale.ch/osd-crash/meta.txt
Log:
https://public-resources.objects.lpg.cloudscale.ch/osd-crash/log.txt
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx