Just to add a hypothesis for why only the mon hosts are affected: higher
memory utilization on these nodes might be what causes the disk read
failures to appear. Perhaps a RAM leak (or simply excessive utilization)
in the MON processes?
On 8/26/2020 4:29 PM, Igor Fedotov wrote:
Hi Denis,
This reminds me of the following ticket:
https://tracker.ceph.com/issues/37282
Please note they mention co-location with a mon in comment #29.
The working hypothesis for that ticket is intermittent disk read failures
which cause RocksDB checksum failures. Earlier we observed such a problem
on the main device. Presumably it's heavy memory pressure that causes the
kernel to fail this way. See my comment #38 there.
So I'd like to see answers/comments to the following questions:
0) What is the backing disk layout for the OSDs in question (main device
type? additional DB/WAL devices?)
1) Please check all existing logs for the OSDs on the "failing" nodes for
other checksum errors (as per my comment #38).
2) Check whether BlueFS spillover is observed for any of the failing OSDs.
3) Check the "bluestore_reads_with_retries" performance counter for all
OSDs on the nodes in question. See comments 38-42 for the details. Any
non-zero values?
4) Start monitoring RAM usage and swapping on these nodes. See comment #39.
(A rough script covering 2)-4) is sketched below.)
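For 2)-4), something like this could be run on each node. It is only a rough
illustration, not a tested tool: it assumes the OSD admin sockets live under
/var/run/ceph/ and that the counters are exposed as "bluestore" /
"bluestore_reads_with_retries" and "bluefs" / "slow_used_bytes" (as in
Octopus); adjust socket paths and counter names to your deployment. A
non-zero slow_used_bytes on an OSD with a dedicated DB device is the
spillover indicator for 2).

#!/usr/bin/env python3
# Rough sketch: report read-retry counters, BlueFS slow-device usage and
# memory/swap state for all OSDs on this host. Assumes default admin socket
# locations under /var/run/ceph/; counter names may differ between releases.
import glob
import json
import re
import subprocess

def perf_dump(socket_path):
    # "ceph --admin-daemon <socket> perf dump" returns the daemon's
    # performance counters as JSON.
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", socket_path, "perf", "dump"])
    return json.loads(out)

for sock in sorted(glob.glob("/var/run/ceph/ceph-osd.*.asok")):
    osd_id = re.search(r"osd\.(\d+)", sock).group(1)
    counters = perf_dump(sock)
    retries = counters.get("bluestore", {}).get("bluestore_reads_with_retries", 0)
    # Non-zero slow_used_bytes on an OSD with a separate DB device means
    # BlueFS has spilled over onto the slow (main) device.
    spillover = counters.get("bluefs", {}).get("slow_used_bytes", 0)
    print(f"osd.{osd_id}: reads_with_retries={retries} "
          f"bluefs_slow_used_bytes={spillover}")

# Coarse view of memory pressure and swapping on the node (for 4).
with open("/proc/meminfo") as f:
    meminfo = dict(line.split(":", 1) for line in f)
for key in ("MemAvailable", "SwapTotal", "SwapFree"):
    print(f"{key}: {meminfo.get(key, 'n/a').strip()}")

Running it periodically (e.g. via cron) would give a rough history to
correlate with the crash times.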
Thanks,
Igor
On 8/26/2020 3:47 PM, Denis Krienbühl wrote:
Hi!
We've recently upgraded all our clusters from Mimic to Octopus (15.2.4).
Since then, our largest cluster is experiencing random crashes on OSDs
attached to the mon hosts.
This is the crash we are seeing (cut for brevity, see links in post
scriptum):
{
    "ceph_version": "15.2.4",
    "utsname_release": "4.15.0-72-generic",
    "assert_condition": "r == 0",
    "assert_func": "void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)",
    "assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc",
    "assert_line": 11430,
    "assert_thread_name": "bstore_kv_sync",
    "assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fc56311a700 time 2020-08-26T08:52:24.917083+0200\n/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: 11430: FAILED ceph_assert(r == 0)\n",
    "backtrace": [
        "(()+0x12890) [0x7fc576875890]",
        "(gsignal()+0xc7) [0x7fc575527e97]",
        "(abort()+0x141) [0x7fc575529801]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x559ef9ae97b5]",
        "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x559ef9ae993f]",
        "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) [0x559efa0245b0]",
        "(BlueStore::_kv_sync_thread()+0xbdd) [0x559efa07745d]",
        "(BlueStore::KVSyncThread::entry()+0xd) [0x559efa09cd3d]",
        "(()+0x76db) [0x7fc57686a6db]",
        "(clone()+0x3f) [0x7fc57560a88f]"
    ]
}
Right before the crash occurs, we see the following message in the
crash log:
-3> 2020-08-26T08:52:24.787+0200 7fc569b2d700  2 rocksdb: [db/db_impl_compaction_flush.cc:2212] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102 in db/815839.sst offset 67107066 size 3808, Accumulated background error counts: 1
-2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102 in db/815839.sst offset 67107066 size 3808 code = 2 Rocksdb transaction:
In short, when this happens we see a RocksDB corruption error after
background compaction.
When an OSD crashes, which happens about 10-15 times a day, it
restarts and
resumes work without any further problems.
We are pretty confident that this is not a hardware issue, due to the
following facts:
* The crashes occur on 5 different hosts over 3 different racks.
* There is no smartctl/dmesg output that could explain it.
* It usually happens to a different OSD, one that has not crashed before.
Still, we checked the following on a few OSDs/hosts:
* We can do a manual compaction, both offline and online.
* We successfully ran "ceph-bluestore-tool fsck --deep yes" on one of
the OSDs (a sketch of the invocations follows below).
* We manually compacted a number of OSDs, one of which crashed hours
later.
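For reference, roughly how these checks can be scripted per OSD. This is
only a sketch: the OSD ids and data paths are placeholders, the offline deep
fsck requires the OSD daemon to be stopped first, and the online compaction
uses "ceph tell osd.N compact" (available in Octopus; the daemon socket's
compact command does the same thing on the host).

#!/usr/bin/env python3
# Rough sketch of the two checks, meant to be run as separate steps:
# deep_fsck() while the OSD daemon is stopped, online_compact() while it
# is running. OSD ids and data paths are placeholders; adjust as needed.
import subprocess

def deep_fsck(osd_id):
    # Offline deep fsck; the OSD daemon must be stopped first.
    subprocess.run(
        ["ceph-bluestore-tool", "fsck", "--deep", "yes",
         "--path", f"/var/lib/ceph/osd/ceph-{osd_id}"],
        check=True)

def online_compact(osd_id):
    # Ask the running OSD to compact its RocksDB.
    subprocess.run(["ceph", "tell", f"osd.{osd_id}", "compact"], check=True)

if __name__ == "__main__":
    for osd_id in (0, 1, 2):  # placeholder OSD ids
        online_compact(osd_id)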
The only thing we have noticed so far: it only happens to OSDs that are
attached to a mon host. *None* of the OSDs on non-mon hosts have crashed!
Does anyone have a hint as to what could be causing this? We currently have
no good theory that explains it, much less a fix or workaround.
Any help would be greatly appreciated.
Denis
Crash:
https://public-resources.objects.lpg.cloudscale.ch/osd-crash/meta.txt
Log:
https://public-resources.objects.lpg.cloudscale.ch/osd-crash/log.txt
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx