Re: Random Crashes on OSDs Attached to Mon Hosts with Octopus

Igor Fedotov <ifedotov@xxxxxxx> · Wed, 26 Aug 2020 16:29:17 +0300

Hi Denis,

this reminds me of the following ticket: https://tracker.ceph.com/issues/37282

Please note they mentioned co-location with mon in comment #29.

The working hypothesis for this ticket is that intermittent disk read
failures cause the RocksDB checksum failures. Earlier we observed such a
problem on the main device. Presumably heavy memory pressure causes the
kernel to fail reads this way. See my comment #38 there.

So I'd like to see answers or comments on the following questions:

0) What is the backing disk layout for the OSDs in question (main device
type? additional DB/WAL devices?)

1) Please check all existing logs of the OSDs on the "failing" nodes for
other checksum errors (as per my comment #38).

2) Check whether BlueFS spillover is observed on any of the failing OSDs.

3) Check the "bluestore_reads_with_retries" performance counter for all
OSDs on the nodes in question. See comments 38-42 for the details. Any
non-zero values?

4) Start monitoring RAM usage and swapping on these nodes (see comment 39).
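
For question 3, here is a minimal sketch of how the counter could be checked. It assumes the JSON printed by "ceph daemon osd.<id> perf dump" has been captured as text for each OSD. The helper names and the sample data are hypothetical. The sketch searches the whole dump for the counter name rather than assuming which section it sits in.

```python
import json

COUNTER = "bluestore_reads_with_retries"


def find_counter(node, name=COUNTER):
    """Return the first value stored under key `name` in a nested dict, or None."""
    if isinstance(node, dict):
        if name in node:
            return node[name]
        for child in node.values():
            found = find_counter(child, name)
            if found is not None:
                return found
    return None


def osds_with_retries(dumps):
    """Map OSD name -> counter value for every OSD whose counter is non-zero.

    `dumps` maps an OSD name to the JSON text of its perf dump.
    """
    result = {}
    for osd, text in dumps.items():
        value = find_counter(json.loads(text))
        if value:  # skips both a missing counter (None) and a zero value
            result[osd] = value
    return result


# Hypothetical sample data, not taken from a real cluster.
sample = {
    "osd.1": '{"bluestore": {"bluestore_reads_with_retries": 0}}',
    "osd.2": '{"bluestore": {"bluestore_reads_with_retries": 3}}',
}
print(osds_with_retries(sample))
```

Any OSD that shows up in the result has had at least one read that needed a retry, which is what the ticket comments ask about.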

Thanks,

Igor

On 8/26/2020 3:47 PM, Denis Krienbühl wrote:
Hi!

We've recently upgraded all our clusters from Mimic to Octopus (15.2.4). Since
then, our largest cluster is experiencing random crashes on OSDs attached to the
mon hosts.

This is the crash we are seeing (cut for brevity, see links in post scriptum):

    {
        "ceph_version": "15.2.4",
        "utsname_release": "4.15.0-72-generic",
        "assert_condition": "r == 0",
        "assert_func": "void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)",
        "assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc",
        "assert_line": 11430,
        "assert_thread_name": "bstore_kv_sync",
        "assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fc56311a700 time 2020-08-26T08:52:24.917083+0200\n/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: 11430: FAILED ceph_assert(r == 0)\n",
        "backtrace": [
            "(()+0x12890) [0x7fc576875890]",
            "(gsignal()+0xc7) [0x7fc575527e97]",
            "(abort()+0x141) [0x7fc575529801]",
            "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x559ef9ae97b5]",
            "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x559ef9ae993f]",
            "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) [0x559efa0245b0]",
            "(BlueStore::_kv_sync_thread()+0xbdd) [0x559efa07745d]",
            "(BlueStore::KVSyncThread::entry()+0xd) [0x559efa09cd3d]",
            "(()+0x76db) [0x7fc57686a6db]",
            "(clone()+0x3f) [0x7fc57560a88f]"
        ]
    }

Right before the crash occurs, we see the following message in the crash log:

        -3> 2020-08-26T08:52:24.787+0200 7fc569b2d700  2 rocksdb: [db/db_impl_compaction_flush.cc:2212] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102  in db/815839.sst offset 67107066 size 3808, Accumulated background error counts: 1
        -2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102  in db/815839.sst offset 67107066 size 3808 code = 2 Rocksdb transaction:

In short, when this happens we see a RocksDB corruption error after a background compaction.
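
To look for the same kind of error in other OSD logs, a small script along the following lines may help. It is only a sketch: the regular expression is derived from the two log lines quoted above, the function name is made up for illustration, and other Ceph or RocksDB versions may word the message differently.

```python
import re

# Pattern derived from the log lines quoted above (an assumption about the format).
PATTERN = re.compile(
    r"block checksum mismatch: expected (?P<expected>\d+), got (?P<got>\d+)\s+"
    r"in (?P<file>\S+) offset (?P<offset>\d+) size (?P<size>\d+)"
)


def checksum_errors(lines):
    """Yield one dict per 'block checksum mismatch' message found in `lines`."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            fields = match.groupdict()
            yield {
                "file": fields["file"],
                "expected": int(fields["expected"]),
                "got": int(fields["got"]),
                "offset": int(fields["offset"]),
                "size": int(fields["size"]),
            }


# The second log line from above, split across string literals for readability.
log = [
    "-2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: submit_common error: "
    "Corruption: block checksum mismatch: expected 2548200440, got 2324967102  "
    "in db/815839.sst offset 67107066 size 3808 code = 2 Rocksdb transaction:",
]
errors = list(checksum_errors(log))
print(errors)
```

To scan a real log, pass an open file object instead of the list, since iterating over a file yields its lines.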

When an OSD crashes, which happens about 10-15 times a day, it restarts and
resumes work without any further problems.

We are pretty confident that this is not a hardware issue, due to the following facts:

* The crashes occur on 5 different hosts over 3 different racks.
* There is no smartctl/dmesg output that could explain it.
* It usually happens to a different OSD that did not crash before.

Still, we checked the following on a few OSDs/hosts:

* We can do a manual compaction, both offline and online.
* We successfully ran "ceph-bluestore-tool fsck --deep yes" on one of the OSDs.
* We manually compacted a number of OSDs, one of which crashed hours later.

The only thing we have noticed so far: It only happens to OSDs that are attached
to a mon host. *None* of the non-mon host OSDs have had a crash!

Does anyone have a hint as to what could be causing this? We currently have no
good theory that could explain this, much less a fix or workaround.

Any help would be greatly appreciated.

Denis

Crash: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/meta.txt
Log: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/log.txt

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx