Re: Random Crashes on OSDs Attached to Mon Hosts with Octopus

Denis Krienbühl <denis@xxxxxxx> · Thu, 27 Aug 2020 09:06:05 +0200

Hi Igor,

Thanks for your input. I tried to gather as much information as I could to
answer your questions. Hopefully we can get to the bottom of this.

> 0) What is backing disks layout for OSDs in question (main device type?, additional DB/WAL devices?).

Everything is on a single Intel NVMe P4510 using dmcrypt with 2 OSDs per NVMe
device. There is no additional DB/WAL device and there are no HDDs involved.

Also note that we use 40 OSDs per host with a memory target of 6'174'015'488 bytes per OSD.
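
As a rough sanity check on these numbers, the sketch below adds up the per-OSD
memory targets for one host. It uses the 40 OSDs and the target quoted above,
plus the 256GB of RAM mentioned further down. Reading that figure as GiB is an
assumption, and the names are only illustrative. Keep in mind that the target is
a value the OSDs aim for, not a hard limit.

```python
# Figures taken from this message; treating the host RAM as GiB is an assumption.
OSDS_PER_HOST = 40
MEMORY_TARGET_BYTES = 6_174_015_488  # per-OSD memory target
HOST_RAM_GIB = 256                   # assumed to mean GiB

GIB = 1024 ** 3


def aggregate_target_gib(osds: int, target_bytes: int) -> float:
    """Sum of the per-OSD memory targets on one host, in GiB."""
    return osds * target_bytes / GIB


total = aggregate_target_gib(OSDS_PER_HOST, MEMORY_TARGET_BYTES)
headroom = HOST_RAM_GIB - total

print(f"per-OSD target: {MEMORY_TARGET_BYTES / GIB:.2f} GiB")
print(f"aggregate     : {total:.1f} GiB")
print(f"headroom      : {headroom:.1f} GiB left for mon, kernel and page cache")
```

Under these assumptions the OSD targets alone add up to 230 GiB, which leaves
about 26 GiB for everything else on the host.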

> 1) Please check all the existing logs for OSDs at "failing" nodes for other checksum errors (as per my comment #38)

I grepped the logs for "checksum mismatch" and "_verify_csum". The only
occurrences I could find were the ones that precede the crashes.
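
For anyone who wants to repeat this check, here is a minimal sketch of the same
search in Python. The two search strings are the ones named above; the default
log path is an assumption and needs to be adjusted to the local log layout.

```python
import glob
import re
from collections import Counter

# The two strings searched for in this message.
PATTERN = re.compile(r"checksum mismatch|_verify_csum")


def count_matches(lines):
    """Count how often each search string occurs in an iterable of log lines."""
    hits = Counter()
    for line in lines:
        for match in PATTERN.finditer(line):
            hits[match.group(0)] += 1
    return hits


def scan_logs(log_glob="/var/log/ceph/ceph-osd.*.log"):
    """Return {path: Counter} for every log file with at least one hit.

    The default glob is an assumed location, not taken from this thread.
    """
    results = {}
    for path in sorted(glob.glob(log_glob)):
        with open(path, errors="replace") as fh:
            hits = count_matches(fh)
        if hits:
            results[path] = hits
    return results
```

Calling `scan_logs()` on an OSD host returns only the files that contain a
match, so an empty result means neither string was found.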

> 2) Check if BlueFS spillover is observed for any failing OSDs.

As everything is on the same device, there can be no spillover, right?

> 3) Check "bluestore_reads_with_retries" performance counters for all OSDs at nodes in question. See comments 38-42 on the details. Any non-zero values?

I monitored this overnight by repeatedly polling this performance counter across
all OSDs on the mon hosts. Only one OSD, which has crashed in the past, has had a
value of 1 since I started measuring. All the other OSDs, including the ones
that crashed overnight, have a value of 0, both before and after the crash.
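
A polling loop along these lines could look like the sketch below. It assumes
the counter is reported in the "bluestore" section of `ceph daemon osd.<id> perf
dump`, and that it runs on the OSD host with access to the admin sockets. The
function names are made up for illustration.

```python
import json
import subprocess

COUNTER = "bluestore_reads_with_retries"


def reads_with_retries(perf_dump: dict) -> int:
    """Extract the counter from a parsed perf dump; 0 if it is absent.

    Assumes the counter sits in the "bluestore" section of the dump.
    """
    return int(perf_dump.get("bluestore", {}).get(COUNTER, 0))


def poll_osd(osd_id: int) -> int:
    """Query one local OSD through its admin socket."""
    result = subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"],
        check=True,
        capture_output=True,
        text=True,
    )
    return reads_with_retries(json.loads(result.stdout))


def nonzero_counters(osd_ids):
    """Return {osd_id: value} for every OSD whose counter is above zero."""
    found = {}
    for osd_id in osd_ids:
        value = poll_osd(osd_id)
        if value > 0:
            found[osd_id] = value
    return found
```

Running `nonzero_counters(...)` from cron or a loop and logging the result with
a timestamp is enough to see whether a value changes around a crash.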

> 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.

The memory use of those nodes is pretty constant, with ~6GB free and ~25GB available out of 256GB.
There are also only a handful of pages being swapped, if any.
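
The free, available and swap figures can be sampled from /proc/meminfo. The
following is a minimal sketch of such a sampler, not the monitoring we actually
run; the function names are illustrative.

```python
KIB_PER_GIB = 1024 ** 2


def parse_meminfo(text: str) -> dict:
    """Parse the content of /proc/meminfo into {field: value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def summarize(info: dict) -> dict:
    """Reduce the parsed fields to free, available and used swap, in GiB."""
    return {
        "free_gib": info["MemFree"] / KIB_PER_GIB,
        "available_gib": info["MemAvailable"] / KIB_PER_GIB,
        "swap_used_gib": (info["SwapTotal"] - info["SwapFree"]) / KIB_PER_GIB,
    }


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and summarize the current memory state of this host."""
    with open(path) as fh:
        return summarize(parse_meminfo(fh.read()))
```

Sampling `read_meminfo()` once a minute and plotting the three values makes a
sudden drop in available memory easy to line up with crash timestamps.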

> a hypothesis why mon hosts are affected only  - higher memory utilization at these nodes is what causes disk reading failures to appear. RAM leakage (or excessive utilization) in MON processes or something?

Since the memory usage is rather constant, I'm not sure this is the case; I think
we would otherwise see more of an up/down pattern. However, we are not yet monitoring
all processes. That is something I'd like to get some data on, but I'm not
sure it is the right course of action at the moment.
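
If we do go down that route, per-process data could be gathered with something
like the sketch below, which sums the resident set size per process name from
/proc/<pid>/status. This is only one possible approach under the assumption
that RSS is the figure of interest; the function names are illustrative.

```python
import glob


def parse_status(text: str):
    """Return (name, rss_kib) from the content of /proc/<pid>/status.

    Kernel threads have no VmRSS line, so their RSS is reported as 0.
    """
    name, rss_kib = "", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss_kib = int(line.split()[1])
    return name, rss_kib


def rss_by_name() -> dict:
    """Sum the resident set size in KiB of all processes, grouped by name."""
    totals = {}
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as fh:
                name, rss_kib = parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        totals[name] = totals.get(name, 0) + rss_kib
    return totals
```

Comparing the totals for the mon and OSD process names over time would show
whether any of them grows on the mon hosts.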

What do you think, is it still plausible that we see a memory utilization
problem, even though there's little variance in the memory usage patterns?

The approaches we are currently considering are to upgrade our kernel and to lower the
memory target somewhat.

Cheers,

Denis

> On 26 Aug 2020, at 15:29, Igor Fedotov <ifedotov@xxxxxxx> wrote:
> 
> Hi Denis,
> 
> this reminds me of the following ticket: https://tracker.ceph.com/issues/37282
> 
> Please note they mentioned co-location with mon in comment #29.
> 
> 
> Working hypothesis for this ticket is the interim disk read failures which cause RocksDB checksum failures. Earlier we observed such a problem for main device. Presumably it's heavy memory pressure which causes kernel to be failing this way.  See my comment #38 there.
> 
> So I'd like to see answers/comments for the following questions:
> 
> 0) What is backing disks layout for OSDs in question (main device type?, additional DB/WAL devices?).
> 
> 1) Please check all the existing logs for OSDs at "failing" nodes for other checksum errors (as per my comment #38)
> 
> 2) Check if BlueFS spillover is observed for any failing OSDs.
> 
> 3) Check "bluestore_reads_with_retries" performance counters for all OSDs at nodes in question. See comments 38-42 on the details. Any non-zero values?
> 
> 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.
> 
> 
> Thanks,
> 
> Igor
> 
> On 8/26/2020 3:47 PM, Denis Krienbühl wrote:
>> Hi!
>> 
>> We've recently upgraded all our clusters from Mimic to Octopus (15.2.4). Since
>> then, our largest cluster is experiencing random crashes on OSDs attached to the
>> mon hosts.
>> 
>> This is the crash we are seeing (cut for brevity, see links in post scriptum):
>> 
>>    {
>>        "ceph_version": "15.2.4",
>>        "utsname_release": "4.15.0-72-generic",
>>        "assert_condition": "r == 0",
>>        "assert_func": "void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)",
>>        "assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc",
>>        "assert_line": 11430,
>>        "assert_thread_name": "bstore_kv_sync",
>>        "assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fc56311a700 time 2020-08-26T08:52:24.917083+0200\n/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: 11430: FAILED ceph_assert(r == 0)\n",
>>        "backtrace": [
>>            "(()+0x12890) [0x7fc576875890]",
>>            "(gsignal()+0xc7) [0x7fc575527e97]",
>>            "(abort()+0x141) [0x7fc575529801]",
>>            "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x559ef9ae97b5]",
>>            "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x559ef9ae993f]",
>>            "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) [0x559efa0245b0]",
>>            "(BlueStore::_kv_sync_thread()+0xbdd) [0x559efa07745d]",
>>            "(BlueStore::KVSyncThread::entry()+0xd) [0x559efa09cd3d]",
>>            "(()+0x76db) [0x7fc57686a6db]",
>>            "(clone()+0x3f) [0x7fc57560a88f]"
>>        ]
>>    }
>> 
>> Right before the crash occurs, we see the following message in the crash log:
>> 
>>        -3> 2020-08-26T08:52:24.787+0200 7fc569b2d700  2 rocksdb: [db/db_impl_compaction_flush.cc:2212] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102  in db/815839.sst offset 67107066 size 3808, Accumulated background error counts: 1
>>        -2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102  in db/815839.sst offset 67107066 size 3808 code = 2 Rocksdb transaction:
>> 
>> In short, whenever this happens, we see a RocksDB corruption error after background compaction.
>> 
>> When an OSD crashes, which happens about 10-15 times a day, it restarts and
>> resumes work without any further problems.
>> 
>> We are pretty confident that this is not a hardware issue, due to the following facts:
>> 
>> * The crashes occur on 5 different hosts over 3 different racks.
>> * There is no smartctl/dmesg output that could explain it.
>> * It usually happens to a different OSD that did not crash before.
>> 
>> Still we checked the following on a few OSDs/hosts:
>> 
>> * We can do a manual compaction, both offline and online.
>> * We successfully ran "ceph-bluestore-tool fsck --deep yes" on one of the OSDs.
>> * We manually compacted a number of OSDs, one of which crashed hours later.
>> 
>> The only thing we have noticed so far: It only happens to OSDs that are attached
>> to a mon host. *None* of the non-mon host OSDs have had a crash!
>> 
>> Does anyone have a hint what could be causing this? We currently have no good
>> theory that could explain this, much less have a fix or workaround.
>> 
>> Any help would be greatly appreciated.
>> 
>> Denis
>> 
>> Crash: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/meta.txt
>> Log: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/log.txt
>> 
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx