Re: Random Crashes on OSDs Attached to Mon Hosts with Octopus

Hi Igor

To bring this thread to a conclusion: We managed to stop the random crashes by restarting each of the OSDs manually.

After upgrading the cluster we reshuffled a lot of our data by changing PG counts. It seems like the memory reserved during that time was never released back to the OS.

We did not see any change in swap usage; swap page in/out was actually lower than before the upgrade. Still, in the days following the restart the OSDs did not grow back to the memory usage they had before it. We also stopped seeing random crashes.
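
For anyone hitting the same symptoms, the restart and the before/after memory comparison were roughly the following. Unit names and OSD ids depend on the deployment, so treat this as a sketch rather than a recipe:

    # restart one OSD at a time (systemd-managed, non-cephadm deployment assumed)
    systemctl restart ceph-osd@<id>

    # the OSD's own view of its memory pools, for comparing before and after
    ceph daemon osd.<id> dump_mempools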

I can’t say definitively what the error was, but for us these random crashes were solved by restarting all OSDs. Maybe this helps somebody else searching for this error in the future.

Thanks again for your help!

Denis

> On 27 Aug 2020, at 13:46, Denis Krienbühl <denis@xxxxxxx> wrote:
> 
> Hi Igor
> 
> Just to clarify:
> 
>>> I grepped the logs for "checksum mismatch" and "_verify_csum". The only
>>> occurrences I could find were the ones that precede the crashes.
>> 
>> To be precise: are you able to find multiple _verify_csum occurrences?
> 
> There are no “_verify_csum” entries whatsoever; I stated that incorrectly.
> I could only find “checksum mismatch” right when the crash happens.
> 
> Sorry for the confusion.
> 
> I will keep tracking those counters and have a look at monitor/OSD memory usage.
> 
> Cheers,
> 
> Denis
> 
>> On 27 Aug 2020, at 13:39, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>> 
>> Hi Denis
>> 
>> please see my comments inline.
>> 
>> 
>> Thanks,
>> 
>> Igor
>> 
>> On 8/27/2020 10:06 AM, Denis Krienbühl wrote:
>>> Hi Igor,
>>> 
>>> Thanks for your input. I tried to gather as much information as I could to
>>> answer your questions. Hopefully we can get to the bottom of this.
>>> 
>>>> 0) What is the backing disk layout for the OSDs in question (main device type? additional DB/WAL devices?).
>>> Everything is on a single Intel NVMe P4510 using dmcrypt with 2 OSDs per NVMe
>>> device. There is no additional DB/WAL device and there are no HDDs involved.
>>> 
>>> Also note that we use 40 OSDs per host with a memory target of 6'174'015'488.
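>>> 
>>> (For scale: 40 OSDs x 6'174'015'488 bytes comes to roughly 247 GB of combined
>>> memory target per host, before any non-OSD processes are accounted for.)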
>>> 
>>>> 1) Please check all the existing logs for OSDs at "failing" nodes for other checksum errors (as per my comment #38)
>>> I grepped the logs for "checksum mismatch" and "_verify_csum". The only
>>> occurrences I could find were the ones that precede the crashes.
>> 
>> To be precise: are you able to find multiple _verify_csum occurrences?
>> 
>> If so, this means data read failures were observed for user data, not RocksDB data, which backs the hypothesis of interim disk read errors as the root cause.
>> 
>> User data reads go through quite a different access stack and are able to retry after such errors, hence they aren't as visible.
>> 
>> But having checksum failures for both DB and user data points to the same root cause at lower layers (kernel, I/O stack etc).
>> 
>> It might be interesting to see whether the _verify_csum and RocksDB checksum errors happened at around the same time - not necessarily for a single OSD, but across different OSDs on the same node.
>> 
>> This might indicate that the node was suffering from some disease at that time. Anything suspicious in the system-wide logs for this period?
>> 
>>> 
>>>> 2) Check if BlueFS spillover is observed for any failing OSDs.
>>> As everything is on the same device, there can be no spillover, right?
>> Right
>>> 
>>>> 3) Check "bluestore_reads_with_retries" performance counters for all OSDs at nodes in question. See comments 38-42 on the details. Any non-zero values?
>>> I monitored this overnight by repeatedly polling this performance counter over
>>> all OSDs on the mons. Only one OSD, which has crashed in the past, has had a
>>> value of 1 since I started measuring. All the other OSDs, including the ones
>>> that crashed overnight, have a value of 0, before and after the crash.
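>>> 
>>> For reference, the polling was done per host via the admin socket, roughly like
>>> this (the jq filter is just an illustration, not part of our tooling):
>>> 
>>>     ceph daemon osd.<id> perf dump | jq '.bluestore.bluestore_reads_with_retries'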
>> 
>> Even a single occurrence isn't expected - this counter should always be equal to 0. And presumably the cluster is most exposed to the issue during peak hours; night is likely not the peak period. So please keep tracking...
>> 
>> 
>>> 
>>>> 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.
>>> The memory use of those nodes is pretty constant, with ~6 GB free and ~25 GB available out of 256 GB.
>>> There are also only a handful of pages being swapped, if any.
>>> 
>>>> a hypothesis for why only mon hosts are affected - higher memory utilization on these nodes is what causes the disk read failures to appear. RAM leakage (or excessive utilization) in MON processes, or something like that?
>>> Since the memory usage is rather constant, I'm not sure this is the case; I think
>>> we would see more of an up/down pattern. However, we are not yet monitoring all
>>> processes. That is something I'd like to get some data on, but I'm not sure it's
>>> the right course of action at the moment.
>> 
>> Given that colocation with monitors is probably the clue, I suggest tracking the MON and OSD processes at least.
>> 
>> And high memory pressure is just a working hypothesis for the root cause of these disk failures. Something else (e.g. high disk utilization) might be another trigger, or the hypothesis might just be wrong...
>> 
>> So please just pay some attention to this.
>> 
>>> 
>>> What do you think - is it still plausible that we are seeing a memory utilization
>>> problem, even though there's little variance in the memory usage patterns?
>>> 
>>> The approaches we are currently considering are upgrading our kernel and lowering
>>> the memory target somewhat.
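>>> 
>>> Lowering the target would be something along these lines; the value shown is
>>> only a placeholder, not a figure we have settled on:
>>> 
>>>     ceph config set osd osd_memory_target 4294967296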
>>> 
>>> Cheers,
>>> 
>>> Denis
>>> 
>>> 
>>>> On 26 Aug 2020, at 15:29, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>> 
>>>> Hi Denis,
>>>> 
>>>> this reminds me of the following ticket: https://tracker.ceph.com/issues/37282
>>>> 
>>>> Please note they mentioned co-location with mon in comment #29.
>>>> 
>>>> 
>>>> The working hypothesis for this ticket is interim disk read failures causing the RocksDB checksum failures. Earlier we observed such a problem on the main device. Presumably it's heavy memory pressure that causes the kernel to fail this way. See my comment #38 there.
>>>> 
>>>> So I'd like to see answers/comments for the following questions:
>>>> 
>>>> 0) What is the backing disk layout for the OSDs in question (main device type? additional DB/WAL devices?).
>>>> 
>>>> 1) Please check all the existing logs for OSDs at "failing" nodes for other checksum errors (as per my comment #38)
>>>> 
>>>> 2) Check if BlueFS spillover is observed for any failing OSDs.
>>>> 
>>>> 3) Check "bluestore_reads_with_retries" performance counters for all OSDs at nodes in question. See comments 38-42 on the details. Any non-zero values?
>>>> 
>>>> 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Igor
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 8/26/2020 3:47 PM, Denis Krienbühl wrote:
>>>>> Hi!
>>>>> 
>>>>> We've recently upgraded all our clusters from Mimic to Octopus (15.2.4). Since
>>>>> then, our largest cluster is experiencing random crashes on OSDs attached to the
>>>>> mon hosts.
>>>>> 
>>>>> This is the crash we are seeing (cut for brevity, see links in post scriptum):
>>>>> 
>>>>>    {
>>>>>        "ceph_version": "15.2.4",
>>>>>        "utsname_release": "4.15.0-72-generic",
>>>>>        "assert_condition": "r == 0",
>>>>>        "assert_func": "void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)",
>>>>>        "assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc",
>>>>>        "assert_line": 11430,
>>>>>        "assert_thread_name": "bstore_kv_sync",
>>>>>        "assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fc56311a700 time 2020-08-26T08:52:24.917083+0200\n/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: 11430: FAILED ceph_assert(r == 0)\n",
>>>>>        "backtrace": [
>>>>>            "(()+0x12890) [0x7fc576875890]",
>>>>>            "(gsignal()+0xc7) [0x7fc575527e97]",
>>>>>            "(abort()+0x141) [0x7fc575529801]",
>>>>>            "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x559ef9ae97b5]",
>>>>>            "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x559ef9ae993f]",
>>>>>            "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) [0x559efa0245b0]",
>>>>>            "(BlueStore::_kv_sync_thread()+0xbdd) [0x559efa07745d]",
>>>>>            "(BlueStore::KVSyncThread::entry()+0xd) [0x559efa09cd3d]",
>>>>>            "(()+0x76db) [0x7fc57686a6db]",
>>>>>            "(clone()+0x3f) [0x7fc57560a88f]"
>>>>>        ]
>>>>>    }
>>>>> 
>>>>> Right before the crash occurs, we see the following message in the crash log:
>>>>> 
>>>>>        -3> 2020-08-26T08:52:24.787+0200 7fc569b2d700  2 rocksdb: [db/db_impl_compaction_flush.cc:2212] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102  in db/815839.sst offset 67107066 size 3808, Accumulated background error counts: 1
>>>>>        -2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102  in db/815839.sst offset 67107066 size 3808 code = 2 Rocksdb transaction:
>>>>> 
>>>>> In short, when this happens we see a RocksDB corruption error after a background compaction.
>>>>> 
>>>>> When an OSD crashes, which happens about 10-15 times a day, it restarts and
>>>>> resumes work without any further problems.
>>>>> 
>>>>> We are pretty confident that this is not a hardware issue, due to the following facts:
>>>>> 
>>>>> * The crashes occur on 5 different hosts over 3 different racks.
>>>>> * There is no smartctl/dmesg output that could explain it.
>>>>> * It usually happens to a different OSD each time, one that has not crashed before.
>>>>> 
>>>>> Still, we checked the following on a few OSDs/hosts (rough commands below the list):
>>>>> 
>>>>> * We can do a manual compaction, both offline and online.
>>>>> * We successfully ran "ceph-bluestore-tool fsck --deep yes" on one of the OSDs.
>>>>> * We manually compacted a number of OSDs, one of which crashed hours later.
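>>>>> 
>>>>> Roughly, those checks were the following; paths and OSD ids vary, so take this
>>>>> as a sketch of what we ran rather than the exact commands:
>>>>> 
>>>>>     ceph daemon osd.<id> compact                                         # online compaction via the admin socket
>>>>>     ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact   # offline compaction, with the OSD stopped
>>>>>     ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id> --deep yes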
>>>>> 
>>>>> The only thing we have noticed so far: It only happens to OSDs that are attached
>>>>> to a mon host. *None* of the non-mon host OSDs have had a crash!
>>>>> 
>>>>> Does anyone have a hint what could be causing this? We currently have no good
>>>>> theory that could explain this, much less have a fix or workaround.
>>>>> 
>>>>> Any help would be greatly appreciated.
>>>>> 
>>>>> Denis
>>>>> 
>>>>> Crash: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/meta.txt
>>>>> Log: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/log.txt
>>>>> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



