Random Crashes on OSDs Attached to Mon Hosts with Octopus

Hi!

We've recently upgraded all our clusters from Mimic to Octopus (15.2.4). Since
then, our largest cluster is experiencing random crashes on OSDs attached to the
mon hosts.

This is the crash we are seeing (cut for brevity, see links in post scriptum):

   {
       "ceph_version": "15.2.4",
       "utsname_release": "4.15.0-72-generic",
       "assert_condition": "r == 0",
       "assert_func": "void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)",
       "assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc <http://bluestore.cc/>",
       "assert_line": 11430,
       "assert_thread_name": "bstore_kv_sync",
       "assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc <http://bluestore.cc/>: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fc56311a700 time 2020-08-26T08:52:24.917083+0200\n/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc <http://bluestore.cc/>: 11430: FAILED ceph_assert(r == 0)\n",
       "backtrace": [
           "(()+0x12890) [0x7fc576875890]",
           "(gsignal()+0xc7) [0x7fc575527e97]",
           "(abort()+0x141) [0x7fc575529801]",
           "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x559ef9ae97b5]",
           "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x559ef9ae993f]",
           "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) [0x559efa0245b0]",
           "(BlueStore::_kv_sync_thread()+0xbdd) [0x559efa07745d]",
           "(BlueStore::KVSyncThread::entry()+0xd) [0x559efa09cd3d]",
           "(()+0x76db) [0x7fc57686a6db]",
           "(clone()+0x3f) [0x7fc57560a88f]"
       ]
   }
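
(The report above is the kind of record the crash module keeps for each of these events; something like the following should list and dump them, where the crash id is just a placeholder:

    ceph crash ls
    ceph crash info <crash-id>
)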

Right before the crash occurs, we see the following messages in the crash log:

       -3> 2020-08-26T08:52:24.787+0200 7fc569b2d700  2 rocksdb: [db/db_impl_compaction_flush.cc:2212] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102  in db/815839.sst offset 67107066 size 3808, Accumulated background error counts: 1
       -2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102  in db/815839.sst offset 67107066 size 3808 code = 2 Rocksdb transaction:

In short, whenever this happens, RocksDB reports a block checksum mismatch (corruption) after a background compaction; the subsequent transaction submit then fails, which trips the ceph_assert(r == 0) in _txc_apply_kv.

When an OSD crashes, which happens about 10-15 times a day, it restarts and
resumes work without any further problems.

We are pretty confident that this is not a hardware issue, due to the following facts:

* The crashes occur on 5 different hosts across 3 different racks.
* There is no smartctl/dmesg output that could explain it.
* Each crash usually hits a different OSD, one that has not crashed before.

Still, we checked the following on a few OSDs/hosts (commands roughly sketched below):

* We can do a manual compaction, both offline and online.
* We successfully ran "ceph-bluestore-tool fsck --deep yes" on one of the OSDs.
* We manually compacted a number of OSDs, one of which still crashed hours later.
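
For completeness, these checks were roughly of the following form (OSD id and data path are placeholders, not the exact invocations we ran):

    # online compaction, OSD running
    ceph tell osd.<id> compact

    # offline compaction, OSD stopped
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact

    # deep fsck, OSD stopped
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id> --deep yes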

The only pattern we have noticed so far: it only happens to OSDs that are attached
to a mon host. *None* of the OSDs on non-mon hosts have had a crash!

Does anyone have a hint as to what could be causing this? We currently have no good
theory that explains it, much less a fix or workaround.

Any help would be greatly appreciated.

Denis

Crash: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/meta.txt
Log: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/log.txt

