Hi all,
we have been getting some assistance with our SSD crash issue outside
of this mailing list - the issue itself is not resolved yet
(http://tracker.ceph.com/issues/38395) - but there's one thing I'd
like to ask the list.
I noticed that a lot of the OSD crashes show a correlation to MON
elections: for the last 18 OSD failures I count 7 MON elections
happening right before the OSD failure is reported. But if I take
into account the grace period of 20 seconds before an OSD is reported
down, the OSD must have actually failed before the election was
called, so it seems as if some OSD failures could even be triggering
a MON election. Is that even possible?
The logs look like this:
---cut here---
2019-03-02 21:43:17.599452 mon.monitor02 mon.1 <ADDRESS>:6789/0 977222
: cluster [INF] mon.monitor02 calling monitor election
2019-03-02 21:43:17.758506 mon.monitor01 mon.0 <ADDRESS>:6789/0
1079594 : cluster [INF] mon.monitor01 calling monitor election
2019-03-02 21:43:22.938084 mon.monitor01 mon.0 <ADDRESS>:6789/0
1079595 : cluster [INF] mon.monitor01 is new leader, mons
monitor01,monitor02 in quorum (ranks 0,1)
2019-03-02 21:43:23.106667 mon.monitor01 mon.0 <ADDRESS>:6789/0
1079600 : cluster [WRN] Health check failed: 1/3 mons down, quorum
monitor01,monitor02 (MON_DOWN)
2019-03-02 21:43:23.180382 mon.monitor01 mon.0 <ADDRESS>:6789/0
1079601 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum
monitor01,monitor02
2019-03-02 21:43:27.454252 mon.monitor01 mon.0 <ADDRESS>:6789/0
1079610 : cluster [INF] osd.20 failed (root=default,host=monitor03) (2
reporters from different host after 20.000136 >= grace 20.000000)
[...]
2019-03-04 10:06:35.743561 mon.monitor01 mon.0 <ADDRESS>:6789/0
1164043 : cluster [INF] mon.monitor01 calling monitor election
2019-03-04 10:06:35.752565 mon.monitor02 mon.1 <ADDRESS>:6789/0
1054674 : cluster [INF] mon.monitor02 calling monitor election
2019-03-04 10:06:35.835435 mon.monitor01 mon.0 <ADDRESS>:6789/0
1164044 : cluster [INF] mon.monitor01 is new leader, mons
monitor01,monitor02,monitor03 in quorum (ranks 0,1,2)
2019-03-04 10:06:35.701759 mon.monitor03 mon.2 <ADDRESS>:6789/0 287652
: cluster [INF] mon.monitor03 calling monitor election
2019-03-04 10:06:35.954407 mon.monitor01 mon.0 <ADDRESS>:6789/0
1164049 : cluster [INF] overall HEALTH_OK
2019-03-04 10:06:45.299686 mon.monitor01 mon.0 <ADDRESS>:6789/0
1164057 : cluster [INF] osd.20 failed (root=default,host=monitor03) (2
reporters from different host after 20.068848 >= grace 20.000000)
[...]
---cut here---
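For reference, this is roughly how I pulled these events out of the
cluster log and checked the relevant settings (just a quick sketch;
it assumes the default cluster log location on the MON host and uses
osd.20 as an example):
---cut here---
# list election and OSD-failure events side by side to compare the timing
grep -E 'calling monitor election|osd\.[0-9]+ failed' /var/log/ceph/ceph.log

# show the grace and down-reporter settings on one of the affected OSDs
ceph daemon osd.20 config show | grep -E 'osd_heartbeat_grace|mon_osd_min_down_reporters|mon_osd_reporter_subtree_level'
---cut here---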
These MON elections only happened when an OSD failure occurred; there
were no elections without OSD failures. Does this make sense to
anybody? Any insights would be greatly appreciated.
Regards,
Eugen
Quoting Igor Fedotov <ifedotov@xxxxxxx>:
Hi Eugen,
Looks like this isn't [1] but rather
https://tracker.ceph.com/issues/38049
and
https://tracker.ceph.com/issues/36541 (=
https://tracker.ceph.com/issues/36638 for luminous).
Hence it's not fixed in 12.2.10; the target release is 12.2.11.
Also please note that the patch only prevents new occurrences of the
issue. There is a chance that inconsistencies it caused earlier are
still present in the DB, and the assertion might still happen
(hopefully less frequently).
So could you please run fsck on the OSDs that have crashed at least
once and share the results?
Then we can decide whether it makes sense to proceed with a repair.
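Something like this should do, just as an example assuming osd.20 and
the default data path (the OSD has to be stopped while fsck runs):
---cut here---
# stop the OSD so its BlueStore volume is not in use
systemctl stop ceph-osd@20

# run the consistency check; add --deep for a more thorough (but much slower) scan
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-20

# bring the OSD back online afterwards
systemctl start ceph-osd@20
---cut here---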
Thanks,
Igor
On 2/7/2019 3:37 PM, Eugen Block wrote:
Hi list,
I found this thread [1] about crashing SSD OSDs. Although that was
about an upgrade to 12.2.7, we just hit (probably) the same issue
after our update to 12.2.10 two days ago in a production cluster.
Just half an hour ago I saw one OSD (SSD) crash for the first time:
2019-02-07 13:02:07.682178 mon.host1 mon.0 <IP>:6789/0 109754 :
cluster [INF] osd.10 failed (root=default,host=host1) (connection
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 <IP>:6789/0 109771 :
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
One minute later, the OSD was back online.
This is the stack trace reported in syslog:
---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: ***
Caught signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]: in
thread 7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]: ceph
version 12.2.10-544-gb10c702661
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]: 1:
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]: 2:
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]: 3:
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]: 4:
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]: 5:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]: 6:
(interval_set<unsigned long, btree::btree_map<unsigned long,
unsigned long, std::less<unsigned long>,
mempool::pool_allocator<(mempool::pool_index_t)1,
std::pair<unsigned long const, unsigned long> >, 256>
>::insert(unsigned long, unsigned long, unsigned long*, unsigned
long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]: 7:
(StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126)
[0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]: 8:
(StupidAllocator::release(unsigned long, unsigned long)+0x7d)
[0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]: 9:
(BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72)
[0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]: 10:
(BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7)
[0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]: 11:
(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6)
[0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]: 12:
(BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]: 13:
(BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]: 14:
(()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]: 15:
(clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]:
2019-02-07 13:01:51.185833 7f75ce646700 -1 *** Caught signal
(Aborted) **
---cut here---
Is there anything we can do about this? The issue in [1] doesn't
seem to be resolved yet. Debug logging is not enabled, so I don't
have more detailed information beyond the full stack trace from the
OSD log. Any help is appreciated!
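If more verbose logs would help, I could raise the debug level on the
affected OSD for the next occurrence, roughly like this (just a
sketch, using osd.10 as an example):
---cut here---
# temporarily increase BlueStore and BlueFS debug output at runtime
ceph tell osd.10 injectargs '--debug_bluestore 20 --debug_bluefs 20'

# ...and turn it back down afterwards to keep the log size manageable
ceph tell osd.10 injectargs '--debug_bluestore 1/5 --debug_bluefs 1/5'
---cut here---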
Regards,
Eugen
[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com