Hi list,
I found this thread [1] about crashing SSD OSDs. Although that thread
was about an upgrade to 12.2.7, we have (probably) just hit the same
issue in a production cluster after updating it to 12.2.10 two days ago.
Half an hour ago I saw one OSD (an SSD) crash for the first time:
2019-02-07 13:02:07.682178 mon.host1 mon.0 <IP>:6789/0 109754 :
cluster [INF] osd.10 failed (root=default,host=host1) (connection
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 <IP>:6789/0 109771 :
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
One minute later, the OSD was back online.
This is the stack trace reported in syslog:
---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: *** Caught
signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]: in thread
7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]: ceph
version 12.2.10-544-gb10c702661
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]: 1:
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]: 2:
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]: 3:
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]: 4:
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]: 5:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]: 6:
(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned
long, std::less<unsigned long>,
mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned
long const, unsigned long> >, 256> >::insert(unsigned long, unsigned
long, unsigned long*, unsigned long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]: 7:
(StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126)
[0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]: 8:
(StupidAllocator::release(unsigned long, unsigned long)+0x7d)
[0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]: 9:
(BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72)
[0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]: 10:
(BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7)
[0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]: 11:
(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6)
[0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]: 12:
(BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]: 13:
(BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]: 14:
(()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]: 15:
(clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]: 2019-02-07
13:01:51.185833 7f75ce646700 -1 *** Caught signal (Aborted) **
---cut here---
Is there anything we can do about this? The issue in [1] doesn't seem
to be resolved yet. Debug logging is not enabled, so apart from the
full stack trace in the OSD log I don't have any more detailed
information. Any help is appreciated!
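In the meantime I'm thinking about raising the debug levels on that
OSD so more detail is captured if it crashes again. Something like
this (the levels are just my guess at what would be useful):

  # raise BlueStore/BlueFS debug levels on the affected OSD at runtime
  ceph tell osd.10 injectargs '--debug_bluestore 20 --debug_bluefs 20'

  # or persistently in ceph.conf on host1, to survive an OSD restart
  [osd]
  debug bluestore = 20
  debug bluefs = 20

As far as I understand, injectargs only affects the running daemon, so
the ceph.conf entries would be needed to keep the levels across a
restart. Please let me know if logging at that level would be too
expensive for a production SSD OSD.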
Regards,
Eugen
[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html