Hi list,
I found this thread [1] about crashing SSD OSDs. Although that thread
was about an upgrade to 12.2.7, we have (probably) just hit the same
issue in a production cluster after updating it to 12.2.10 two days ago.
Half an hour ago I saw one OSD (an SSD) crash for the first time:
2019-02-07 13:02:07.682178 mon.host1 mon.0 <IP>:6789/0 109754 :
cluster [INF] osd.10 failed (root=default,host=host1) (connection
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 <IP>:6789/0 109771 :
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
One minute later, the OSD was back online.
This is the stack trace reported in syslog:
---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: *** Caught
signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]: in thread
7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]: ceph
version 12.2.10-544-gb10c702661
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]: 1:
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]: 2:
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]: 3:
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]: 4:
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]: 5:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]: 6:
(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned
long, std::less<unsigned long>,
mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned
long const, unsigned long> >, 256> >::insert(unsigned long, unsigned
long, unsigned long*, unsigned long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]: 7:
(StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126)
[0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]: 8:
(StupidAllocator::release(unsigned long, unsigned long)+0x7d)
[0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]: 9:
(BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72)
[0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]: 10:
(BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7)
[0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]: 11:
(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6)
[0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]: 12:
(BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]: 13:
(BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]: 14:
(()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]: 15:
(clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]: 2019-02-07
13:01:51.185833 7f75ce646700 -1 *** Caught signal (Aborted) **
---cut here---
Is there anything we can do about this? The issue in [1] doesn't seem
to be resolved yet. Debug logging is not enabled, so apart from the
full stack trace in the OSD log I don't have any more detailed
information. Any help is appreciated!
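In the meantime I'm thinking about raising the debug levels on that
OSD so more detail is captured if it crashes again. Something like
this (the levels are just my guess at what would be useful):

  # raise BlueStore/BlueFS debug levels on the affected OSD at runtime
  ceph tell osd.10 injectargs '--debug_bluestore 20 --debug_bluefs 20'

  # or persistently in ceph.conf on host1, to survive an OSD restart
  [osd]
  debug bluestore = 20
  debug bluefs = 20

As far as I understand, injectargs only affects the running daemon, so
the ceph.conf entries would be needed to keep the levels across a
restart. Please let me know if logging at that level would be too
expensive for a production SSD OSD.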
Regards,
Eugen
[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html