Re: SSD OSD crashing after upgrade to 12.2.10

Hi Eugen,

Looks like this isn't [1] but rather

https://tracker.ceph.com/issues/38049

and

https://tracker.ceph.com/issues/36541 (= https://tracker.ceph.com/issues/36638 for luminous)

Hence it's not fixed in 12.2.10; the target release is 12.2.11.


Also please note that the patch only prevents new occurrences of the issue. There is some chance that inconsistencies it caused earlier are still present in the DB, so the assertion might still happen (hopefully less frequently).

So could you please run fsck on the OSDs that have crashed at least once and share the results?

Then we can decide if it makes sense to proceed with the repair.
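
Roughly like this (just a sketch, adjust the OSD id and data path to your environment; I'm using osd.10 from your log as an example, and the OSD has to be stopped while the tool runs):

  # stop the affected OSD first, ceph-bluestore-tool needs exclusive access to the store
  systemctl stop ceph-osd@10

  # consistency check of the BlueStore metadata; please share the output
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-10

  # bring the OSD back afterwards
  systemctl start ceph-osd@10

If we then agree a repair makes sense, the same tool has a 'repair' command that is run the same way.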


Thanks,

Igor

On 2/7/2019 3:37 PM, Eugen Block wrote:
Hi list,

I found this thread [1] about crashing SSD OSDs. Although that was about an upgrade to 12.2.7, we have (probably) hit the same issue after our update to 12.2.10 two days ago in a production cluster.
Just half an hour ago I saw one OSD (SSD) crash (for the first time):

2019-02-07 13:02:07.682178 mon.host1 mon.0 <IP>:6789/0 109754 : cluster [INF] osd.10 failed (root=default,host=host1) (connection refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 <IP>:6789/0 109771 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)

One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: *** Caught signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]:  in thread 7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]:  ceph version 12.2.10-544-gb10c702661 (b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]:  1: (()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]:  2: (()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]:  3: (gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]:  4: (abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]:  6: (interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::insert(unsigned long, unsigned long, unsigned long*, unsigned long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]:  7: (StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126) [0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]:  8: (StupidAllocator::release(unsigned long, unsigned long)+0x7d) [0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]:  9: (BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72) [0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]:  10: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7) [0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]:  11: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6) [0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]:  12: (BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]:  13: (BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]:  14: (()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]:  15: (clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]: 2019-02-07 13:01:51.185833 7f75ce646700 -1 *** Caught signal (Aborted) **
---cut here---

Is there anything we can do about this? The issue in [1] doesn't seem to be resolved yet. Debug logging is not enabled, so I don't have more detailed information except the full stack trace from the OSD log. Any help is appreciated!
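
If more detail is needed the next time this happens, I assume we could temporarily raise the debug levels on the affected OSD, something along these lines (just a sketch for osd.10, the OSD that crashed here; these settings are very verbose):

  # temporarily raise BlueStore/BlueFS logging on the running OSD
  ceph tell osd.10 injectargs '--debug_bluestore 20 --debug_bluefs 20'

  # revert to the (default) levels once enough has been captured
  ceph tell osd.10 injectargs '--debug_bluestore 1/5 --debug_bluefs 1/5'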

Regards,
Eugen

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com