Re: SSD OSD crashing after upgrade to 12.2.10

Eugen Block <eblock@xxxxxx> · Thu, 07 Feb 2019 15:06:17 +0000

First, you should upgrade to 12.2.11 (or bring in the mentioned
patch by other means) to fix the rename procedure, which will prevent
new inconsistent objects from appearing in the DB. This should at least
reduce the OSD crash frequency.

We'll have to wait until 12.2.11 is available for openSUSE, I'm not  
sure how long it will take.

So I'd like to have an fsck report to verify that. It doesn't matter
whether you run fsck before or after the upgrade.

Once we have the fsck report we can proceed with the repair, which is a
bit risky, so maybe I should try to simulate the inconsistency in
question and check whether the built-in repair handles it properly.
We'll see; let's get the fsck report first.

I'll try to run the fsck today, but I have to wait until there are fewer
clients active. Depending on the log file size, would it be okay to
attach it to an email and send it directly to you, or what is the best
procedure for you?

Thanks for your support!
Eugen

Quoting Igor Fedotov <ifedotov@xxxxxxx>:

Eugen,

First, you should upgrade to 12.2.11 (or bring in the mentioned
patch by other means) to fix the rename procedure, which will prevent
new inconsistent objects from appearing in the DB. This should at least
reduce the OSD crash frequency.

Second, previous crashes could theoretically have left persistent
inconsistent objects in your DB. I haven't seen that in real life
before, but they may exist. We need to check. If so, OSD crashes
might still occur.

So I'd like to have an fsck report to verify that. It doesn't matter
whether you run fsck before or after the upgrade.

Once we have the fsck report we can proceed with the repair, which is a
bit risky, so maybe I should try to simulate the inconsistency in
question and check whether the built-in repair handles it properly.
We'll see; let's get the fsck report first.

W.r.t. running ceph-bluestore-tool: you might want to specify a log
file and increase the log level to 20 using the --log-file and
--log-level options.
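
For illustration, the fsck invocation with the logging options mentioned above could look roughly like the sketch below. The OSD id (10), the log file location, and the systemd unit name are assumptions made for this example and are not taken from the thread; the script only prints the commands so they can be reviewed before anything is run on a production OSD.

```shell
#!/bin/sh
# Sketch only: OSD id 10 and the log path are example values.

# Build the fsck command line for one OSD.
# $1 = OSD id, $2 = file the tool should log to
fsck_cmd() {
    printf 'ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-%s --log-file %s --log-level 20\n' "$1" "$2"
}

OSD_ID="${OSD_ID:-10}"
LOG_FILE="${LOG_FILE:-/var/log/ceph/ceph-osd.${OSD_ID}.fsck.log}"

# The tool needs exclusive access to the store, so the OSD has to be
# stopped while fsck runs. The unit name is an assumption about the
# deployment; adjust it if your OSDs are managed differently.
echo "systemctl stop ceph-osd@${OSD_ID}"
fsck_cmd "$OSD_ID" "$LOG_FILE"
echo "systemctl start ceph-osd@${OSD_ID}"
```

At log level 20 the log file can grow large, so it is worth pointing --log-file at a filesystem with enough free space.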

On 2/7/2019 4:45 PM, Eugen Block wrote:
Hi Igor,

thanks for the quick response!
Just to make sure I don't misunderstand, and because it's a  
production cluster:
before anything else I should run fsck on that OSD? Depending on  
the result we'll decide how to continue, right?
Is there anything else to be enabled for that command or can I  
simply run 'ceph-bluestore-tool fsck --path  
/var/lib/ceph/osd/ceph-<ID>'?

Any other obstacles I should be aware of when running fsck?

Thanks!
Eugen

Quoting Igor Fedotov <ifedotov@xxxxxxx>:

Hi Eugen,

looks like this isn't [1] but rather

https://tracker.ceph.com/issues/38049

and

https://tracker.ceph.com/issues/36541 (=  
https://tracker.ceph.com/issues/36638 for luminous)

Hence it's not fixed in 12.2.10; the target release is 12.2.11.

Also please note that the patch prevents new occurrences of the
issue, but there is some chance that inconsistencies it caused
earlier are still present in the DB. The assertion might therefore
still happen (hopefully less frequently).

So could you please run fsck on the OSDs that have crashed at least
once and share the results?

Then we can decide if it makes sense to proceed with the repair.
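
To find out which OSDs have gone down at least once, one option is to count the "osd.N failed" messages the monitor writes to the cluster log (the format shown in the original report further down). This is a minimal sketch; the default path /var/log/ceph/ceph.log and the helper name count_failed_osds are assumptions for this example. Note that a failure report does not by itself prove the OSD hit this particular assertion, so the OSD's own log should still be checked for the stack trace.

```shell
#!/bin/sh
# count_failed_osds FILE
# Print "<count> osd.<id>" for every OSD the monitor logged as failed,
# most frequent first. Only the "osd.N failed" part of a line is
# matched, so the reporting OSD ("reported by osd.M") is not counted.
count_failed_osds() {
    grep -o 'osd\.[0-9][0-9]* failed' "$1" | awk '{print $1}' | sort | uniq -c | sort -rn
}

# Assumed default location of the cluster log on a monitor host.
CLUSTER_LOG="${CLUSTER_LOG:-/var/log/ceph/ceph.log}"
if [ -r "$CLUSTER_LOG" ]; then
    count_failed_osds "$CLUSTER_LOG"
fi
```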

Thanks,

Igor

On 2/7/2019 3:37 PM, Eugen Block wrote:
Hi list,

I found this thread [1] about crashing SSD OSDs. Although that
was about an upgrade to 12.2.7, we just hit (probably) the same
issue after our update to 12.2.10 two days ago in a production
cluster.
Just half an hour ago I saw one OSD (SSD) crashing (for the first time):

2019-02-07 13:02:07.682178 mon.host1 mon.0 <IP>:6789/0 109754 :  
cluster [INF] osd.10 failed (root=default,host=host1) (connection  
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 <IP>:6789/0 109771 :  
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)

One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: ***  
Caught signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]:  in  
thread 7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]: ceph  
version 12.2.10-544-gb10c702661  
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]:  1:  
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]:  2:  
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]:  3:  
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]:  4:  
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]:  5:  
(ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]:  6:  
(interval_set<unsigned long, btree::btree_map<unsigned long,  
unsigned long, std::less<unsigned long>,  
mempool::pool_allocator<(mempool::pool_index_t)1,  
std::pair<unsigned long const, unsigned long> >, 256>  
>::insert(unsigned long, unsigned long, unsigned long*, unsigned  
long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]:  7:  
(StupidAllocator::_insert_free(unsigned long, unsigned  
long)+0x126) [0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]:  8:  
(StupidAllocator::release(unsigned long, unsigned long)+0x7d)  
[0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]:  9:  
(BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72)  
[0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]:  10:  
(BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7)  
[0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]:  11:  
(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6)  
[0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]:  12:  
(BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]:  13:  
(BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]:  14:  
(()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]:  15:  
(clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]:  
2019-02-07 13:01:51.185833 7f75ce646700 -1 *** Caught signal  
(Aborted) **
---cut here---

Is there anything we can do about this? The issue in [1] doesn't
seem to be resolved yet. Debug logging is not enabled, so I
don't have any more detailed information than the full stack trace
from the OSD log. Any help is appreciated!

Regards,
Eugen

[1]  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
