Thank you! Setting norecover seems to have worked in terms of keeping the
OSDs up. I am glad my logs were of use in tracking this down. I am looking
forward to future updates. Let me know if you need anything else.

--
Adam

On Thu, Apr 5, 2018 at 10:13 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> On 04/05/2018 08:11 PM, Josh Durgin wrote:
>>
>> On 04/05/2018 06:15 PM, Adam Tygart wrote:
>>>
>>> Well, the cascading crashes are getting worse. I'm routinely seeing
>>> 8-10 of my 518 OSDs crash. I cannot start 2 of them without triggering
>>> 14 or so of them to crash repeatedly for more than an hour.
>>>
>>> I've run another one of them with more logging, debug osd = 20; debug
>>> ms = 1 (definitely more than one crash in there):
>>> http://people.cs.ksu.edu/~mozes/ceph-osd.422.log
>>>
>>> Anyone have any thoughts? My cluster feels like it is getting more and
>>> more unstable by the hour...
>>
>> Thanks to your logs, I think I've found the root cause. It looks like a
>> bug in the EC recovery code that's triggered by EC overwrites. I'm
>> working on a fix.
>>
>> For now I'd suggest setting the noout and norecover flags to avoid
>> hitting this bug any more by avoiding recovery. Backfilling with no
>> client I/O would also avoid the bug.
>
> I forgot to mention the tracker ticket for this bug is:
> http://tracker.ceph.com/issues/23195
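
P.S. For the list archives, the flags Josh suggested can be set with the
standard CLI. A minimal sketch (adapt to your own cluster):

  # stop OSDs from being marked out, and pause recovery cluster-wide
  ceph osd set noout
  ceph osd set norecover

  # confirm the flags took effect
  ceph osd dump | grep flags

  # once a fixed build is running, restore normal behavior
  ceph osd unset norecover
  ceph osd unset noout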
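
P.P.S. For anyone wanting to capture a log like the one above: those debug
settings can go in ceph.conf under [osd], or be injected at runtime. A rough
sketch (osd.422 matches the log I posted; substitute your own OSD id):

  # ceph.conf on the affected host
  [osd]
      debug osd = 20
      debug ms = 1

  # or at runtime, without restarting the daemon
  ceph tell osd.422 injectargs '--debug-osd 20 --debug-ms 1'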