I set this about 15 minutes ago, with the following:

ceph tell osd.* injectargs '--osd-recovery-max-single-start 1 --osd-recovery-max-active 1'
ceph osd unset noout
ceph osd unset norecover

I also set those settings in ceph.conf, just in case the "not observed"
response was true.

Things have been stable, no segfaults at all, and recovery is happening.

Thanks for your hard work on this. I'll follow up if anything else crops up.

--
Adam

On Fri, Apr 6, 2018 at 11:26 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> You should be able to avoid the crash by setting:
>
> osd recovery max single start = 1
> osd recovery max active = 1
>
> With that, you can unset norecover to let recovery start again.
>
> A fix so you don't need those settings is here:
> https://github.com/ceph/ceph/pull/21273
>
> If you see any other backtraces let me know - especially the
> complete_read_op one from http://tracker.ceph.com/issues/21931
>
> Josh
>
> On 04/05/2018 08:25 PM, Adam Tygart wrote:
>>
>> Thank you! Setting norecover seems to have worked in terms of keeping
>> the OSDs up. I am glad my logs were of use in tracking this down. I am
>> looking forward to future updates.
>>
>> Let me know if you need anything else.
>>
>> --
>> Adam
>>
>> On Thu, Apr 5, 2018 at 10:13 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>>>
>>> On 04/05/2018 08:11 PM, Josh Durgin wrote:
>>>>
>>>> On 04/05/2018 06:15 PM, Adam Tygart wrote:
>>>>>
>>>>> Well, the cascading crashes are getting worse. I'm routinely seeing
>>>>> 8-10 of my 518 OSDs crash. I cannot start 2 of them without triggering
>>>>> 14 or so of them to crash repeatedly for more than an hour.
>>>>>
>>>>> I've run another one of them with more logging, debug osd = 20; debug
>>>>> ms = 1 (definitely more than one crash in there):
>>>>> http://people.cs.ksu.edu/~mozes/ceph-osd.422.log
>>>>>
>>>>> Anyone have any thoughts? My cluster feels like it is getting more and
>>>>> more unstable by the hour...
>>>>
>>>> Thanks to your logs, I think I've found the root cause. It looks like a
>>>> bug in the EC recovery code that's triggered by EC overwrites. I'm
>>>> working on a fix.
>>>>
>>>> For now I'd suggest setting the noout and norecover flags to avoid
>>>> hitting this bug any more by avoiding recovery. Backfilling with no
>>>> client I/O would also avoid the bug.
>>>
>>> I forgot to mention the tracker ticket for this bug is:
>>> http://tracker.ceph.com/issues/23195
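
For reference, a minimal sketch of how the recovery settings discussed above
might look when made persistent in ceph.conf. This assumes they are placed
under the [osd] section (they would also work under [global]); it is not taken
verbatim from the original thread:

    [osd]
    osd recovery max single start = 1
    osd recovery max active = 1

Values in ceph.conf are only read at daemon startup, which is why the running
OSDs were updated with 'ceph tell osd.* injectargs ...' as well.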