Re: Issue #5876 : assertion failure in rbd_img_obj_callback()

On 04/05/2014 03:09 AM, Olivier Bonvalet wrote:
> On Friday, 04 April 2014 at 20:57 -0500, Alex Elder wrote:
>> On 04/04/2014 08:16 PM, Olivier Bonvalet wrote:
>>> On Tuesday, 25 March 2014 at 09:39 +0100, Olivier Bonvalet wrote:
>>>> Hi,
>>>>
>>>> What can/should I do to help fix this problem?
>>>>
>>>> For now, the RBD kernel client hangs on:
>>>>         Assertion failure in rbd_img_obj_callback() at line 2131:
>>>>            rbd_assert(which >= img_request->next_completion);
>>>>
>>>> or on:
>>>>         Assertion failure in rbd_img_obj_callback() at line 2127:
>>>>             rbd_assert(img_request != NULL);
>>>>
>>>>
>>>> I hit both cases at least once per week, on the latest 3.13.5 kernels.
>>>>
>>>> It seems the problem occurs only on the more heavily loaded servers
>>>> (I have 4 nearly identical servers, and the crash occurs on two of
>>>> them. If I move the VM, the crash follows...).
>>>>
>>>> Olivier
>>>>
>>>> --
>>>
>>> Hi,
>>>
>>> So, after some days without any problems, RBD crashed tonight:
>>
>> Unfortunately this could be a symptom of the same sort of race.
>> When an object request is removed from its image request's list
>> the request count gets decremented.  To be honest, all of these
>> assertions in rbd_img_obj_callback() are probably unsafe, at
>> least until I get the patch that does proper reference counting
>> implemented:
>>
>>         rbd_assert(img_request != NULL);
>>         rbd_assert(img_request->obj_request_count > 0);
>>         rbd_assert(which != BAD_WHICH);
>>         rbd_assert(which < img_request->obj_request_count);
>>
>> Until then I think you can avoid this by commenting out those
>> assertions.  I'm afraid there will remain a (smaller) window
>> of opportunity for a problem to occur, but I believe commenting
>> those out will help for now.
>>
>> I'm very sorry you're hitting these.  I'll see if I can get
>> a comprehensive fix this weekend.
>>
>> 					-Alex
> 
> Thanks for your help, really.
> 
> By removing those asserts, could I cause any data corruption?

Data corruption is no more likely with the asserts removed.

They should not fail, and in general they do not, so things
are working properly.  Because of this race condition we are
seeing them fail, on rare occasions.  I understand why this
is happening though, and when it does, this test should avoid
doing any invalid processing of the request:
        if (which != img_request->next_completion)
                goto out;
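
To illustrate why that test is enough to skip a stale or out-of-order
callback, here is a minimal user-space sketch. It is not the rbd.c
code: the structure, the done[] array, SKETCH_MAX_OBJS and the function
name are all hypothetical simplifications, and the locking that the
driver needs around this logic is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_OBJS 8	/* hypothetical limit, for the sketch only */

/* Hypothetical, simplified stand-in for an image request. */
struct img_request_sketch {
	uint32_t obj_request_count;	/* object requests in this image request */
	uint32_t next_completion;	/* index that must complete next, in order */
	uint8_t done[SKETCH_MAX_OBJS];	/* done[i] != 0 once request i finished */
};

/*
 * Record that object request 'which' has finished and return how many
 * object requests were completed in order by this call.
 */
static unsigned int obj_callback_sketch(struct img_request_sketch *img,
					uint32_t which)
{
	unsigned int completed = 0;

	/* Defensive checks standing in for the commented-out asserts. */
	if (img == NULL || img->obj_request_count > SKETCH_MAX_OBJS ||
	    which >= img->obj_request_count)
		return 0;

	img->done[which] = 1;

	/* Same test as in the message: only the next-in-line request proceeds. */
	if (which != img->next_completion)
		goto out;

	/* Complete this request and any later ones that already finished. */
	while (img->next_completion < img->obj_request_count &&
	       img->done[img->next_completion]) {
		img->next_completion++;
		completed++;
	}
out:
	return completed;
}
```

In this sketch, a callback for index 1 arriving before index 0 only
marks the request as done. The later callback for index 0 then
completes both, and a duplicate callback for index 0 after that falls
through to "out" without touching anything.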

					-Alex
