Re: Issue #5876 : assertion failure in rbd_img_obj_callback()

Ilya Dryomov <ilya.dryomov@xxxxxxxxxxx> · Wed, 26 Mar 2014 13:43:46 +0200

On Wed, Mar 26, 2014 at 1:13 PM, Alex Elder <elder@xxxxxxxx> wrote:
> On 03/26/2014 12:00 AM, Alex Elder wrote:
>> I think I've got it.  I'll explain, and we'll have to look at
>> the explanation more closely in the morning.
>>
>> Bottom line is that I think the assertion is bogus.  Like
>> the other assertion that was not done under protection of
>> a lock, this one is being done in a way that's not safe.
>>
>> First, here's a patch that I think might avoid the problem:
>>
>> ----------------------------------
>> Index: b/drivers/block/rbd.c
>> ===================================================================
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -2130,9 +2130,8 @@ static void rbd_img_obj_callback(struct
>>       rbd_assert(which < img_request->obj_request_count);
>>
>>       spin_lock_irq(&img_request->completion_lock);
>> -     if (which > img_request->next_completion)
>> +     if (which != img_request->next_completion)
>>               goto out;
>> -     rbd_assert(which == img_request->next_completion);
>>
>>       for_each_obj_request_from(img_request, obj_request) {
>>               rbd_assert(more);
>> ----------------------------------
>
>
> After some sleep (though not a lot), I remain convinced
> the above will fix the problem, for now.  But it's not
> quite enough.
>
> The problem arises out of the fact that the thread of
> control that marks an object request done is not necessarily
> the thread of control that processing the completion of
> that object (nor its associated image) request.  There are
> no duplicate requests and no duplicate responses.
>
>
> Object request completions are handled sequentially, based
> on the range of address space involved in the requests.
> This is the model required by the Linux blk_end_request()
> interface.  Let's assume we have a single image request
> that has two contiguous object requests, with the "lower"
> object request involving bytes ending at the point the
> "higher" one begins.
>
> These object requests can be completed by their OSDs in
> any order, and we will handle their completion in the
> order their acknowledgement is received.  If we process
> the higher request's acknowledgement first, that will
> be the first to enter rbd_img_obj_callback().  Since
> we will not have completed the lower one yet, nothing
> more is done.
>
> When the lower object request completes, it will enter
> rbd_img_obj_callback() and find that it *is* the
> next one we want completed.  So we'll enter a loop
> that calls rbd_img_obj_end_request() on this request,
> followed by its successor, doing completion processing
> on any request that has by now been marked done.  In
> the case I've described, the higher request will be
> marked done, so it is here that its completion processing
> is performed.
>
> Once the last object request has been completed, we
> call rbd_img_request_complete(), which will ultimately
> lead to dropping a reference on the image request,
> which leads to rbd_img_request_destroy().
>
>
> What's happening in Olivier's case is that the
> requests are actually completing *in order*, but
> nearly simultaneously.  The lower and higher request
> acknowledgements from their OSDs arrive, and their
> consequent calls to rbd_img_obj_end_request() occur
> at the same time.  The lower request wins the race
> to get the spinlock before checking if it is the
> next one to complete.  So the higher request waits;
> but note that both of them have been marked "done"
> at this point.
>
> The lower request enters the completion loop, and
> calls rbd_imb_obj_end_request() on both requests,
> and updates the next_completion value before releasing
> the spinlock.  Now the higher request acquires the
> lock and gets its chance to check its position against
> next_completion.  Since completion of this higher
> request was now been done already while processing
> the lower request, next_completion has now gone
> past the higher request.  Therefore its "which"
> value will not be higher than "next_completion"
> but neither will it be equal to it, so the old
> assertion would fail.
>
> So my proposed fix just recognizes that, and allows
> only the object request matching next_completion
> to enter the completion loop, without assuming any
> other request must have a greater "which" value.
>
> Hopefully that's a more complete and clear
> explanation of the problem.
>
>
> Now, onto why that's not sufficient.
>
> Basically, at the instant an image object request is
> marked done, it is eligible to be completed by a
> call to rbd_img_obj_end_request() by a concurrent
> thread processing a lower-numbered object request in
> the same image request.  And at just after that
> instant (essentially, because of the protection
> of the completion lock), the image request itself will
> be completed by a call to rbd_img_request_complete().
> It's likely this will drop the last reference to the
> image request, so rbd_img_request_destroy() will be
> called.  And that calls rbd_img_obj_request_del() on
> each of its object requests.
>
> In other words, conceivably the higher object
> request that was stuck waiting for the spinlock
> in rbd_img_obj_end_request() may now be getting
> deleted from its image's request list.  (It
> could be worse, it may have *just* finished
> getting marked done and may not have even entered
> rbd_img_obj_end_request() yet.)  And in the
> process of getting deleted from its image request's
> object request list, an object request gets
> partially reinitialized, freed, and may get reused.
>
>
> So...  in order to have a complete fix we need to
> do some careful work with reference counting of
> image requests and object requests, and avoid
> having any of their content get disturbed until
> the last reference is dropped.
>
> Right now it's not quite right.  In particular, each
> object request needs to have a *reference* to its
> image request (not just a pointer to it), and it
> should not be dropped until the the object request
> gets destroyed.  And because one thread can destroy
> an object request being actively used by another,
> each object request should have reference that gets
> held until it is no longer needed (probably at the
> end of rbd_osd_req_callback()).
>
>
> I think for the time being, the simple fix I
> described last night will do, and it's somewhat
> unlikely--though not impossible--for the problems
> due to concurrent destruction will arise.  But
> we do need a fix.  I'm prepared to create it, but
> for now I'd like to have another opinion on my
> treatise above.

This all makes sense, I think reference counting is the right thing to
do.  We definitely do need a real fix, with the simple fix we still
have a potential use-after-free on img_request itself, not to mention
obj_requests.

It looks like img_request kref currenlty exists for posterity only.
Unless I'm missing something, its counter is set to 1 in
rbd_img_request_create() and is not incremented anywhere else, which
means that the instant rbd_img_request_put() is called, img_request is
freed.  I naively assumed it was incremented and decremented in
rbd_img_obj_request_add() and rbd_img_obj_request_del() respectively..
Maybe that's something we should look at first?

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html