Re: Issue #5876 : assertion failure in rbd_img_obj_callback()

Alex Elder <elder@xxxxxxxx> · Wed, 26 Mar 2014 06:13:07 -0500

On 03/26/2014 12:00 AM, Alex Elder wrote:
> I think I've got it.  I'll explain, and we'll have to look at
> the explanation more closely in the morning.
> 
> Bottom line is that I think the assertion is bogus.  Like
> the other assertion that was not done under protection of
> a lock, this one is being done in a way that's not safe.
> 
> First, here's a patch that I think might avoid the problem:
> 
> ----------------------------------
> Index: b/drivers/block/rbd.c
> ===================================================================
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2130,9 +2130,8 @@ static void rbd_img_obj_callback(struct
>  	rbd_assert(which < img_request->obj_request_count);
> 
>  	spin_lock_irq(&img_request->completion_lock);
> -	if (which > img_request->next_completion)
> +	if (which != img_request->next_completion)
>  		goto out;
> -	rbd_assert(which == img_request->next_completion);
> 
>  	for_each_obj_request_from(img_request, obj_request) {
>  		rbd_assert(more);
> ----------------------------------

After some sleep (though not a lot), I remain convinced
the above will fix the problem, for now.  But it's not
quite enough.

The problem arises out of the fact that the thread of
control that marks an object request done is not necessarily
the thread of control that processing the completion of
that object (nor its associated image) request.  There are
no duplicate requests and no duplicate responses.

Object request completions are handled sequentially, based
on the range of address space involved in the requests.
This is the model required by the Linux blk_end_request()
interface.  Let's assume we have a single image request
that has two contiguous object requests, with the "lower"
object request involving bytes ending at the point the
"higher" one begins.

These object requests can be completed by their OSDs in
any order, and we will handle their completion in the
order their acknowledgement is received.  If we process
the higher request's acknowledgement first, that will
be the first to enter rbd_img_obj_callback().  Since
we will not have completed the lower one yet, nothing
more is done.

When the lower object request completes, it will enter
rbd_img_obj_callback() and find that it *is* the
next one we want completed.  So we'll enter a loop
that calls rbd_img_obj_end_request() on this request,
followed by its successor, doing completion processing
on any request that has by now been marked done.  In
the case I've described, the higher request will be
marked done, so it is here that its completion processing
is performed.

Once the last object request has been completed, we
call rbd_img_request_complete(), which will ultimately
lead to dropping a reference on the image request,
which leads to rbd_img_request_destroy().

What's happening in Olivier's case is that the
requests are actually completing *in order*, but
nearly simultaneously.  The lower and higher request
acknowledgements from their OSDs arrive, and their
consequent calls to rbd_img_obj_end_request() occur
at the same time.  The lower request wins the race
to get the spinlock before checking if it is the
next one to complete.  So the higher request waits;
but note that both of them have been marked "done"
at this point.

The lower request enters the completion loop, and
calls rbd_imb_obj_end_request() on both requests,
and updates the next_completion value before releasing
the spinlock.  Now the higher request acquires the
lock and gets its chance to check its position against
next_completion.  Since completion of this higher
request was now been done already while processing
the lower request, next_completion has now gone
past the higher request.  Therefore its "which"
value will not be higher than "next_completion"
but neither will it be equal to it, so the old
assertion would fail.

So my proposed fix just recognizes that, and allows
only the object request matching next_completion
to enter the completion loop, without assuming any
other request must have a greater "which" value.

Hopefully that's a more complete and clear
explanation of the problem.

Now, onto why that's not sufficient.

Basically, at the instant an image object request is
marked done, it is eligible to be completed by a
call to rbd_img_obj_end_request() by a concurrent
thread processing a lower-numbered object request in
the same image request.  And at just after that
instant (essentially, because of the protection
of the completion lock), the image request itself will
be completed by a call to rbd_img_request_complete().
It's likely this will drop the last reference to the
image request, so rbd_img_request_destroy() will be
called.  And that calls rbd_img_obj_request_del() on
each of its object requests.

In other words, conceivably the higher object
request that was stuck waiting for the spinlock
in rbd_img_obj_end_request() may now be getting
deleted from its image's request list.  (It
could be worse, it may have *just* finished
getting marked done and may not have even entered
rbd_img_obj_end_request() yet.)  And in the
process of getting deleted from its image request's
object request list, an object request gets
partially reinitialized, freed, and may get reused.

So...  in order to have a complete fix we need to
do some careful work with reference counting of
image requests and object requests, and avoid
having any of their content get disturbed until
the last reference is dropped.

Right now it's not quite right.  In particular, each
object request needs to have a *reference* to its
image request (not just a pointer to it), and it
should not be dropped until the the object request
gets destroyed.  And because one thread can destroy
an object request being actively used by another,
each object request should have reference that gets
held until it is no longer needed (probably at the
end of rbd_osd_req_callback()).

I think for the time being, the simple fix I
described last night will do, and it's somewhat
unlikely--though not impossible--for the problems
due to concurrent destruction will arise.  But
we do need a fix.  I'm prepared to create it, but
for now I'd like to have another opinion on my
treatise above.

					-Alex
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html