On 03/26/2014 12:00 AM, Alex Elder wrote: > I think I've got it. I'll explain, and we'll have to look at > the explanation more closely in the morning. > > Bottom line is that I think the assertion is bogus. Like > the other assertion that was not done under protection of > a lock, this one is being done in a way that's not safe. > > First, here's a patch that I think might avoid the problem: > > ---------------------------------- > Index: b/drivers/block/rbd.c > =================================================================== > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -2130,9 +2130,8 @@ static void rbd_img_obj_callback(struct > rbd_assert(which < img_request->obj_request_count); > > spin_lock_irq(&img_request->completion_lock); > - if (which > img_request->next_completion) > + if (which != img_request->next_completion) > goto out; > - rbd_assert(which == img_request->next_completion); > > for_each_obj_request_from(img_request, obj_request) { > rbd_assert(more); > ---------------------------------- After some sleep (though not a lot), I remain convinced the above will fix the problem, for now. But it's not quite enough. The problem arises out of the fact that the thread of control that marks an object request done is not necessarily the thread of control that processing the completion of that object (nor its associated image) request. There are no duplicate requests and no duplicate responses. Object request completions are handled sequentially, based on the range of address space involved in the requests. This is the model required by the Linux blk_end_request() interface. Let's assume we have a single image request that has two contiguous object requests, with the "lower" object request involving bytes ending at the point the "higher" one begins. These object requests can be completed by their OSDs in any order, and we will handle their completion in the order their acknowledgement is received. If we process the higher request's acknowledgement first, that will be the first to enter rbd_img_obj_callback(). Since we will not have completed the lower one yet, nothing more is done. When the lower object request completes, it will enter rbd_img_obj_callback() and find that it *is* the next one we want completed. So we'll enter a loop that calls rbd_img_obj_end_request() on this request, followed by its successor, doing completion processing on any request that has by now been marked done. In the case I've described, the higher request will be marked done, so it is here that its completion processing is performed. Once the last object request has been completed, we call rbd_img_request_complete(), which will ultimately lead to dropping a reference on the image request, which leads to rbd_img_request_destroy(). What's happening in Olivier's case is that the requests are actually completing *in order*, but nearly simultaneously. The lower and higher request acknowledgements from their OSDs arrive, and their consequent calls to rbd_img_obj_end_request() occur at the same time. The lower request wins the race to get the spinlock before checking if it is the next one to complete. So the higher request waits; but note that both of them have been marked "done" at this point. The lower request enters the completion loop, and calls rbd_imb_obj_end_request() on both requests, and updates the next_completion value before releasing the spinlock. Now the higher request acquires the lock and gets its chance to check its position against next_completion. Since completion of this higher request was now been done already while processing the lower request, next_completion has now gone past the higher request. Therefore its "which" value will not be higher than "next_completion" but neither will it be equal to it, so the old assertion would fail. So my proposed fix just recognizes that, and allows only the object request matching next_completion to enter the completion loop, without assuming any other request must have a greater "which" value. Hopefully that's a more complete and clear explanation of the problem. Now, onto why that's not sufficient. Basically, at the instant an image object request is marked done, it is eligible to be completed by a call to rbd_img_obj_end_request() by a concurrent thread processing a lower-numbered object request in the same image request. And at just after that instant (essentially, because of the protection of the completion lock), the image request itself will be completed by a call to rbd_img_request_complete(). It's likely this will drop the last reference to the image request, so rbd_img_request_destroy() will be called. And that calls rbd_img_obj_request_del() on each of its object requests. In other words, conceivably the higher object request that was stuck waiting for the spinlock in rbd_img_obj_end_request() may now be getting deleted from its image's request list. (It could be worse, it may have *just* finished getting marked done and may not have even entered rbd_img_obj_end_request() yet.) And in the process of getting deleted from its image request's object request list, an object request gets partially reinitialized, freed, and may get reused. So... in order to have a complete fix we need to do some careful work with reference counting of image requests and object requests, and avoid having any of their content get disturbed until the last reference is dropped. Right now it's not quite right. In particular, each object request needs to have a *reference* to its image request (not just a pointer to it), and it should not be dropped until the the object request gets destroyed. And because one thread can destroy an object request being actively used by another, each object request should have reference that gets held until it is no longer needed (probably at the end of rbd_osd_req_callback()). I think for the time being, the simple fix I described last night will do, and it's somewhat unlikely--though not impossible--for the problems due to concurrent destruction will arise. But we do need a fix. I'm prepared to create it, but for now I'd like to have another opinion on my treatise above. -Alex -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html