On Wed, Mar 26, 2014 at 1:13 PM, Alex Elder <elder@xxxxxxxx> wrote: > On 03/26/2014 12:00 AM, Alex Elder wrote: >> I think I've got it. I'll explain, and we'll have to look at >> the explanation more closely in the morning. >> >> Bottom line is that I think the assertion is bogus. Like >> the other assertion that was not done under protection of >> a lock, this one is being done in a way that's not safe. >> >> First, here's a patch that I think might avoid the problem: >> >> ---------------------------------- >> Index: b/drivers/block/rbd.c >> =================================================================== >> --- a/drivers/block/rbd.c >> +++ b/drivers/block/rbd.c >> @@ -2130,9 +2130,8 @@ static void rbd_img_obj_callback(struct >> rbd_assert(which < img_request->obj_request_count); >> >> spin_lock_irq(&img_request->completion_lock); >> - if (which > img_request->next_completion) >> + if (which != img_request->next_completion) >> goto out; >> - rbd_assert(which == img_request->next_completion); >> >> for_each_obj_request_from(img_request, obj_request) { >> rbd_assert(more); >> ---------------------------------- > > > After some sleep (though not a lot), I remain convinced > the above will fix the problem, for now. But it's not > quite enough. > > The problem arises out of the fact that the thread of > control that marks an object request done is not necessarily > the thread of control that processing the completion of > that object (nor its associated image) request. There are > no duplicate requests and no duplicate responses. > > > Object request completions are handled sequentially, based > on the range of address space involved in the requests. > This is the model required by the Linux blk_end_request() > interface. Let's assume we have a single image request > that has two contiguous object requests, with the "lower" > object request involving bytes ending at the point the > "higher" one begins. > > These object requests can be completed by their OSDs in > any order, and we will handle their completion in the > order their acknowledgement is received. If we process > the higher request's acknowledgement first, that will > be the first to enter rbd_img_obj_callback(). Since > we will not have completed the lower one yet, nothing > more is done. > > When the lower object request completes, it will enter > rbd_img_obj_callback() and find that it *is* the > next one we want completed. So we'll enter a loop > that calls rbd_img_obj_end_request() on this request, > followed by its successor, doing completion processing > on any request that has by now been marked done. In > the case I've described, the higher request will be > marked done, so it is here that its completion processing > is performed. > > Once the last object request has been completed, we > call rbd_img_request_complete(), which will ultimately > lead to dropping a reference on the image request, > which leads to rbd_img_request_destroy(). > > > What's happening in Olivier's case is that the > requests are actually completing *in order*, but > nearly simultaneously. The lower and higher request > acknowledgements from their OSDs arrive, and their > consequent calls to rbd_img_obj_end_request() occur > at the same time. The lower request wins the race > to get the spinlock before checking if it is the > next one to complete. So the higher request waits; > but note that both of them have been marked "done" > at this point. > > The lower request enters the completion loop, and > calls rbd_imb_obj_end_request() on both requests, > and updates the next_completion value before releasing > the spinlock. Now the higher request acquires the > lock and gets its chance to check its position against > next_completion. Since completion of this higher > request was now been done already while processing > the lower request, next_completion has now gone > past the higher request. Therefore its "which" > value will not be higher than "next_completion" > but neither will it be equal to it, so the old > assertion would fail. > > So my proposed fix just recognizes that, and allows > only the object request matching next_completion > to enter the completion loop, without assuming any > other request must have a greater "which" value. > > Hopefully that's a more complete and clear > explanation of the problem. > > > Now, onto why that's not sufficient. > > Basically, at the instant an image object request is > marked done, it is eligible to be completed by a > call to rbd_img_obj_end_request() by a concurrent > thread processing a lower-numbered object request in > the same image request. And at just after that > instant (essentially, because of the protection > of the completion lock), the image request itself will > be completed by a call to rbd_img_request_complete(). > It's likely this will drop the last reference to the > image request, so rbd_img_request_destroy() will be > called. And that calls rbd_img_obj_request_del() on > each of its object requests. > > In other words, conceivably the higher object > request that was stuck waiting for the spinlock > in rbd_img_obj_end_request() may now be getting > deleted from its image's request list. (It > could be worse, it may have *just* finished > getting marked done and may not have even entered > rbd_img_obj_end_request() yet.) And in the > process of getting deleted from its image request's > object request list, an object request gets > partially reinitialized, freed, and may get reused. > > > So... in order to have a complete fix we need to > do some careful work with reference counting of > image requests and object requests, and avoid > having any of their content get disturbed until > the last reference is dropped. > > Right now it's not quite right. In particular, each > object request needs to have a *reference* to its > image request (not just a pointer to it), and it > should not be dropped until the the object request > gets destroyed. And because one thread can destroy > an object request being actively used by another, > each object request should have reference that gets > held until it is no longer needed (probably at the > end of rbd_osd_req_callback()). > > > I think for the time being, the simple fix I > described last night will do, and it's somewhat > unlikely--though not impossible--for the problems > due to concurrent destruction will arise. But > we do need a fix. I'm prepared to create it, but > for now I'd like to have another opinion on my > treatise above. This all makes sense, I think reference counting is the right thing to do. We definitely do need a real fix, with the simple fix we still have a potential use-after-free on img_request itself, not to mention obj_requests. It looks like img_request kref currenlty exists for posterity only. Unless I'm missing something, its counter is set to 1 in rbd_img_request_create() and is not incremented anywhere else, which means that the instant rbd_img_request_put() is called, img_request is freed. I naively assumed it was incremented and decremented in rbd_img_obj_request_add() and rbd_img_obj_request_del() respectively.. Maybe that's something we should look at first? Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html