On 12/6/18 9:18 PM, Mike Snitzer wrote:
> On Thu, Dec 06 2018 at 11:06pm -0500,
> Jens Axboe <axboe@xxxxxxxxx> wrote:
>
>> On 12/6/18 8:54 PM, Mike Snitzer wrote:
>>> On Thu, Dec 06 2018 at 9:49pm -0500,
>>> Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>
>>>> After the direct dispatch corruption fix, we permanently disallow direct
>>>> dispatch of non read/write requests. This works fine off the normal IO
>>>> path, as they will be retried like any other failed direct dispatch
>>>> request. But for the blk_insert_cloned_request() that only DM uses to
>>>> bypass the bottom level scheduler, we always first attempt direct
>>>> dispatch. For some types of requests, that's now a permanent failure,
>>>> and no amount of retrying will make that succeed.
>>>>
>>>> Use the driver private RQF_DONTPREP to track this condition in DM. If
>>>> we encounter a BUSY condition from blk_insert_cloned_request(), then
>>>> flag the request with RQF_DONTPREP. When we next time see this request,
>>>> ask blk_insert_cloned_request() to bypass insert the request directly.
>>>> This avoids the livelock of repeatedly trying to direct dispatch a
>>>> request, while still retaining the BUSY feedback loop for blk-mq so
>>>> that we don't over-dispatch to the lower level queue and mess up
>>>> opportunities for merging on the DM queue.
>>>>
>>>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>>>> Reported-by: Bart Van Assche <bvanassche@xxxxxxx>
>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>>>
>>>> ---
>>>>
>>>> This passes my testing as well, like the previous patch. But unlike the
>>>> previous patch, we retain the BUSY feedback loop information for better
>>>> merging.
>>>
>>> But it is kind of gross to workaround the new behaviour to "permanently
>>> disallow direct dispatch of non read/write requests" by always failing
>>> such requests back to DM for later immediate direct dispatch. That
>>> bouncing of the request was acceptable when there was load-based
>>> justification for having to retry (and in doing so: taking the cost of
>>> freeing the clone request gotten via get_request() from the underlying
>>> request_queues).
>>>
>>> Having to retry like this purely because the request isn't a read or
>>> write seems costly.. every non-read-write will have implied
>>> request_queue bouncing. In multipath's case: it could select an
>>> entirely different underlying path the next time it is destaged (with
>>> RQF_DONTPREP set). Which you'd think would negate all hope of IO
>>> merging based performance improvements -- but that is a tangent I'll
>>> need to ask Ming about (again).
>>>
>>> I really don't like this business of bouncing requests as a workaround
>>> for the recent implementation of the corruption fix.
>>>
>>> Why not just add an override flag to _really_ allow direct dispatch for
>>> _all_ types of requests?
>>>
>>> (just peeked at linux-block and it is looking like you took
>>> jianchao.wang's series to avoid this hack... ;)
>>>
>>> Awesome.. my work is done for tonight!
>>
>> The whole point is doing something that is palatable to 4.20 and leaving
>> the more experimental stuff to 4.21, where we have some weeks to verify
>> that there are no conditions that cause IO stalls. I don't envision there
>> will be, but I'm not willing to risk it this late in the 4.20 cycle.
>>
>> That said, this isn't a quick and dirty and I don't think it's fair
>> calling this a hack.
>> Using RQF_DONTPREP is quite common in drivers to
>> retain state over multiple ->queue_rq invocations. Using it to avoid
>> multiple direct dispatch failures (and obviously this new livelock)
>> seems fine to me.
>
> But it bounces IO purely because non-read-write. That results in
> guaranteed multiple blk_get_request() -- from underlying request_queues
> request-based DM is stacked on -- for every non-read-write IO that is
> cloned. That seems pathological. I must still be missing something.
>
>> I really don't want to go around and audit every driver for potential
>> retained state over special commands, that's why the read+write thing is
>> in place. It's the safe option, which is what we need right now.
>
> Maybe leave blk_mq_request_issue_directly() interface how it is,
> non-read-write restriction and all, but export a new
> __blk_mq_request_issue_directly() that _only_
> blk_insert_cloned_request() -- and future comparable users -- makes use
> of?
>
> To me that is the best of both worlds: Fix corruption issue but don't
> impose needless blk_get_request() dances for non-read-write IO issued to
> dm-multipath.

Alright, I hear your point. Maybe we are going to be better off with
just dispatch insert. Here's a version of that on top of -git, basically
reverting the previous one, and applying the two hunks from Ming, but
also adding a missed one in blk_mq_try_issue_list_directly() where we
don't want to be adding the current request back to the list. That
should go to dispatch, too.

Note note note - not tested yet. Will do so now.
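
In short, the intended net effect: when a direct issue attempt returns
BUSY (BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE), the request now goes
straight to the hctx dispatch list via blk_mq_request_bypass_insert()
instead of being re-inserted through the I/O scheduler or put back on
the caller's list. The fragment below is only an illustrative
condensation of that fallback (mirroring the blk_mq_try_issue_directly()
hunk), not a substitute for the actual diff that follows.

	/*
	 * Illustrative sketch only -- see the real diff below. On BUSY
	 * from direct issue, fall back to the hctx dispatch list rather
	 * than going back through the I/O scheduler.
	 */
	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
		blk_mq_request_bypass_insert(rq, true);
	else if (ret != BLK_STS_OK)
		blk_mq_end_request(rq, ret);
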
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3262d83b9e07..89270a924ba8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1736,18 +1736,6 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
-/*
- * Don't allow direct dispatch of anything but regular reads/writes,
- * as some of the other commands can potentially share request space
- * with data we need for the IO scheduler. If we attempt a direct dispatch
- * on those and fail, we can't safely add it to the scheduler afterwards
- * without potentially overwriting data that the driver has already written.
- */
-static bool blk_rq_can_direct_dispatch(struct request *rq)
-{
-	return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE;
-}
-
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 						struct request *rq,
 						blk_qc_t *cookie,
@@ -1769,7 +1757,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		goto insert;
 	}
 
-	if (!blk_rq_can_direct_dispatch(rq) || (q->elevator && !bypass_insert))
+	if (q->elevator && !bypass_insert)
 		goto insert;
 
 	if (!blk_mq_get_dispatch_budget(hctx))
@@ -1785,7 +1773,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (bypass_insert)
 		return BLK_STS_RESOURCE;
 
-	blk_mq_sched_insert_request(rq, false, run_queue, false);
+	blk_mq_request_bypass_insert(rq, run_queue);
 	return BLK_STS_OK;
 }
 
@@ -1801,7 +1789,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
 	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
-		blk_mq_sched_insert_request(rq, false, true, false);
+		blk_mq_request_bypass_insert(rq, true);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
 
@@ -1831,15 +1819,12 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq = list_first_entry(list, struct request,
 				queuelist);
 
-		if (!blk_rq_can_direct_dispatch(rq))
-			break;
-
 		list_del_init(&rq->queuelist);
 		ret = blk_mq_request_issue_directly(rq);
 		if (ret != BLK_STS_OK) {
 			if (ret == BLK_STS_RESOURCE ||
 					ret == BLK_STS_DEV_RESOURCE) {
-				list_add(&rq->queuelist, list);
+				blk_mq_request_bypass_insert(rq, true);
 				break;
 			}
 			blk_mq_end_request(rq, ret);

-- 
Jens Axboe