On 12/6/18 9:18 PM, Mike Snitzer wrote:
> On Thu, Dec 06 2018 at 11:06pm -0500,
> Jens Axboe <axboe@xxxxxxxxx> wrote:
>
>> On 12/6/18 8:54 PM, Mike Snitzer wrote:
>>> On Thu, Dec 06 2018 at 9:49pm -0500,
>>> Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>
>>>> After the direct dispatch corruption fix, we permanently disallow direct
>>>> dispatch of non read/write requests. This works fine off the normal IO
>>>> path, as they will be retried like any other failed direct dispatch
>>>> request. But for the blk_insert_cloned_request() that only DM uses to
>>>> bypass the bottom level scheduler, we always first attempt direct
>>>> dispatch. For some types of requests, that's now a permanent failure,
>>>> and no amount of retrying will make that succeed.
>>>>
>>>> Use the driver private RQF_DONTPREP to track this condition in DM. If
>>>> we encounter a BUSY condition from blk_insert_cloned_request(), then
>>>> flag the request with RQF_DONTPREP. When we next time see this request,
>>>> ask blk_insert_cloned_request() to bypass insert the request directly.
>>>> This avoids the livelock of repeatedly trying to direct dispatch a
>>>> request, while still retaining the BUSY feedback loop for blk-mq so
>>>> that we don't over-dispatch to the lower level queue and mess up
>>>> opportunities for merging on the DM queue.
>>>>
>>>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>>>> Reported-by: Bart Van Assche <bvanassche@xxxxxxx>
>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>>>
>>>> ---
>>>>
>>>> This passes my testing as well, like the previous patch. But unlike the
>>>> previous patch, we retain the BUSY feedback loop information for better
>>>> merging.
>>>
>>> But it is kind of gross to workaround the new behaviour to "permanently
>>> disallow direct dispatch of non read/write requests" by always failing
>>> such requests back to DM for later immediate direct dispatch. That
>>> bouncing of the request was acceptable when there was load-based
>>> justification for having to retry (and in doing so: taking the cost of
>>> freeing the clone request gotten via get_request() from the underlying
>>> request_queues).
>>>
>>> Having to retry like this purely because the request isn't a read or
>>> write seems costly.. every non-read-write will have implied
>>> request_queue bouncing. In multipath's case: it could select an
>>> entirely different underlying path the next time it is destaged (with
>>> RQF_DONTPREP set). Which you'd think would negate all hope of IO
>>> merging based performance improvements -- but that is a tangent I'll
>>> need to ask Ming about (again).
>>>
>>> I really don't like this business of bouncing requests as a workaround
>>> for the recent implementation of the corruption fix.
>>>
>>> Why not just add an override flag to _really_ allow direct dispatch for
>>> _all_ types of requests?
>>>
>>> (just peeked at linux-block and it is looking like you took
>>> jianchao.wang's series to avoid this hack... ;)
>>>
>>> Awesome.. my work is done for tonight!
>>
>> The whole point is doing something that is palatable to 4.20 and leaving
>> the more experimental stuff to 4.21, where we have some weeks to verify
>> that there are no conditions that cause IO stalls. I don't envision there
>> will be, but I'm not willing to risk it this late in the 4.20 cycle.
>>
>> That said, this isn't a quick and dirty and I don't think it's fair
>> calling this a hack.
>> Using RQF_DONTPREP is quite common in drivers to
>> retain state over multiple ->queue_rq invocations. Using it to avoid
>> multiple direct dispatch failures (and obviously this new livelock)
>> seems fine to me.
>
> But it bounces IO purely because non-read-write. That results in
> guaranteed multiple blk_get_request() -- from underlying request_queues
> request-based DM is stacked on -- for every non-read-write IO that is
> cloned. That seems pathological. I must still be missing something.
>
>> I really don't want to go around and audit every driver for potential
>> retained state over special commands, that's why the read+write thing is
>> in place. It's the safe option, which is what we need right now.
>
> Maybe leave blk_mq_request_issue_directly() interface how it is,
> non-read-write restriction and all, but export a new
> __blk_mq_request_issue_directly() that _only_
> blk_insert_cloned_request() -- and future comparable users -- makes use
> of?
>
> To me that is the best of both worlds: Fix corruption issue but don't
> impose needless blk_get_request() dances for non-read-write IO issued to
> dm-multipath.

Alright, I hear your point. Maybe we are going to be better off with
just dispatch insert. Here's a version of that on top of -git, basically
reverting the previous one, and applying the two hunks from Ming, but
also adding a missed one in blk_mq_try_issue_list_directly() where we
don't want to be adding the current request back to the list. That
should go to dispatch, too.

Note note note - not tested yet. Will do so now.
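
In short, the intended net effect: when a direct issue attempt returns
BUSY (BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE), the request now goes
straight to the hctx dispatch list via blk_mq_request_bypass_insert()
instead of being re-inserted through the I/O scheduler or put back on
the caller's list. The fragment below is only an illustrative
condensation of that fallback (mirroring the blk_mq_try_issue_directly()
hunk), not a substitute for the actual diff that follows.

	/*
	 * Illustrative sketch only -- see the real diff below. On BUSY
	 * from direct issue, fall back to the hctx dispatch list rather
	 * than going back through the I/O scheduler.
	 */
	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
		blk_mq_request_bypass_insert(rq, true);
	else if (ret != BLK_STS_OK)
		blk_mq_end_request(rq, ret);
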
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3262d83b9e07..89270a924ba8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1736,18 +1736,6 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
-/*
- * Don't allow direct dispatch of anything but regular reads/writes,
- * as some of the other commands can potentially share request space
- * with data we need for the IO scheduler. If we attempt a direct dispatch
- * on those and fail, we can't safely add it to the scheduler afterwards
- * without potentially overwriting data that the driver has already written.
- */
-static bool blk_rq_can_direct_dispatch(struct request *rq)
-{
-	return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE;
-}
-
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 						struct request *rq,
 						blk_qc_t *cookie,
@@ -1769,7 +1757,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		goto insert;
 	}
 
-	if (!blk_rq_can_direct_dispatch(rq) || (q->elevator && !bypass_insert))
+	if (q->elevator && !bypass_insert)
 		goto insert;
 
 	if (!blk_mq_get_dispatch_budget(hctx))
@@ -1785,7 +1773,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (bypass_insert)
 		return BLK_STS_RESOURCE;
 
-	blk_mq_sched_insert_request(rq, false, run_queue, false);
+	blk_mq_request_bypass_insert(rq, run_queue);
 	return BLK_STS_OK;
 }
 
@@ -1801,7 +1789,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
 	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
-		blk_mq_sched_insert_request(rq, false, true, false);
+		blk_mq_request_bypass_insert(rq, true);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
 
@@ -1831,15 +1819,12 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq = list_first_entry(list, struct request,
 				queuelist);
 
-		if (!blk_rq_can_direct_dispatch(rq))
-			break;
-
 		list_del_init(&rq->queuelist);
 		ret = blk_mq_request_issue_directly(rq);
 		if (ret != BLK_STS_OK) {
 			if (ret == BLK_STS_RESOURCE ||
 					ret == BLK_STS_DEV_RESOURCE) {
-				list_add(&rq->queuelist, list);
+				blk_mq_request_bypass_insert(rq, true);
 				break;
 			}
 			blk_mq_end_request(rq, ret);

-- 
Jens Axboe