On Thu, Jan 18, 2018 at 05:49:10PM -0700, Jens Axboe wrote: > On 1/18/18 5:35 PM, Ming Lei wrote: > > On Thu, Jan 18, 2018 at 05:24:29PM -0700, Jens Axboe wrote: > >> On 1/18/18 5:18 PM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 12:14:24AM +0000, Bart Van Assche wrote: > >>>> On Fri, 2018-01-19 at 08:11 +0800, Ming Lei wrote: > >>>>> On Thu, Jan 18, 2018 at 08:37:07AM -0800, Bart Van Assche wrote: > >>>>>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c > >>>>>> index f16096af879a..c59c59cfd2a5 100644 > >>>>>> --- a/drivers/md/dm-rq.c > >>>>>> +++ b/drivers/md/dm-rq.c > >>>>>> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx, > >>>>>> /* Undo dm_start_request() before requeuing */ > >>>>>> rq_end_stats(md, rq); > >>>>>> rq_completed(md, rq_data_dir(rq), false); > >>>>>> + blk_mq_delay_run_hw_queue(hctx, 100/*ms*/); > >>>>>> return BLK_STS_RESOURCE; > >>>>>> } > >>>>>> > >>>>> > >>>>> Nak. > >>>> > >>>> This patch fixes a regression that was introduced by you. You should know > >>>> that regressions are not acceptable. If you don't agree with this patch, > >>>> please fix the root cause. > >>> > >>> Yesterday I sent a patch, did you test that? > >> > >> That patch isn't going to be of much help. It might prevent you from > >> completely stalling, but it might take you tens of seconds to get there. > > > > Yeah, that is true, and why I marked it as RFC, but the case is so rare to > > happen. > > You don't know that. If the resource is very limited, or someone else > is gobbling up all of it, then it may trigger quite often. Granted, > in that case, you need some way of signaling the freeing of those > resources to make it efficient. > > >> On top of that, it's a rolling timer that gets added when IO is queued, > >> and re-added if IO is pending when it fires. If the latter case is not > >> true, then it won't get armed again. So if IO fails to queue without > >> any other IO pending, you're pretty much in the same situation as if > > > > No IO pending detection may take a bit time, we can do it after > > BLK_STS_RESOURCE is returned and SCHED_RESTART is set. I have done > > this way in the following patch, what do you think of it? > > > > https://github.com/ming1/linux/commit/f866ce0b97b0ae22d033881da7eb07706fd458b4 > > I think it's overly complicated. As I wrote in a lengthier email earlier > today, just ensure that we have a unique return code from the driver for > when it aborts queuing an IO due to a resource shortage that isn't tied > to it's own consumption of it. When we get that return, do a delayed > queue run after X amount of time to ensure we always try again. If you > want to get fancy (later on), you could track the frequency of such > returns and complain if it's too high. Suppose the new introduced return code is BLK_STS_NO_DEV_RESOURCE. This way may degrade performance, for example, DM-MPATH returns BLK_STS_NO_DEV_RESOURCE when blk_get_request() returns NULL, blk-mq handles it by blk_mq_delay_run_hw_queue(10ms). Then blk_mq_sched_restart() is just triggered when one DM-MPATH request is completed, that means one request of underlying queue is completed too, but blk_mq_sched_restart() still can't make progress during the 10ms. That means we should only run blk_mq_delay_run_hw_queue(10ms/100ms) when the queue is idle. I will post out the patch in github for discussion. Thanks, Ming