On 13/12/2020 18:19, Keith Busch wrote:
> On Fri, Dec 11, 2020 at 12:38:43PM +0000, Pavel Begunkov wrote:
>> On 11/12/2020 03:37, Keith Busch wrote:
>>> It sounds like the statistic is using the wrong criteria. It ought to
>>> use the average time for the next available completion for any request
>>> rather than the average latency of a specific IO. It might work at high
>>> depth if the hybrid poll knew the hctx's depth when calculating the
>>> sleep time, but that information doesn't appear to be readily available.
>>
>> It polls (and so sleeps) from submission of a request to its completion,
>> not from request to request.
>
> Right, but the polling thread is responsible for completing all
> requests, not just the most recent cookie. If the sleep timer uses the
> round trip of a single request when you have a high queue depth, there
> are likely to be many completions in the pipeline that aren't getting
> polled on time. This feeds back to the mean latency, pushing the sleep
> timer further out.

It rather polls for a particular request and completes others along the
way, and that's the problem. Completion-to-completion would make much
more sense if we had a poll task separate from the waiters, or if the
semantics were not "poll for a request" but "poll a file". And with
io_uring, IMHO, the latter actually makes more sense even for non-hybrid
polling.

>
>> Looks like the other scheme doesn't suit well
>> when you don't have a constant-ish flow of requests, e.g. QD=1 and with
>> different latency in the userspace.
>
> The idea I'm trying to convey shouldn't affect QD1. The following patch
> seems to test "ok", but I know of at least a few scenarios where it
> falls apart...
>
> ---
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e9799fed98c7..cab2dafcd3a9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3727,6 +3727,7 @@ static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb)
>  static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
>  				       struct request *rq)
>  {
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  	unsigned long ret = 0;
>  	int bucket;
>
> @@ -3753,6 +3754,15 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
>  	if (q->poll_stat[bucket].nr_samples)
>  		ret = (q->poll_stat[bucket].mean + 1) / 2;
>
> +	/*
> +	 * Finding completions on the first poll indicates we're sleeping too
> +	 * long and pushing the latency statistic in the wrong direction for
> +	 * future sleep consideration. Poll immediately until the average time
> +	 * becomes more useful.
> +	 */
> +	if (hctx->poll_invoked < 3 * hctx->poll_considered)
> +		return 0;
> +
>  	return ret;
>  }
>
> ---

-- 
Pavel Begunkov
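
To make the feedback loop above concrete, here is a toy user-space
model. This is not kernel code: the idealized device (one completion
retired every fixed service interval while QD requests stay in flight)
and all numbers in it are assumptions for illustration. It shows why a
sleep timer set to half the per-request round trip, as
blk_mq_poll_nsecs() computes it, overshoots the gap between successive
completions once the queue depth grows, while QD=1 stays unaffected.

/*
 * Toy model, not kernel code. Assume an idealized device that retires
 * one request every service_ns while qd requests are kept in flight,
 * so each request's round trip is roughly qd * service_ns.
 */
#include <stdio.h>

int main(void)
{
	const long service_ns = 10000;	/* assumed: one completion per 10us */

	for (int qd = 1; qd <= 32; qd *= 2) {
		/* round trip of a single request at this queue depth */
		long round_trip_ns = qd * service_ns;
		/* hybrid poll sleeps half the mean round trip */
		long sleep_ns = (round_trip_ns + 1) / 2;
		/* completions that arrive, unpolled, during that sleep */
		long missed = sleep_ns / service_ns;

		printf("QD=%2d round-trip=%6ldns sleep=%6ldns missed=%ld\n",
		       qd, round_trip_ns, sleep_ns, missed);
	}
	return 0;
}

In this model, at QD=1 the 5us sleep ends before the next completion is
due, but at QD=32 the timer sleeps past roughly 16 completions. Those
late completions then inflate the measured mean, pushing the next sleep
out further, which is the runaway feedback Keith describes and what the
poll_invoked < 3 * poll_considered check in the patch tries to catch.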