Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5

On Fri, Dec 11, 2020 at 12:38:43PM +0000, Pavel Begunkov wrote:
> On 11/12/2020 03:37, Keith Busch wrote:
> > It sounds like the statistic is using the wrong criteria. It ought to
> > use the average time for the next available completion for any request
> > rather than the average latency of a specific IO. It might work at high
> > depth if the hybrid poll knew the hctx's depth when calculating the
> > sleep time, but that information doesn't appear to be readily available.
> 
> It polls (and so sleeps) from submission of a request to its completion,
> not from request to request. 

Right, but the polling thread is responsible for completing all
requests, not just the most recent cookie. If the sleep timer uses the
round trip of a single request when you have a high queue depth, there
are likely to be many completions in the pipeline that aren't getting
polled on time. Those stale completions inflate the measured latencies,
which feed back into the mean and push the sleep timer further out.
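
To put rough numbers on that feedback, here's a toy model; this is
plain userspace C rather than kernel code, and the queue depth, the 8
usec device time, and the one-hybrid-sleep-per-poll assumption are all
invented for illustration. With qd = 1 the estimate stays put; at
qd = 8 each round's sleeps stack up in the measured latencies and the
mean runs away:

---
#include <stdio.h>

int main(void)
{
	const int qd = 8;		/* in-flight requests */
	const double device = 8.0;	/* usec until each completion is ready */
	double mean = device;		/* bucket's mean latency estimate */

	for (int round = 0; round < 5; round++) {
		double sleep = (mean + 1) / 2;	/* hybrid poll sleep time */
		double clock = 0, sum = 0;

		/*
		 * One poll, and so one sleep, per in-flight request. A
		 * completion is only observed once its poll runs, so its
		 * measured latency includes every preceding sleep.
		 */
		for (int i = 0; i < qd; i++) {
			clock += sleep;
			sum += clock > device ? clock : device;
		}
		mean = sum / qd;	/* fed into the next sleep */
		printf("round %d: slept %5.1f usec, new mean %6.1f usec\n",
		       round, sleep, mean);
	}
	return 0;
}
---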

> Looks like the other scheme doesn't suit well
> when you don't have a constant-ish flow of requests, e.g. QD=1 and with
> different latency in the userspace.

The idea I'm trying to convey shouldn't affect QD1. The following patch
seems to test "ok", but I know of at least a few scenarios where it
falls apart...

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e9799fed98c7..cab2dafcd3a9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3727,6 +3727,7 @@ static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb)
 static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 				       struct request *rq)
 {
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 	unsigned long ret = 0;
 	int bucket;
 
@@ -3753,6 +3754,15 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 	if (q->poll_stat[bucket].nr_samples)
 		ret = (q->poll_stat[bucket].mean + 1) / 2;
 
+	/*
+	 * Finding completions on the first poll indicates we're sleeping too
+	 * long and pushing the latency statistic in the wrong direction for
+	 * future sleep consideration. Poll immediately until the average time
+	 * becomes more useful.
+	 */
+	if (hctx->poll_invoked < 3 * hctx->poll_considered)
+		return 0;
+
 	return ret;
 }
 
---
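For anyone following along, the two counters come from blk_poll(): in
5.10 it bumps hctx->poll_considered once per call and
hctx->poll_invoked once per iteration of its poll loop, so
invoked/considered is roughly the average number of poll attempts per
wait, and a low ratio means completions were already sitting there when
polling started. A standalone sketch of just that check, with
poll_once() and all the numbers invented for illustration:

---
#include <stdio.h>
#include <stdbool.h>

/* stand-ins for the two hctx counters the patch reads */
struct hctx_stats {
	unsigned long poll_considered;	/* blk_poll() calls */
	unsigned long poll_invoked;	/* poll loop iterations */
};

/* hypothetical driver poll: completion appears after 'ready' loops */
static bool poll_once(int ready, int loop)
{
	return loop + 1 >= ready;
}

int main(void)
{
	struct hctx_stats s = { 0, 0 };

	/* ten waits that each find their completion on the first
	 * iteration, i.e. the hybrid sleep ate the whole wait */
	for (int wait = 0; wait < 10; wait++) {
		s.poll_considered++;
		for (int loop = 0; ; loop++) {
			s.poll_invoked++;
			if (poll_once(1, loop))
				break;
		}
	}

	/* the patch's check: under ~3 iterations per wait, stop
	 * sleeping and poll immediately */
	printf("invoked=%lu considered=%lu -> %s\n",
	       s.poll_invoked, s.poll_considered,
	       s.poll_invoked < 3 * s.poll_considered ?
	       "sleep suppressed" : "hybrid sleep allowed");
	return 0;
}
---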


