Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5

Keith Busch <kbusch@xxxxxxxxxx> · Tue, 15 Dec 2020 03:23:10 +0900

On Mon, Dec 14, 2020 at 05:58:56PM +0000, Pavel Begunkov wrote:
> On 13/12/2020 18:19, Keith Busch wrote:
> > On Fri, Dec 11, 2020 at 12:38:43PM +0000, Pavel Begunkov wrote:
> >> On 11/12/2020 03:37, Keith Busch wrote:
> >>> It sounds like the statistic is using the wrong criteria. It ought to
> >>> use the average time for the next available completion for any request
> >>> rather than the average latency of a specific IO. It might work at high
> >>> depth if the hybrid poll knew the hctx's depth when calculating the
> >>> sleep time, but that information doesn't appear to be readily available.
> >>
> >> It polls (and so sleeps) from submission of a request to its completion,
> >> not from request to request. 
> > 
> > Right, but the polling thread is responsible for completing all
> > requests, not just the most recent cookie. If the sleep timer uses the
> > round trip of a single request when you have a high queue depth, there
> > are likely to be many completions in the pipeline that aren't getting
> > polled on time. This feeds back to the mean latency, pushing the sleep
> > timer further out.
> 
> It rather polls for a particular request and completes others along the
> way, and that's the problem. Completion-to-completion would make much
> more sense if we had a poll task separate from the waiters.
> 
> Or if the semantics were not "poll for a request" but "poll a file".
> And since io_uring, IMHO, that actually makes more sense even for
> non-hybrid polling.

The existing block layer polling semantics don't poll for a specific
request. Please see the blk_mq_ops driver API for the 'poll' function.
It takes a hardware context, which does not indicate a specific request.
See also the blk_poll() function, which doesn't consider any specific
request in order to break out of the polling loop.
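
To illustrate the point, here is a small userspace model, not kernel code.
All names in it (model_hctx, model_poll, model_blk_poll) are hypothetical
stand-ins, and the completion queue is a plain array. It only sketches the
shape described above: the poll callback sees a hardware context, and the
loop returns as soon as any completion is found, whichever request the
caller had in mind.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CQ_SIZE 8

/* Hypothetical stand-in for a hardware context: a flat completion queue. */
struct model_hctx {
	int cq[MODEL_CQ_SIZE];	/* tags of finished requests, arrival order */
	size_t head;		/* next entry to reap */
	size_t tail;		/* one past the last posted entry */
};

/*
 * Models a driver poll callback. It receives only the hardware context,
 * reaps every pending completion, and returns how many it found.
 */
static int model_poll(struct model_hctx *hctx)
{
	int found = 0;

	assert(hctx->tail <= MODEL_CQ_SIZE);
	while (hctx->head != hctx->tail) {
		hctx->head++;	/* "complete" the request at this slot */
		found++;
	}
	return found;
}

/*
 * Models the polling loop. The cookie would only select the hardware
 * context; with a single context in this model it goes unused. The loop
 * exits on the first pass that reaps anything, without checking whether
 * the caller's own request was among the completions.
 */
static int model_blk_poll(struct model_hctx *hctx, int cookie, int max_spins)
{
	(void)cookie;

	for (int i = 0; i < max_spins; i++) {
		int ret = model_poll(hctx);

		if (ret > 0)
			return ret;
	}
	return 0;
}
```

In this model, a caller that polls with cookie 7 while tags 3 and 5 are
sitting in the queue gets a return value of 2, even though tag 7 has not
completed.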