Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5

Pavel Begunkov <asml.silence@xxxxxxxxx> · Mon, 14 Dec 2020 19:01:31 +0000

On 14/12/2020 18:23, Keith Busch wrote:
> On Mon, Dec 14, 2020 at 05:58:56PM +0000, Pavel Begunkov wrote:
>> On 13/12/2020 18:19, Keith Busch wrote:
>>> On Fri, Dec 11, 2020 at 12:38:43PM +0000, Pavel Begunkov wrote:
>>>> On 11/12/2020 03:37, Keith Busch wrote:
>>>>> It sounds like the statistic is using the wrong criteria. It ought to
>>>>> use the average time for the next available completion for any request
>>>>> rather than the average latency of a specific IO. It might work at high
>>>>> depth if the hybrid poll knew the hctx's depth when calculating the
>>>>> sleep time, but that information doesn't appear to be readily available.
>>>>
>>>> It polls (and so sleeps) from submission of a request to its completion,
>>>> not from request to request. 
>>>
>>> Right, but the polling thread is responsible for completing all
>>> requests, not just the most recent cookie. If the sleep timer uses the
>>> round trip of a single request when you have a high queue depth, there
>>> are likely to be many completions in the pipeline that aren't getting
>>> polled on time. This feeds back to the mean latency, pushing the sleep
>>> timer further out.
>>
>> It rather polls for a particular request and completes others by the way,
>> and that's the problem. Completion-to-completion would make much more
>> sense if we'd have a separate from waiters poll task.
>>
>> Or if the semantics would be not "poll for a request", but poll a file.
>> And since io_uring IMHO that actually makes more sense even for
>> non-hybrid polling.
> 
> The existing block layer polling semantics doesn't poll for a specific
> request. Please see the blk_mq_ops driver API for the 'poll' function.
> It takes a hardware context, which does not indicate a specific request.
> See also the blk_poll() function, which doesn't consider any specific
> request in order to break out of the polling loop.

Yeah, thanks for pointing that out. It's just that the users do it that
way: block layer dio does, and it's somewhat true for io_uring as well.
The hybrid part is also per-request based (it sleeps once per request),
and that stands out. If we went with compl-to-compl, that should be
changed. And let's not forget that subm-to-compl is sometimes more
desirable.
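
To put rough numbers on the two criteria, here is a toy sketch in Python
(not kernel code). The timestamps are made up: queue depth 4, a 100 us
round trip per request, one completion every 25 us. The "sleep for half
of the mean" rule is my assumption about the hybrid heuristic, and
sleep_estimates() is a hypothetical helper, not an existing API.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def sleep_estimates(submit_ns, complete_ns):
    """Return (subm_to_compl_sleep, compl_to_compl_sleep) in ns.

    Each value is half of the corresponding mean. The "half of the
    mean" factor is an assumption made for this illustration.
    """
    # Per-request round trip: submission of a request to its completion.
    latencies = [c - s for s, c in zip(submit_ns, complete_ns)]
    # Gap between consecutive completions, regardless of which request.
    done = sorted(complete_ns)
    gaps = [later - earlier for earlier, later in zip(done, done[1:])]
    return mean(latencies) / 2, mean(gaps) / 2


# Hypothetical steady state: a new request every 25 us, each taking
# 100 us, so four requests are in flight at any time.
submit_ns = [i * 25_000 for i in range(8)]
complete_ns = [s + 100_000 for s in submit_ns]

per_request, per_completion = sleep_estimates(submit_ns, complete_ns)
print(per_request, per_completion)
```

With these numbers the per-request estimate sleeps 50 us while a
completion lands every 25 us, so two completions sit unreaped during
each sleep. The compl-to-compl estimate is 12.5 us, which is closer to
what a poller responsible for the whole queue would want, while
subm-to-compl still fits a single waiter at depth 1.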

-- 
Pavel Begunkov