On 11/12/2020 03:37, Keith Busch wrote:
> On Fri, Dec 11, 2020 at 01:44:38AM +0000, Pavel Begunkov wrote:
>> On 11/12/2020 01:19, Andres Freund wrote:
>>> On 2020-12-10 23:15:15 +0000, Pavel Begunkov wrote:
>>>> On 10/12/2020 23:12, Pavel Begunkov wrote:
>>>>> On 10/12/2020 20:51, Andres Freund wrote:
>>>>>> Hi,
>>>>>>
>>>>>> When using hybrid polling (i.e. echo 0 >
>>>>>> /sys/block/nvme1n1/queue/io_poll_delay) I see stalls with fio when using
>>>>>> an iodepth > 1. Sometimes fio hangs, other times the performance is
>>>>>> really poor. I reproduced this with SSDs from different vendors.
>>>>>
>>>>> Can you get poll stats from debugfs while running with hybrid?
>>>>> For both iodepth=1 and 32.
>>>>
>>>> Even better, for 32 show it dynamically, i.e. cat it several
>>>> times while it's running.
>>>
>>> Should read all email before responding...
>>>
>>> This is a loop of grepping for 4k writes (only type I am doing), with 1s
>>> interval. I started it before the fio run (after one with
>>> iodepth=1). Once the iodepth 32 run finished (--timeout 10, but took
>>> 42s), I started a --iodepth 1 run.
>>
>> Thanks! Your mean grows to more than 30s, so it'll sleep for 15s for each
>> IO. Yep, the sleep time calculation is clearly broken for you.
>>
>> In general the current hybrid polling doesn't work well with high QD;
>> that's because the statistics it is based on are not very resilient to all
>> sorts of problems. And it might be the problem I described long ago:
>>
>> https://www.spinics.net/lists/linux-block/msg61479.html
>> https://lkml.org/lkml/2019/4/30/120
>
> It sounds like the statistic is using the wrong criteria. It ought to
> use the average time for the next available completion for any request
> rather than the average latency of a specific IO. It might work at high
> depth if the hybrid poll knew the hctx's depth when calculating the
> sleep time, but that information doesn't appear to be readily available.

It polls (and so sleeps) from submission of a request to its completion,
not from request to request. The other scheme doesn't look like it works
well when you don't have a constant-ish flow of requests, e.g. QD=1 and
with varying latency in userspace.

--
Pavel Begunkov
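
For reference, the adaptive hybrid-poll scheme discussed above sleeps for
roughly half of the observed mean submission-to-completion latency before
busy-polling for the completion. Below is a minimal user-space sketch (not
the kernel code; the per-IO service time, FIFO-completion assumption, and
queue depths are made-up illustration values) of why that per-IO mean, and
therefore the computed sleep, inflates as queue depth grows:

/*
 * Sketch: hybrid poll sleeps ~half of the observed mean per-IO latency.
 * With QD requests in flight and roughly FIFO completion, the j-th queued
 * request waits behind j-1 others, so its submission-to-completion latency
 * is about j * service_time, and the per-IO mean scales with queue depth.
 */
#include <stdio.h>

int main(void)
{
    double service_us = 100.0;          /* assumed per-IO device service time */
    int qd[] = { 1, 32 };

    for (int i = 0; i < 2; i++) {
        /* mean over one batch of qd requests: ~ (qd + 1) / 2 * service time */
        double mean_us = service_us * (qd[i] + 1) / 2.0;
        double sleep_us = mean_us / 2.0;    /* hybrid poll: sleep half the mean */

        printf("QD=%2d: observed mean %.0f us -> hybrid sleep %.0f us\n",
               qd[i], mean_us, sleep_us);
    }
    return 0;
}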