On 8/25/21 4:19 AM, Hao Xu wrote:
> On 2021/8/24 8:57 PM, Pavel Begunkov wrote:
>> On 8/23/21 7:36 PM, Hao Xu wrote:
>>> Now we have a lot of task_work users, and some of them just
>>> complete a req and generate a cqe. Let's put such work at the
>>> head of the task_list so that it can be handled quickly, and
>>> thus reduce the average req latency. An explanatory case:
>>>
>>> original timeline:
>>>     submit_sqe-->irq-->add completion task_work
>>>     -->run heavy work0~n-->run completion task_work
>>> new timeline:
>>>     submit_sqe-->irq-->add completion task_work
>>>     -->run completion task_work-->run heavy work0~n
>>
>> Might be good. There are not so many hot tw users:
>> poll, queuing linked requests, and the new IRQ. Could be
>> BPF in the future.
>
> async buffered reads as well, since buffered reads are a
> hot operation.

A good case as well, forgot about it. It shouldn't be too hot,
though, as it only kicks in when reads are served out of the
page cache.

>> So, for the test case I'd think about some heavy-ish
>> submissions linked to your IRQ req. For instance,
>> keeping a large QD of
>>
>> read(IRQ-based) -> linked read_pipe(PAGE_SIZE);
>>
>> and running it for a while, so they get completely
>> out of sync and the tw items really mix up. Reads from
>> pipes of size <= PAGE_SIZE complete inline, but the
>> copy takes enough time.
>
> Thanks Pavel. Previously I tried
>     direct read --> buffered read (async buffered read)
> and didn't see much difference. I'll try the case you
> suggested above.

Hmm, considering that pipes have to be refilled, buffered
reads may be a better option. I'd make them all read the same
page, plus a registered buffer and a registered file. Then it
will probably depend on how fast your main SSD is.

mem = malloc_align(4096);
io_uring_register_buffer(mem, 4096);
// preferably another disk/SSD from the fast one
fd2 = open("./file");

// loop
read(fast_ssd, DIRECT, 512) -> read(fd2, fixed_buf, 4096);

Interesting what it'll yield. With buffered reads it's
probably worth experimenting with 2 * PAGE_SIZE, or even
slightly more, to increase the heavy part.

Btw, I'd look at the latency distribution (90%, 99%) as well;
it may take the worst hit.

>> One thing is that Jens specifically wanted tw's to
>> be in FIFO order, where IRQ-based ones will be in LIFO.
>> I don't think it's a real problem though, the
>> completion handler should be brief enough.
>
> In my latest code, the IRQ-based tw are also FIFO among
> themselves, only LIFO between IRQ-based tw and other tw:
>
> timeline:  tw1 tw2 irq1 irq2
> task_list: irq1 irq2 tw1 tw2

--
Pavel Begunkov
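
To make the suggested test concrete, below is a minimal sketch,
assuming liburing, of what the pseudo-code above could look like.
It is only an illustration, not code from this thread: the file
paths are hypothetical, the registered-file step is omitted, and
a real test would keep a large QD of such linked pairs in flight
rather than submitting one pair at a time.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <liburing.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        void *mem;
        int fd1, fd2, i, n;

        if (io_uring_queue_init(64, &ring, 0))
                return 1;

        /* 4k-aligned registered buffer, as suggested above */
        if (posix_memalign(&mem, 4096, 4096))
                return 1;
        struct iovec iov = { .iov_base = mem, .iov_len = 4096 };
        if (io_uring_register_buffers(&ring, &iov, 1))
                return 1;

        /* hypothetical paths: fd1 on the fast SSD (IRQ-driven,
         * O_DIRECT), fd2 on another disk/SSD */
        fd1 = open("./fast_ssd_file", O_RDONLY | O_DIRECT);
        fd2 = open("./file", O_RDONLY);
        if (fd1 < 0 || fd2 < 0)
                return 1;

        for (n = 0; n < 1000000; n++) {
                struct io_uring_sqe *sqe;

                /* IRQ-based O_DIRECT read, linked to ... */
                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_read(sqe, fd1, mem, 512, 0);
                sqe->flags |= IOSQE_IO_LINK;

                /* ... a buffered read of the same page into the
                 * fixed buffer (buf_index 0) */
                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_read_fixed(sqe, fd2, mem, 4096, 0, 0);

                io_uring_submit(&ring);

                /* reap both completions of the linked pair */
                for (i = 0; i < 2; i++) {
                        if (io_uring_wait_cqe(&ring, &cqe))
                                return 1;
                        io_uring_cqe_seen(&ring, cqe);
                }
        }
        return 0;
}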