On 9/29/21 12:38 PM, Hao Xu wrote:
> On 9/28/21 7:29 PM, Pavel Begunkov wrote:
[...]
>>> @@ -2132,12 +2136,16 @@ static void tctx_task_work(struct callback_head *cb)
>>>       while (1) {
>>>           struct io_wq_work_node *node;
>>> -        if (!tctx->task_list.first && locked)
>>> +        if (!tctx->prior_task_list.first &&
>>> +            !tctx->task_list.first && locked)
>>>               io_submit_flush_completions(ctx);
>>>           spin_lock_irq(&tctx->task_lock);
>>> -        node = tctx->task_list.first;
>>> +        wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
>>> +        node = tctx->prior_task_list.first;
>>
>> I find all this accounting expensive, and I'm sure I'll see it in my
>> BPF tests.
> May I ask how you evaluate the overhead with BPF here?

It's a custom branch, and it would apparently need some thinking on how
to apply your stuff on top of it, because of yet another list in [1].
In short, the case in mind spins inside tctx_task_work() doing one
request at a time. I think it would be easier if I try it out myself.

[1] https://github.com/isilence/linux/commit/d6285a9817eb26aa52ad54a79584512d7efa82fd

>>
>> How about
>> 1) remove MAX_EMERGENCY_TW_RATIO and all the counters,
>>    prior_nr and others.
>>
>> 2) rely solely on list merging
>>
>> So, when it enters an iteration of the loop it finds a set of requests
>> to run, it first executes all priority ones of that set and then the
>> rest (just by the fact that you merged the lists and execute all of
>> them).
>>
>> It solves the problem of total starvation of non-prio requests, e.g.
>> if new completions keep coming as fast as you complete the previous
>> ones. One downside is that prio requests arriving while we execute a
>> previous batch will run only after the non-prio requests of that
>> batch. I don't think it's much of a problem, but it would be
>> interesting to see numbers.
> Hmm, this probably doesn't solve the starvation, since there may be
> a large number of priority TWs ahead of the non-prio TWs in one
> iteration, e.g. when submitting many sqes in one io_submit_sqes() call.
> That's why I cap priority TWs at 1/3 there.

I don't think it's a problem: they should be fast enough, and we have
forward progress guarantees for the non-prio requests (see the sketch
at the end of this mail). IMHO that should be enough.

>>
>>
>>>           INIT_WQ_LIST(&tctx->task_list);
>>> +        INIT_WQ_LIST(&tctx->prior_task_list);
>>> +        tctx->nr = tctx->prior_nr = 0;
>>>           if (!node)
>>>               tctx->task_running = false;
>>>           spin_unlock_irq(&tctx->task_lock);
>>> @@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb)
>>>           ctx_flush_and_put(ctx, &locked);
>>>       }
>>> -static void io_req_task_work_add(struct io_kiocb *req)
>>> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
>>
>> I think "priority" instead of "emergency" would be more accurate.
>>
>>>   {
>>>       struct task_struct *tsk = req->task;
>>>       struct io_uring_task *tctx = tsk->io_uring;
>>> @@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req)
>>>       WARN_ON_ONCE(!tctx);
>>>       spin_lock_irqsave(&tctx->task_lock, flags);
>>> -    wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>> +    if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
>>> +        wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
>>> +        tctx->prior_nr++;
>>> +    } else {
>>> +        wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>> +    }
>>> +    tctx->nr++;
>>>       running = tctx->task_running;
>>>       if (!running)
>>>           tctx->task_running = true;
>>
>>
>>

-- 
Pavel Begunkov
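
A minimal user-space sketch of the merge-only scheme discussed above.
This is not the kernel code: the list/node types, the lock-free
producer/consumer split, and the printf "execution" are simplified
stand-ins for io_wq_work_list, tctx->task_lock, and running the actual
task works. It only models the ordering property under discussion:
after the merge, a single walk runs all priority entries of a batch
first, and the non-prio entries of that batch still run in the same
pass, so they cannot be starved indefinitely.

/* build: cc -std=c99 -o tw_model tw_model.c && ./tw_model */
#include <stdio.h>
#include <stdbool.h>

struct node {
	int id;
	bool prio;
	struct node *next;
};

struct list {
	struct node *first;
	struct node *last;
};

/* Tail append, standing in for wq_list_add_tail(). */
static void list_add_tail(struct list *l, struct node *n)
{
	n->next = NULL;
	if (l->last)
		l->last->next = n;
	else
		l->first = n;
	l->last = n;
}

/*
 * Splice @normal after @prio and empty @normal, so one walk of @prio
 * sees all priority entries first (standing in for wq_list_merge()).
 */
static void list_merge(struct list *prio, struct list *normal)
{
	if (!prio->first)
		prio->first = normal->first;
	else if (normal->first)
		prio->last->next = normal->first;
	if (normal->last)
		prio->last = normal->last;
	normal->first = normal->last = NULL;
}

int main(void)
{
	struct list prio = {0}, normal = {0};
	struct node reqs[6], *n;
	int i;

	/* Producer side: interleaved priority and normal completions. */
	for (i = 0; i < 6; i++) {
		reqs[i].id = i;
		reqs[i].prio = (i % 2 == 0);
		list_add_tail(reqs[i].prio ? &prio : &normal, &reqs[i]);
	}

	/* Consumer side: one tctx_task_work()-style pass over the batch. */
	list_merge(&prio, &normal);
	for (n = prio.first; n; n = n->next)
		printf("run req %d (%s)\n", n->id, n->prio ? "prio" : "normal");
	return 0;
}

This prints requests 0, 2, 4 (prio) and then 1, 3, 5 (normal). The
ratio-gated variant from the patch only changes the producer side: the
add falls back to the normal list once
prior_nr * MAX_EMERGENCY_TW_RATIO >= nr, i.e. (assuming the ratio is 3,
per the 1/3 mentioned above) once priority entries make up a third of
the pending batch. That per-addition accounting is exactly what the
merge-only scheme would drop.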