On 9/29/21 12:38 PM, Hao Xu wrote:
> On 9/28/21 7:29 PM, Pavel Begunkov wrote:
[...]
>>> @@ -2132,12 +2136,16 @@ static void tctx_task_work(struct callback_head *cb)
>>>       while (1) {
>>>           struct io_wq_work_node *node;
>>> -        if (!tctx->task_list.first && locked)
>>> +        if (!tctx->prior_task_list.first &&
>>> +            !tctx->task_list.first && locked)
>>>               io_submit_flush_completions(ctx);
>>>           spin_lock_irq(&tctx->task_lock);
>>> -        node = tctx->task_list.first;
>>> +        wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
>>> +        node = tctx->prior_task_list.first;
>>
>> I find all this accounting expensive, and I'm sure I'll see it in my
>> BPF tests.
> May I ask how you evaluate the overhead with BPF here?

It's a custom branch, and it would apparently need some thinking on how
to apply your stuff on top of it, because of yet another list in [1].
In short, the case in mind spins inside tctx_task_work() doing one
request at a time. I think it would be easier if I try it out myself.

[1] https://github.com/isilence/linux/commit/d6285a9817eb26aa52ad54a79584512d7efa82fd

>>
>> How about
>> 1) remove MAX_EMERGENCY_TW_RATIO and all the counters,
>>    prior_nr and others.
>>
>> 2) rely solely on list merging
>>
>> So, when it enters an iteration of the loop it finds a set of requests
>> to run, it first executes all priority ones of that set and then the
>> rest (just by the fact that you merged the lists and execute all of
>> them).
>>
>> It solves the problem of total starvation of non-prio requests, e.g.
>> if new completions keep coming as fast as you complete the previous
>> ones. One downside is that prio requests arriving while we execute a
>> previous batch will run only after the non-prio requests of that
>> batch. I don't think it's much of a problem, but it would be
>> interesting to see numbers.
> Hmm, this probably doesn't solve the starvation, since there may be
> a large number of priority TWs ahead of the non-prio TWs in one
> iteration, e.g. when submitting many sqes in one io_submit_sqes() call.
> That's why I cap priority TWs at 1/3 there.

I don't think it's a problem: they should be fast enough, and we have
forward progress guarantees for the non-prio requests (see the sketch
at the end of this mail). IMHO that should be enough.

>>
>>
>>>           INIT_WQ_LIST(&tctx->task_list);
>>> +        INIT_WQ_LIST(&tctx->prior_task_list);
>>> +        tctx->nr = tctx->prior_nr = 0;
>>>           if (!node)
>>>               tctx->task_running = false;
>>>           spin_unlock_irq(&tctx->task_lock);
>>> @@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb)
>>>           ctx_flush_and_put(ctx, &locked);
>>>       }
>>> -static void io_req_task_work_add(struct io_kiocb *req)
>>> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
>>
>> I think "priority" instead of "emergency" would be more accurate.
>>
>>>   {
>>>       struct task_struct *tsk = req->task;
>>>       struct io_uring_task *tctx = tsk->io_uring;
>>> @@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req)
>>>       WARN_ON_ONCE(!tctx);
>>>       spin_lock_irqsave(&tctx->task_lock, flags);
>>> -    wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>> +    if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
>>> +        wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
>>> +        tctx->prior_nr++;
>>> +    } else {
>>> +        wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>> +    }
>>> +    tctx->nr++;
>>>       running = tctx->task_running;
>>>       if (!running)
>>>           tctx->task_running = true;
>>
>>
>>

-- 
Pavel Begunkov
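
A minimal user-space sketch of the merge-only scheme discussed above.
This is not the kernel code: the list/node types, the lock-free
producer/consumer split, and the printf "execution" are simplified
stand-ins for io_wq_work_list, tctx->task_lock, and running the actual
task works. It only models the ordering property under discussion:
after the merge, a single walk runs all priority entries of a batch
first, and the non-prio entries of that batch still run in the same
pass, so they cannot be starved indefinitely.

/* build: cc -std=c99 -o tw_model tw_model.c && ./tw_model */
#include <stdio.h>
#include <stdbool.h>

struct node {
	int id;
	bool prio;
	struct node *next;
};

struct list {
	struct node *first;
	struct node *last;
};

/* Tail append, standing in for wq_list_add_tail(). */
static void list_add_tail(struct list *l, struct node *n)
{
	n->next = NULL;
	if (l->last)
		l->last->next = n;
	else
		l->first = n;
	l->last = n;
}

/*
 * Splice @normal after @prio and empty @normal, so one walk of @prio
 * sees all priority entries first (standing in for wq_list_merge()).
 */
static void list_merge(struct list *prio, struct list *normal)
{
	if (!prio->first)
		prio->first = normal->first;
	else if (normal->first)
		prio->last->next = normal->first;
	if (normal->last)
		prio->last = normal->last;
	normal->first = normal->last = NULL;
}

int main(void)
{
	struct list prio = {0}, normal = {0};
	struct node reqs[6], *n;
	int i;

	/* Producer side: interleaved priority and normal completions. */
	for (i = 0; i < 6; i++) {
		reqs[i].id = i;
		reqs[i].prio = (i % 2 == 0);
		list_add_tail(reqs[i].prio ? &prio : &normal, &reqs[i]);
	}

	/* Consumer side: one tctx_task_work()-style pass over the batch. */
	list_merge(&prio, &normal);
	for (n = prio.first; n; n = n->next)
		printf("run req %d (%s)\n", n->id, n->prio ? "prio" : "normal");
	return 0;
}

This prints requests 0, 2, 4 (prio) and then 1, 3, 5 (normal). The
ratio-gated variant from the patch only changes the producer side: the
add falls back to the normal list once
prior_nr * MAX_EMERGENCY_TW_RATIO >= nr, i.e. (assuming the ratio is 3,
per the 1/3 mentioned above) once priority entries make up a third of
the pending batch. That per-addition accounting is exactly what the
merge-only scheme would drop.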