On 10/28/21 12:46, Hao Xu wrote:
On 2021/10/28 2:07 PM, Hao Xu wrote:
On 2021/10/28 2:15 AM, Pavel Begunkov wrote:
On 10/27/21 15:02, Hao Xu wrote:
Tested this patchset by manually replacing __io_queue_sqe() in
io_queue_sqe() with io_req_task_queue() to construct 'heavy' task works.
Then tested with fio:
If submissions and completions are done by the same task it doesn't
really matter in which order they're executed because the task won't
get back to userspace execution to see CQEs until tw returns.
It may matter; it depends on the time cost of submission
and the DMA IO time. Take sqpoll mode as an example,
where we submit 10 reqs:
t1  io_submit_sqes
      -->io_req_task_queue
t2  io_task_work_run
We actually do the submission in t2, but if the workload
is big enough, the 'irq completion TW' will be inserted
into the TW list after t2 is fully done, so those
'irq completion TW' will be delayed to the next round.
With this patchset, we can handle them first.
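
To make the ordering argument concrete, here is a toy userspace model of one
task-work round. It is only a sketch and not the kernel code: the enum, the
function names and the flat-array "list" are all made up for illustration,
and it assumes that with the patchset every IRQ completion item pending at
the start of a round runs before any other item.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model, not kernel code. */
enum tw_kind { TW_SUBMIT, TW_IRQ_COMPLETION };

/*
 * Fill out[] with the order in which one task-work round would run the
 * n pending items in[]. With use_prio == 0 the items run in arrival
 * order (a single list). With use_prio != 0 all IRQ completion items
 * run first, then the rest; each group keeps its arrival order.
 */
void tw_round_order(const enum tw_kind *in, size_t n, int use_prio,
                    enum tw_kind *out)
{
    size_t pos = 0;

    assert(in != NULL && out != NULL);

    if (!use_prio) {
        for (size_t i = 0; i < n; i++)
            out[pos++] = in[i];
        return;
    }
    /* First pass: the (assumed) priority list. */
    for (size_t i = 0; i < n; i++)
        if (in[i] == TW_IRQ_COMPLETION)
            out[pos++] = in[i];
    /* Second pass: the normal list. */
    for (size_t i = 0; i < n; i++)
        if (in[i] != TW_IRQ_COMPLETION)
            out[pos++] = in[i];
}

/* Number of items that run before the first IRQ completion; n if none. */
size_t tw_items_before_first_completion(const enum tw_kind *order, size_t n)
{
    assert(order != NULL);

    for (size_t i = 0; i < n; i++)
        if (order[i] == TW_IRQ_COMPLETION)
            return i;
    return n;
}
```

With three submission items queued ahead of two completion items, the
single-list order runs all three submissions before the first completion,
while the priority order runs the completions first.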
Furthermore, it might even be worse, because the earlier you submit
the better, everything else being equal.
IIRC, that's how it's with fio, right? If so, you may get better
numbers with a test that does submissions and completions in
different threads.
Because of the completion cache, I doubt it works.
For a single ctx, it seems we always update the cqring
pointer after all the TWs in the list are done.
I suddenly realized that sqpoll mode does submissions and completions
in different threads, and in this situation this patchset always
calls commit_cqring() right after handling the TWs in the priority list.
So this is the right test, or am I missing something?
Yep, should be it. So the scope of the feature is SQPOLL or
completion/submission with different tasks.
Also interesting to find an explanation for your numbers, assuming
The reason may be what I said above, but I don't have a
strict proof yet.
they're stable. 7/8 batching? How often does it go down this path?
If only one task submits requests it should already be covered
with existing batching.
The problems with the existing batching are (given there is only
one ctx):
1. we flush it after all the TWs are done
2. we batch them if we hold the uring lock.
The new batching:
1. doesn't care about the uring lock
2. can flush the completions in the priority list
in advance (which means userland can see them earlier).
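
The difference in flush point can be sketched with a toy completion ring.
This is a userspace illustration under stated assumptions, not the patch:
struct toy_cq and all function names are hypothetical, each priority item
posts exactly one CQE, priority items run first in both variants, and the
uring lock is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy ring: only the two tail counters matter here. */
struct toy_cq {
    unsigned cached_tail; /* CQEs filled in but not yet published */
    unsigned tail;        /* tail value that userland can observe */
};

void toy_post_cqe(struct toy_cq *cq)
{
    assert(cq != NULL);
    cq->cached_tail++;
}

/* Publish everything posted so far (stand-in for committing the cqring). */
void toy_commit(struct toy_cq *cq)
{
    assert(cq != NULL);
    cq->tail = cq->cached_tail;
}

/*
 * Run one round: n_prio completion items, then n_normal other items.
 * If early_flush is set, commit right after the priority items;
 * otherwise commit once at the end of the round.
 * Returns how many items had run when the CQEs became visible.
 */
size_t toy_run_round(struct toy_cq *cq, size_t n_prio, size_t n_normal,
                     int early_flush)
{
    size_t ran = 0, visible_at = 0;
    int published = 0;

    for (size_t i = 0; i < n_prio; i++) {
        toy_post_cqe(cq);
        ran++;
    }
    if (early_flush && n_prio > 0) {
        toy_commit(cq);
        visible_at = ran;
        published = 1;
    }
    for (size_t i = 0; i < n_normal; i++)
        ran++; /* other task work; posts no CQE in this model */
    if (!published) {
        toy_commit(cq);
        visible_at = ran;
    }
    return visible_at;
}
```

With 4 priority items and 10 normal items, the late flush publishes the
CQEs only after all 14 items have run, while the early flush publishes them
after the first 4.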
ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1
2/8 sets an unlimited priority_task_list; 8/8 sets a limit of
1/3 * (len_priority_list + len_normal_list). Data below:
depth      no 8/8    include 8/8    before this patchset
1            7.05           7.82                    7.10
2            8.47           8.48                    8.60
4           10.42           9.99                   10.42
8           13.78          13.13                   13.22
16          27.41          27.92                   24.33
32          49.40          46.16                   53.08
64         102.53         105.68                  103.36
128        196.98         202.76                  205.61
256        372.99         375.61                  414.88
512        747.23         763.95                  791.30
1024      1472.59        1527.46                 1538.72
2048      3153.49        3129.22                 3329.01
4096      6387.86        5899.74                 6682.54
8192     12150.25       12433.59                12774.14
16384    23085.58       24342.84                26044.71
It seems 2/8 is better. I haven't tried choices other than 1/3, but
I'm still including 8/8 here for people's further thoughts.
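
For reference, the 1/3 limit used with 8/8 can be written out as below. This
is only a sketch of the arithmetic: the helper names are hypothetical, the
integer-division rounding is an assumption, and the code says nothing about
what the patch does with items beyond the limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Limit on the priority list length: one third of the combined length
 * of the priority and normal lists. Rounding down is an assumption.
 */
size_t prio_list_limit(size_t len_priority_list, size_t len_normal_list)
{
    /* Guard against overflow of the sum. */
    assert(len_priority_list <= SIZE_MAX - len_normal_list);
    return (len_priority_list + len_normal_list) / 3;
}

/* How many priority items exceed that limit (0 if within it). */
size_t prio_list_excess(size_t len_priority_list, size_t len_normal_list)
{
    size_t limit = prio_list_limit(len_priority_list, len_normal_list);

    return len_priority_list > limit ? len_priority_list - limit : 0;
}
```

For example, with 6 priority items and 3 normal items the limit is 3, so 3
priority items are over it; with 2 priority and 10 normal items the limit is
4 and nothing is over it.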
Hao Xu (8):
  io-wq: add helper to merge two wq_lists
  io_uring: add a priority tw list for irq completion work
  io_uring: add helper for task work execution code
  io_uring: split io_req_complete_post() and add a helper
  io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  io_uring: add nr_ctx to record the number of ctx in a task
  io_uring: batch completion in prior_task_list
  io_uring: add limited number of TWs to priority task list

 fs/io-wq.h    |  21 +++++++
 fs/io_uring.c | 168 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 138 insertions(+), 51 deletions(-)
--
Pavel Begunkov