The first patch tries to improve various issues in the current
implementation:
  1. The prepare_to_wait() usage in __io_sq_thread() is odd. If multiple
     ctxs share one poll thread, one ctx will put the poll thread into
     TASK_INTERRUPTIBLE, but if other ctxs still have work to do, we do
     not need to change the task's state at all. We should only do that
     when none of the ctxs have work to do.
  2. We use a round-robin strategy to let multiple ctxs share one poll
     thread, but there are various conditions in __io_sq_thread(), which
     seems complicated and may affect the round-robin strategy.

The second patch adds an IORING_SETUP_SQPOLL_PERCPU flag: rings which
have SQPOLL enabled and are willing to be bound to the same cpu can
share one poll thread by specifying this new flag. FIO can integrate
this feature easily, so multiple rings sharing the same poll thread can
be tested easily (a userspace usage sketch is also appended at the end
of this mail).

TEST:
This patch set has passed the liburing test cases. I also made fio
support the IORING_SETUP_SQPOLL_PERCPU flag and ran some io stress
tests; there were no errors and no performance regression. See the fio
job files below.

First, on an unpatched kernel, I tested a fio job file which contains
only one job with an iodepth of 128:

[global]
ioengine=io_uring
sqthread_poll=1
registerfiles=1
fixedbufs=1
hipri=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=120
ramp_time=0
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1
sqthread_poll_cpu=15

[job0]
cpus_allowed=5
iodepth=128
sqthread_poll_cpu=9

performance data: IOPS: 453k, avg lat: 282.37 usec

Second, on an unpatched kernel, I tested a fio job file which contains
4 jobs, each with an iodepth of 32:

[global]
ioengine=io_uring
sqthread_poll=1
registerfiles=1
fixedbufs=1
hipri=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=120
ramp_time=0
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1
sqthread_poll_cpu=15

[job0]
cpus_allowed=5
iodepth=32
sqthread_poll_cpu=9

[job1]
cpus_allowed=6
iodepth=32
sqthread_poll_cpu=9

[job2]
cpus_allowed=7
iodepth=32
sqthread_poll_cpu=9

[job3]
cpus_allowed=8
iodepth=32
sqthread_poll_cpu=9

performance data: IOPS: 254k, avg lat: 503.80 usec, an obvious
performance drop.

Finally, on a patched kernel, I tested a fio job file which contains
4 jobs, each with an iodepth of 32, now with the sqthread_poll_percpu
flag enabled:

[global]
ioengine=io_uring
sqthread_poll=1
registerfiles=1
fixedbufs=1
hipri=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=120
ramp_time=0
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1
#sqthread_poll_cpu=15
sqthread_poll_percpu=1 # enable percpu feature

[job0]
cpus_allowed=5
iodepth=32
sqthread_poll_cpu=9

[job1]
cpus_allowed=6
iodepth=32
sqthread_poll_cpu=9

[job2]
cpus_allowed=7
iodepth=32
sqthread_poll_cpu=9

[job3]
cpus_allowed=8
iodepth=32
sqthread_poll_cpu=9

performance data: IOPS: 438k, avg lat: 291.69 usec

From the above tests, we can see that IORING_SETUP_SQPOLL_PERCPU is
easy to use and introduces no obvious performance regression.

Note that I did not test IORING_SETUP_ATTACH_WQ in the above three test
cases; it is a little hard to support IORING_SETUP_ATTACH_WQ in fio.

Xiaoguang Wang (2):
  io_uring: refactor io_sq_thread() handling
  io_uring: support multiple rings to share same poll thread by
    specifying same cpu

 fs/io_uring.c                 | 289 +++++++++++++++++++---------------
 include/uapi/linux/io_uring.h |   1 +
 2 files changed, 166 insertions(+), 124 deletions(-)

-- 
2.17.2
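
Appendix (not part of the patches): a minimal userspace sketch of how an
application could opt into the proposed behaviour, i.e. two SQPOLL rings
bound to the same cpu sharing one per-cpu poll thread. It assumes liburing
built against the patched uapi header; the fallback #define below is only
a placeholder value, not the one defined by patch 2.

/*
 * Sketch only: create two SQPOLL rings pinned to the same cpu and opt
 * into per-cpu poll thread sharing via IORING_SETUP_SQPOLL_PERCPU.
 */
#include <stdio.h>
#include <string.h>
#include <liburing.h>

#ifndef IORING_SETUP_SQPOLL_PERCPU
#define IORING_SETUP_SQPOLL_PERCPU	(1U << 31)	/* placeholder value */
#endif

static int setup_percpu_sqpoll_ring(struct io_uring *ring, int cpu)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	/* SQPOLL thread, pinned to 'cpu', willing to share it with other rings */
	p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF |
		  IORING_SETUP_SQPOLL_PERCPU;
	p.sq_thread_cpu = cpu;
	p.sq_thread_idle = 1000;	/* ms of idle before the poll thread sleeps */

	return io_uring_queue_init_params(128, ring, &p);
}

int main(void)
{
	struct io_uring ring_a, ring_b;
	int ret;

	/* Both rings ask for cpu 9, so the kernel may back them with one poll thread. */
	ret = setup_percpu_sqpoll_ring(&ring_a, 9);
	if (!ret)
		ret = setup_percpu_sqpoll_ring(&ring_b, 9);
	if (ret) {
		fprintf(stderr, "ring setup failed: %s\n", strerror(-ret));
		return 1;
	}

	/* ... register files/buffers and submit I/O on both rings as usual ... */

	io_uring_queue_exit(&ring_b);
	io_uring_queue_exit(&ring_a);
	return 0;
}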