Re: [PATCH RFC for-next 0/8] io_uring: tw contention improvements


 



On 6/22/22 19:51, Dylan Yudaken wrote:
On Wed, 2022-06-22 at 19:24 +0800, Hao Xu wrote:
On 6/22/22 19:16, Hao Xu wrote:
On 6/22/22 17:31, Dylan Yudaken wrote:
On Tue, 2022-06-21 at 15:34 +0800, Hao Xu wrote:
On 6/21/22 15:03, Dylan Yudaken wrote:
On Tue, 2022-06-21 at 13:10 +0800, Hao Xu wrote:
On 6/21/22 00:18, Dylan Yudaken wrote:
Task work currently uses a spin lock to guard task_list and task_running. Some use cases such as networking can trigger task_work_add from multiple threads all at once, which suffers from contention here.

This can be changed to use a lockless list, which seems to have better performance. Running the micro benchmark in [1] I see a 20% improvement in multithreaded task work add. It required removing the priority tw list optimisation; however, it isn't clear how important that optimisation is. Additionally, its semantics are fairly easy to break.

Patches 1-2 remove the priority tw list optimisation
Patches 3-5 add lockless lists for task work
Patch 6 fixes a bug I noticed in io_uring event tracing
Patches 7-8 add tracing for task_work_run


Compared to the spinlock overhead, the prio task list optimization is definitely unimportant, so I agree with removing it here. Replacing the task list with an llist was something I considered, but I gave it up since it changes the list to a stack, which means we have to handle the tasks in reverse order. This may affect the latency; do you have some numbers for it, like avg and 99%/95% latency?
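
For reference, here is a minimal sketch of the llist pattern in question, built on the kernel's generic llist API; the struct and function names are illustrative only and this is not the actual patch. It shows the point above: llist_del_all() hands the entries back newest-first, so an extra llist_reverse_order() pass is needed to recover submission order.

#include <linux/llist.h>

/* Hypothetical work item, for illustration only. */
struct tw_item {
        struct llist_node node;
        void (*run)(struct tw_item *item);
};

static LLIST_HEAD(tw_list);

/*
 * Producer side: a single lock-free push (a cmpxchg loop inside
 * llist_add()).  Returns true if the list was previously empty,
 * which is the natural point to notify the task.
 */
static bool tw_queue(struct tw_item *item)
{
        return llist_add(&item->node, &tw_list);
}

/* Consumer side: detach the whole list at once, then run it. */
static void tw_run_all(void)
{
        struct llist_node *batch = llist_del_all(&tw_list);
        struct tw_item *item, *tmp;

        /*
         * The detached list is effectively a stack (LIFO); reverse it
         * once so items run in the order they were queued.
         */
        batch = llist_reverse_order(batch);

        llist_for_each_entry_safe(item, tmp, batch, node)
                item->run(item);
}

The reverse pass is O(n) over each batch, which is the extra cost being weighed here against dropping the spinlock.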


Do you have an idea for how to test that? I used a microbenchmark as well as a network benchmark [1] to verify that overall throughput is higher. TW latency sounds a lot more complicated to measure as it's difficult to trigger accurately.

My feeling is that with reasonable batching (say 8-16 items) the latency will be low as TW is generally very quick, but if you have an idea for benchmarking I can take a look.

[1]: https://github.com/DylanZA/netbench

It can be normal IO requests, I think. We can test the latency with fio doing small-size IO to a fast block device (like nvme) in SQPOLL mode (since for non-SQPOLL it doesn't make a difference). This way we can see the influence of the reverse-order handling.

Regards,
Hao

I see little difference locally, but there is quite a big stdev so it's possible my test setup is a bit wonky.

new:
      clat (msec): min=2027, max=10544, avg=6347.10, stdev=2458.20
       lat (nsec): min=1440, max=16719k, avg=119714.72, stdev=153571.49
old:
      clat (msec): min=2738, max=10550, avg=6700.68, stdev=2251.77
       lat (nsec): min=1278, max=16610k, avg=121025.73, stdev=211896.14


Hi Dylan,

Could you post the arguments you used and the 99%/95% latency as well?

Regards,
Hao


One thing I'm worried about is that under heavy workloads there are TWs coming in continuously, so the TWs at the end of the TW list don't get a chance to run, which leads to their latency becoming high.

Pavel mentioned I should change some arguments, so I reran it. I'll just post all the output below as I'm not sure exactly what you are looking for. Note I checked that it was definitely batching, and it is doing batches of 10-20 in tctx_task_work.
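
As a rough illustration of how such a batch size can be observed: one llist_del_all() per invocation gives a natural batch, and counting the entries is enough to feed a pr_debug or a tracepoint. The sketch below uses made-up names (tw_item, tw_run_batch), as in the earlier sketch, and is not the actual series.

#include <linux/llist.h>

/* Hypothetical work item, as in the earlier sketch. */
struct tw_item {
        struct llist_node node;
        void (*run)(struct tw_item *item);
};

/*
 * Run one batch and report its size.  Under steady load the returned
 * count is what should land in the 10-20 range mentioned above.
 */
static unsigned int tw_run_batch(struct llist_head *list)
{
        struct llist_node *batch = llist_del_all(list);
        struct tw_item *item, *tmp;
        unsigned int count = 0;

        /* Restore submission (FIFO) order before running. */
        batch = llist_reverse_order(batch);

        llist_for_each_entry_safe(item, tmp, batch, node) {
                item->run(item);
                count++;
        }

        return count;
}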


*** config ***

[global]
ioengine=io_uring
sqthread_poll=1
registerfiles=1
fixedbufs=1
hipri=0
thread=1
direct=0
rw=randread
time_based=1
runtime=600
ramp_time=30
randrepeat=0
group_reporting=0
sqthread_poll_cpu=15
iodepth=32

[job0]
filename=/dev/nullb0
cpus_allowed=1
bs=512

*** new ***
job0: (g=0): rw=randread, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=io_uring, iodepth=32
fio-3.30-59-gd4bf5
Starting 1 thread
Jobs: 1 (f=0): [f(1)][100.0%][r=360MiB/s][r=738k IOPS][eta 00m:00s]
job0: (groupid=0, jobs=1): err= 0: pid=37255: Wed Jun 22 03:44:23 2022
   read: IOPS=596k, BW=291MiB/s (305MB/s)(171GiB/600001msec)
     clat (msec): min=30343, max=630343, avg=369885.75, stdev=164921.26
      lat (usec): min=14, max=1802, avg=53.23, stdev=18.84
     clat percentiles (msec):
      |  1.00th=[17113],  5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
      | 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
      | 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
      | 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
      | 99.99th=[17113]
    bw (  KiB/s): min=169237, max=381603, per=100.00%, avg=298171.22, stdev=70580.65, samples=1199
    iops        : min=338474, max=763206, avg=596342.60, stdev=141161.31, samples=1199
   lat (msec)   : >=2000=100.00%
   cpu          : usr=99.98%, sys=0.00%, ctx=4378, majf=0, minf=9
   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
      issued rwts: total=357661967,0,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
    READ: bw=291MiB/s (305MB/s), 291MiB/s-291MiB/s (305MB/s-305MB/s), io=171GiB (183GB), run=600001-600001msec

Disk stats (read/write):
   nullb0: ios=72127555/0, merge=11/0, ticks=1396298/0, in_queue=1396298, util=100.00%
*** old ***

job0: (g=0): rw=randread, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=io_uring, iodepth=32
fio-3.30-59-gd4bf5
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=367MiB/s][r=751k IOPS][eta 00m:00s]
job0: (groupid=0, jobs=1): err= 0: pid=19216: Wed Jun 22 04:43:36 2022
   read: IOPS=609k, BW=297MiB/s (312MB/s)(174GiB/600001msec)
     clat (msec): min=30333, max=630333, avg=368961.53, stdev=164532.01
      lat (usec): min=14, max=5830, avg=52.11, stdev=18.64
     clat percentiles (msec):
      |  1.00th=[17113],  5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
      | 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
      | 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
      | 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
      | 99.99th=[17113]
    bw (  KiB/s): min=170273, max=386932, per=100.00%, avg=304548.39, stdev=70732.20, samples=1200
    iops        : min=340547, max=773864, avg=609096.94, stdev=141464.41, samples=1200
   lat (msec)   : >=2000=100.00%
   cpu          : usr=99.98%, sys=0.00%, ctx=3912, majf=0, minf=5
   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
      issued rwts: total=365258392,0,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
    READ: bw=297MiB/s (312MB/s), 297MiB/s-297MiB/s (312MB/s-312MB/s), io=174GiB (187GB), run=600001-600001msec

Disk stats (read/write):
   nullb0: ios=69031421/0, merge=1/0, ticks=1323086/0, in_queue=1323086, util=100.00%


Ok, the clat percentiles seem meaningless here... from the min, max and avg data it should be fine. Thanks for testing.


