Re: [PATCH RFC for-next 0/8] io_uring: tw contention improvements

On 6/22/22 19:24, Hao Xu wrote:
On 6/22/22 19:16, Hao Xu wrote:
On 6/22/22 17:31, Dylan Yudaken wrote:
On Tue, 2022-06-21 at 15:34 +0800, Hao Xu wrote:
On 6/21/22 15:03, Dylan Yudaken wrote:
On Tue, 2022-06-21 at 13:10 +0800, Hao Xu wrote:
On 6/21/22 00:18, Dylan Yudaken wrote:
Task work currently uses a spin lock to guard task_list and
task_running. Some use cases such as networking can trigger
task_work_add from multiple threads all at once, which suffers
from contention here.

This can be changed to use a lockless list, which seems to have
better performance. Running the micro benchmark in [1] I see a 20%
improvement in multithreaded task work add. It required removing the
priority tw list optimisation; however, it isn't clear how important
that optimisation is. Additionally, its semantics are fairly easy to
break.

Patch 1-2 remove the priority tw list optimisation
Patch 3-5 add lockless lists for task work
Patch 6 fixes a bug I noticed in io_uring event tracing
Patch 7-8 adds tracing for task_work_run


Compared to the spinlock overhead, the prio task list optimization is
definitely unimportant, so I agree with removing it here.
Replacing the task list with an llist was something I considered, but
I gave it up since it changes the list to a stack, which means we have
to handle the tasks in reverse order. This may affect the latency; do
you have some numbers for it, like avg and 99%/95% lat?


Do you have an idea for how to test that? I used a microbenchmark as
well as a network benchmark [1] to verify that overall throughput is
higher. TW latency sounds a lot more complicated to measure as it's
difficult to trigger accurately.

My feeling is that with reasonable batching (say 8-16 items) the
latency will be low as TW is generally very quick, but if you have an
idea for benchmarking I can take a look.

[1]: https://github.com/DylanZA/netbench

It can be normal IO requests, I think. We can test the latency by
running fio with small-size IO to a fast block device (like NVMe) in
SQPOLL mode (since for non-SQPOLL, it makes no difference). This way
we can see the influence of reverse-order handling.
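
A fio job along the lines Hao suggests might look like the following;
the device path and sizes are placeholders, and the percentile options
are there so the 95%/99% latencies asked about later are reported:

```ini
; Illustrative fio job: small reads via io_uring in SQPOLL mode.
; /dev/nvme0n1 is a placeholder for a fast local block device.
[twlat]
ioengine=io_uring
sqthread_poll=1
filename=/dev/nvme0n1
direct=1
rw=randread
bs=512
iodepth=32
time_based
runtime=30
lat_percentiles=1
percentile_list=95:99
```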

Regards,
Hao

I see little difference locally, but there is quite a big stdev, so
it's possible my test setup is a bit wonky.

new:
     clat (msec): min=2027, max=10544, avg=6347.10, stdev=2458.20
      lat (nsec): min=1440, max=16719k, avg=119714.72, stdev=153571.49
old:
     clat (msec): min=2738, max=10550, avg=6700.68, stdev=2251.77
      lat (nsec): min=1278, max=16610k, avg=121025.73, stdev=211896.14


Hi Dylan,

Could you post the arguments you use and the 99%/95% latency as well?

Regards,
Hao


One thing I'm worrying about is that under heavy workloads, with
contiguous TWs coming in, the TWs at the end of the TW list don't get
the chance to run, which leads to high latency for those ones.

Ah, looking at the code again, it seems we take the whole list, not a
single node, each time, so it shouldn't be a big problem.



