Re: [RFC 0/2] optimise local-tw task resheduling

Pavel Begunkov <asml.silence@xxxxxxxxx> · Mon, 13 Mar 2023 03:45:43 +0000

On 3/12/23 15:30, Jens Axboe wrote:
On 3/11/23 1:45?PM, Pavel Begunkov wrote:
On 3/11/23 17:24, Jens Axboe wrote:
On 3/10/23 12:04?PM, Pavel Begunkov wrote:
io_uring extensively uses task_work, but when a task is waiting
for multiple CQEs it causes lots of rescheduling. This series
is an attempt to optimise it and be a base for future improvements.

For some zc network tests eventually waiting for a portion of
buffers I've got 10x descrease in the number of context switches,
which reduced the CPU consumption more than twice (17% -> 8%).
It also helps storage cases, while running fio/t/io_uring against
a low performant drive it got 2x descrease of the number of context
switches for QD8 and ~4 times for QD32.

Not for inclusion yet, I want to add an optimisation for when
waiting for 1 CQE.

Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for
that, and I see context rates of around 8.1-8.3M/sec with the current
kernel.

Applied the two patches, but didn't see much of a change? Performance is
about the same, and cx rate ditto. Confused... As you probably know,
this test waits for 32 ios at the time.

If I'd to guess it already has perfect batching, for which case
the patch does nothing. Maybe it's due to SSD coalescing +
small ro I/O + consistency and small latencies of Optanes,
or might be on the scheduling and the kernel side to be slow
to react.

I was looking at trace_io_uring_local_work_run() while testing,
It's always should be @loop=QD (i.e. 32) for the patch, but
the guess is it's also 32 with that setup but without patches.

It very well could be that it's just loaded enough that we get perfect
batching anyway. I'd need to reuse some of your tracing to know for
sure.

I used existing trace points. If you see a pattern

trace_io_uring_local_work_run()
trace_io_uring_cqring_wait(@count=32)

trace_io_uring_local_work_run()
trace_io_uring_cqring_wait(@count=32)

...

that would mean a perfect batching. Even more so
if @loops=1

Didn't take a closer look just yet, but I grok the concept. One
immediate thing I'd want to change is the FACILE part of it. Let's call
it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT?

I don't really care, will change, but let me also ask why?
They're more or less synonyms, though facile is much less
popular. Is that your reasoning?

Yep, it's not very common and the name should be self-explanatory
immediately for most people.

That's exactly the problem. Someone will think that it's
like normal tw but "better" and blindly apply it. Same happened
before with priority tw lists.

--
Pavel Begunkov