Re: [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance

On 6/4/24 19:57, Jens Axboe wrote:
On 6/3/24 7:53 AM, Pavel Begunkov wrote:
On 5/30/24 16:23, Jens Axboe wrote:
Hi,

For v1 and replies to that and tons of perf measurements, go here:

I'd really prefer the task_work version rather than carving
yet another path specific to msg_ring. Perf might sound better,
but it's duplicating wake up paths, it's not integrated with batch
waiting, it's not clear how it affects different workloads with target
locking, and it would work weirdly in terms of ordering.

The duplication is really minor, basically non-existent imho. It's a
wakeup call, it's literally 2 lines of code. I do agree on the batching,

Well, v3 tries to add msg_ring/nr_overflow handling to local
task work, that's what I mean by duplicating paths, and we'll
continue gutting the hot path for supporting msg_ring in
this way.

Does it work with eventfd? I can't find any handling, so next
you'd be adding:

io_commit_cqring_flush(ctx);

Likely draining around cq_extra should also be patched.
Yes, fixable, but it'll be a pile of fun, and without many
users, it'll take time to discover it all.

though I don't think that's really a big concern as most usage I'd
expect from this would be sending single messages. You're not batch
waiting on those. But there could obviously be cases where you have a
lot of mixed traffic, and for those it would make sense to have the
batch wakeups.
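
To make the batch wakeup point concrete, here is a minimal userspace sketch, assuming a single waiter that asked for a fixed number of completions. It is not kernel code: struct model_ring, model_post_cqe() and all field names are hypothetical and only model the idea that a poster wakes the waiter once the requested batch is complete, rather than once per message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a ring with one waiter; not io_uring code. */
struct model_ring {
	size_t cq_ready;  /* completions posted but not yet reaped */
	size_t wait_nr;   /* completions the waiter asked for; 0 means nobody waits */
	size_t wakeups;   /* wakeups issued so far */
};

/*
 * Post one completion. Returns true only when this completion fills the
 * waiter's batch, which is the single point where a wakeup would be issued.
 */
static bool model_post_cqe(struct model_ring *r)
{
	assert(r != NULL);

	r->cq_ready++;
	if (r->wait_nr == 0 || r->cq_ready < r->wait_nr)
		return false;

	r->wait_nr = 0;  /* the waiter is satisfied, so stop waking it */
	r->wakeups++;
	return true;
}
```

With wait_nr set to 4, the first three posts return false and only the fourth reports a wakeup. A sender delivering single messages to a waiter with wait_nr of 1 gets a wakeup on every post, which matches the "single messages" case described above.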

What I do like with this version is that we end up with just one method
for delivering the CQE, rather than needing to split it into two. And it
gets rid of the uring_lock double locking for non-SINGLE_ISSUER. I know

You can't get rid of target locking for fd passing, the file tables
are sync'ed by the lock. Otherwise it's only IOPOLL, because with
normal rings it can and IIRC does take the completion_lock for CQE
posting. I don't see a problem here, unless you care that much about
IOPOLL?

we always try and push people towards DEFER_TASKRUN|SINGLE_ISSUER, but
that doesn't mean we should just ignore the cases where that isn't true.
Unifying that code and making it faster all around is a worthy goal in
and of itself. The code is CERTAINLY a lot cleaner after the change than
all the IOPOLL etc.

If the swing back is that expensive, another option is to
allocate a new request and let the target ring deallocate
it once the message is delivered (similar to that overflow
entry).

I can give it a shot, and then run some testing. If we get close enough
with the latencies and performance, then I'd certainly be more amenable
to going either route.

We'd definitely need to pass in the required memory and avoid the return

Right, same as with CQEs

round trip, as that basically doubles the cost (and latency) of sending

Sender's latency, which is IMHO not important at all

a message. The downside of what you suggest here is that while that
should integrate nicely with existing local task_work, it'll also mean
that we'll need hot path checks for treating that request type as a
special thing. Things like req->ctx being not local, freeing the request
rather than recycling, etc. And that'll need to happen in multiple
spots.

I'm not suggesting feeding that request into flush_completions()
and the common completion infra; it can be killed right in the tw callback.
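
As a rough illustration of the "allocate on the sender, free on the target" idea, here is a single-threaded userspace sketch. Everything in it is an assumption made for illustration: struct model_msg, struct model_target, model_send() and model_run_work() are hypothetical names, the pending list is a plain pointer rather than an atomic list, and "posting a CQE" is reduced to recording the last values seen. It only shows the ownership handoff: the sender never hears back, and the callback on the target side frees the node itself without going through any shared completion path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node standing in for the request allocated by the sender. */
struct model_msg {
	struct model_msg *next;
	uint64_t user_data;
	int32_t res;
};

/* Hypothetical target ring: a pending work list plus a record of posted results. */
struct model_target {
	struct model_msg *work;   /* pending messages, newest first */
	uint64_t last_user_data;  /* user_data of the most recently posted message */
	int32_t last_res;         /* res of the most recently posted message */
	unsigned int posted;      /* number of completions posted */
};

/* Sender side: allocate and queue, with no round trip back to the sender. */
static int model_send(struct model_target *t, uint64_t user_data, int32_t res)
{
	struct model_msg *m;

	assert(t != NULL);
	m = malloc(sizeof(*m));
	if (!m)
		return -1;

	m->user_data = user_data;
	m->res = res;
	m->next = t->work;  /* a real multi-threaded version would need an atomic push */
	t->work = m;
	return 0;
}

/* Target side: take the whole list, post in submission order, free each node here. */
static unsigned int model_run_work(struct model_target *t)
{
	struct model_msg *m, *next, *ordered = NULL;
	unsigned int n = 0;

	assert(t != NULL);
	m = t->work;
	t->work = NULL;

	/* The list is newest first; reverse it to restore submission order. */
	while (m) {
		next = m->next;
		m->next = ordered;
		ordered = m;
		m = next;
	}

	while (ordered) {
		next = ordered->next;
		t->last_user_data = ordered->user_data;
		t->last_res = ordered->res;
		t->posted++;
		free(ordered);  /* the target owns the node and kills it in the callback */
		ordered = next;
		n++;
	}
	return n;
}
```

The sender's cost in this model is one allocation and one list push. The memory for the node still has to come from somewhere on the sending side, which is the "pass in the required memory" point raised earlier in the thread.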

--
Pavel Begunkov



