On 3/9/22 8:48 PM, Artyom Pavlov wrote:
>> OK, so what you're asking is to be able to submit an sqe to ring1, but
>> have the completion show up in ring2? With the idea being that the rings
>> are setup so that you're basing this on which thread should ultimately
>> process the request when it completes, which is why you want it to
>> target another ring?
>
> Yes, to both questions.
>
>> 1) It's a fast path code addition to every request, we'd need to check
>> some new field (sqe->completion_ring_fd) and then also grab a
>> reference to that file for use at completion time.
>
> Since migration of tasks will be relatively rare, the relevant branch
> could be marked as cold and such branch should be relatively easy for
> CPU branch predictor. So I don't think we will see a measurable
> performance regression for the common case.

It's not the branches I'm worried about, it's the growing of the request
to accommodate it, and the need to bring in another fd for this. We're
not just talking about a single branch here: a request being tied to a
specific ring is a pretty core foundation of the internal code, and it
would require massive refactoring to disconnect those two. We have a lot
of optimizations in place to handle completions efficiently as it is.

But I guess I'm still a bit confused on what this will buy us. The
request is still being executed on the first ring (and hence the thread
associated with it); with the suggested approach, the only thing you'd
gain is the completion going somewhere else. Is this purely about the
post-processing that happens when a completion is posted to a given
ring?

>> 2) Completions are protected by the completion lock, and it isn't
>> trivial to nest these. What happens if ring1 submits an sqe with
>> ring2 as the cqe target, and ring2 submits an sqe with ring1 as the
>> cqe target? We can't safely nest these, as we could easily introduce
>> deadlocks that way.
>
> I thought a better approach will be to copy SQE from ring1 into ring2
> internal buffer and execute it as usual (IIUC kernel copies SQEs first
> before processing them). I am not familiar with internals of io-uring
> implementation, so I can not give any practical proposals.

That's certainly possible, but what does it buy us? Why not just issue
it on ring2 to begin with? Issue on a given ring is serialized anyway,
by an internal mutex.

>> My knee jerk reaction is that it'd be both simpler and cheaper to
>> implement this in userspace... Unless there's an elegant solution to it,
>> which I don't immediately see.
>
> Yes, as I said in the initial post, it's certainly possible to do it
> in user-space. But I think it's a quite common problem, so it could
> warrant including a built-in solution into io-uring API. Also it could
> be a bit more efficient to do in kernel space, e.g. you would not need
> mutexes, which in the worst case may involve parking and unparking
> threads, thus stalling event loop.

I'm all for having solutions for common problems, but it has to make
sense to do so. liburing has some state in the ring structure which
makes it hard to share, but for the raw interface, there's really no
concern there outside of needing to serialize access to writing the sq
ring head and the sqe entries. The kernel doesn't really care, though
you don't want two threads entering the kernel for the same ring, as one
of them would simply be blocked until the other is done.
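Purely as an illustrative sketch of that userspace route (the
submit_read_locked() helper and its arguments are made up here, and it
assumes liburing with one external lock covering the whole
get_sqe/prep/submit path so only one thread writes the sq ring or enters
the kernel at a time):

#include <errno.h>
#include <pthread.h>
#include <liburing.h>

static pthread_mutex_t sq_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialize sqe setup and kernel entry for a ring shared by threads. */
static int submit_read_locked(struct io_uring *ring, int fd, void *buf,
                              unsigned len, void *user_data)
{
	struct io_uring_sqe *sqe;
	int ret;

	pthread_mutex_lock(&sq_lock);
	sqe = io_uring_get_sqe(ring);
	if (!sqe) {
		pthread_mutex_unlock(&sq_lock);
		return -EBUSY;
	}
	io_uring_prep_read(sqe, fd, buf, len, 0);
	io_uring_sqe_set_data(sqe, user_data);
	ret = io_uring_submit(ring);
	pthread_mutex_unlock(&sq_lock);
	return ret;
}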
>> The submitting task is the owner of the request, and will ultimately
>> be the one that ends up running eg task_work associated with the
>> request. It's not really a good way to shift work from one ring to
>> another, if the setup is such that the rings are tied to a thread and
>> the threads are in turn mostly tied to a CPU or group of CPUs.
>
> I am not sure I understand your point here. In my understanding, the
> common approach for using io-uring is to keep in user_data a pointer
> to FSM (Finite State Machine) state together with pointer to a
> function used to drive FSM further after CQE is received
> (alternatively, instead of the function pointer a jump table could be
> used).
>
> Usually, it does not matter much on which thread FSM will be driven
> since FSM state is kept on the heap. Yes, it may not be great from CPU
> cache point of view, but it's better than having unbalanced thread
> load.

OK, so it is just about the post-processing. These are key questions to
answer, because they'll help drive what the best solution is here. How
did the original thread end up with the work to begin with? Was the
workload evenly distributed at that point, but later conditions (before
it gets issued) mean that the situation has now changed and we'd prefer
to execute it somewhere else?

I'll mull over this a bit...

-- 
Jens Axboe
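For reference, the user_data-as-FSM-state pattern described in the
quoted text looks roughly like the sketch below. The op_state struct and
the drive/on_read_done/reap_one names are invented for illustration, not
taken from any existing code:

#include <errno.h>
#include <stdlib.h>
#include <liburing.h>

/* Per-operation state; a pointer to it rides in user_data, and drive()
 * advances the state machine once the CQE for this op arrives. */
struct op_state {
	void (*drive)(struct op_state *st, struct io_uring *ring, int res);
	int   fd;
	char  buf[4096];
};

static void on_read_done(struct op_state *st, struct io_uring *ring, int res)
{
	(void)ring;
	/* ...advance the FSM: issue the next sqe, or tear down... */
	if (res <= 0)
		free(st);
}

static int submit_read(struct io_uring *ring, int fd)
{
	struct op_state *st = malloc(sizeof(*st));
	struct io_uring_sqe *sqe;

	if (!st)
		return -ENOMEM;
	sqe = io_uring_get_sqe(ring);
	if (!sqe) {
		free(st);
		return -EBUSY;
	}
	st->drive = on_read_done;
	st->fd = fd;
	io_uring_prep_read(sqe, fd, st->buf, sizeof(st->buf), 0);
	io_uring_sqe_set_data(sqe, st);		/* FSM state in user_data */
	return io_uring_submit(ring);
}

static void reap_one(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;

	if (io_uring_wait_cqe(ring, &cqe) == 0) {
		struct op_state *st = io_uring_cqe_get_data(cqe);

		st->drive(st, ring, cqe->res);	/* whichever thread reaps can drive it */
		io_uring_cqe_seen(ring, cqe);
	}
}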