OK, so what you're asking is to be able to submit an sqe to ring1, but
have the completion show up in ring2? With the idea being that the rings
are set up so that you're basing the choice on which thread should
ultimately process the request when it completes, which is why you want
it to target another ring?
Yes, to both questions.
1) It's a fast-path code addition to every request: we'd need to check
some new field (sqe->completion_ring_fd) and then also grab a
reference to that file for use at completion time.
Since migration of tasks will be relatively rare, the relevant branch
could be marked as cold, and such a branch should be easy for the CPU's
branch predictor to handle. So I don't think we would see a measurable
performance regression in the common case.
2) Completions are protected by the completion lock, and it isn't
trivial to nest these. What happens if ring1 submits an sqe with
ring2 as the cqe target, and ring2 submits an sqe with ring1 as the
cqe target? We can't safely nest these, as we could easily introduce
deadlocks that way.
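To illustrate the ordering problem, here's a minimal AB-BA sketch with
plain pthread mutexes standing in for the two rings' completion locks.
This is not kernel code, just an untested illustration of why posting to
the other ring's completion side while holding your own lock can hang:

/*
 * Illustration only: two threads taking two locks in opposite order.
 * If completing on ring1 required ring1's lock and then ring2's lock
 * (and vice versa for ring2), this is the pattern that can deadlock.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ring1_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ring2_lock = PTHREAD_MUTEX_INITIALIZER;

static void *complete_on_ring2(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ring1_lock);   /* ring1 completion path...   */
    pthread_mutex_lock(&ring2_lock);   /* ...posting a CQE to ring2  */
    puts("ring1 -> ring2 done");
    pthread_mutex_unlock(&ring2_lock);
    pthread_mutex_unlock(&ring1_lock);
    return NULL;
}

static void *complete_on_ring1(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ring2_lock);   /* ring2 completion path...   */
    pthread_mutex_lock(&ring1_lock);   /* ...posting a CQE to ring1  */
    puts("ring2 -> ring1 done");
    pthread_mutex_unlock(&ring1_lock);
    pthread_mutex_unlock(&ring2_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, complete_on_ring2, NULL);
    pthread_create(&t2, NULL, complete_on_ring1, NULL);
    pthread_join(t1, NULL);  /* may never return once both threads hold their first lock */
    pthread_join(t2, NULL);
    return 0;
}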
I thought a better approach would be to copy the SQE from ring1 into
ring2's internal buffer and execute it as usual (IIUC the kernel copies
SQEs before processing them). I am not familiar with the internals of
the io-uring implementation, so I can't give any practical proposals.
My knee-jerk reaction is that it'd be both simpler and cheaper to
implement this in userspace... Unless there's an elegant solution to it,
which I don't immediately see.
Yes, as I said in the initial post, it's certainly possible to do it in
user-space. But I think it's quite a common problem, so it could warrant
including a built-in solution in the io-uring API. It could also be a bit
more efficient to do in kernel space, e.g. you would not need mutexes,
which in the worst case may involve parking and unparking threads, thus
stalling the event loop.
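For reference, the kind of userspace handoff I have in mind is sketched
below: the thread owning ring1 reaps a CQE meant for the thread owning
ring2, pushes the saved user_data onto a mutex-protected queue, and
signals an eventfd that ring2's loop polls or reads on. It's an untested
sketch, all names (handoff_queue etc.) are made up, error handling is
mostly elided, and a real version would probably forward the CQE's
res/flags along with user_data:

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct handoff_queue {
    pthread_mutex_t lock;   /* the mutex mentioned above */
    int             efd;    /* wakes the target ring's event loop */
    uint64_t       *items;  /* forwarded user_data values */
    size_t          len, cap;
};

static int handoff_init(struct handoff_queue *q)
{
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    q->efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    return q->efd < 0 ? -1 : 0;
}

/* Called by the thread that reaped the CQE (e.g. ring1's thread). */
static void handoff_push(struct handoff_queue *q, uint64_t user_data)
{
    uint64_t one = 1;

    pthread_mutex_lock(&q->lock);    /* may park/unpark under contention */
    if (q->len == q->cap) {
        q->cap = q->cap ? q->cap * 2 : 16;
        q->items = realloc(q->items, q->cap * sizeof(*q->items));
    }
    q->items[q->len++] = user_data;
    pthread_mutex_unlock(&q->lock);

    if (write(q->efd, &one, sizeof(one)) < 0) {
        /* ignored in this sketch */
    }
}

/*
 * Called by the target thread (e.g. ring2's loop) after its poll/read
 * on q->efd completes; it drains the queue and drives each FSM as if
 * the CQE had arrived on its own ring.
 */
static size_t handoff_drain(struct handoff_queue *q, uint64_t *out, size_t max)
{
    uint64_t cnt;
    size_t n;

    if (read(q->efd, &cnt, sizeof(cnt)) < 0) {
        /* nothing pending or spurious wakeup; ignored in this sketch */
    }

    pthread_mutex_lock(&q->lock);
    n = q->len < max ? q->len : max;
    memcpy(out, q->items, n * sizeof(*out));
    memmove(q->items, q->items + n, (q->len - n) * sizeof(*out));
    q->len -= n;
    pthread_mutex_unlock(&q->lock);
    return n;
}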
> The submitting task is the owner of the request, and will ultimately
> be the one that ends up running eg task_work associated with the
> request. It's not really a good way to shift work from one ring to
> another, if the setup is such that the rings are tied to a thread and
> the threads are in turn mostly tied to a CPU or group of CPUs.
I am not sure I understand your point here. In my understanding, the
common approach when using io-uring is to keep in user_data a pointer to
the FSM (finite state machine) state, together with a pointer to a
function used to drive the FSM further once a CQE is received
(alternatively, a jump table could be used instead of the function
pointer). Usually it does not matter much on which thread the FSM is
driven, since the FSM state is kept on the heap. Yes, it may not be
great from a CPU cache point of view, but it's better than having an
unbalanced thread load.
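For example, a minimal sketch of that pattern with liburing (untested;
struct fsm, fsm_step and the handler names are just illustrative):

#include <liburing.h>
#include <stdlib.h>

struct fsm;
typedef void (*fsm_step)(struct io_uring *ring, struct fsm *f, int res);

struct fsm {
    fsm_step step;     /* handler used to drive the FSM after a CQE */
    int      fd;       /* example per-request state, kept on the heap */
    char     buf[4096];
};

static void start_read(struct io_uring *ring, struct fsm *f);

static void on_read_done(struct io_uring *ring, struct fsm *f, int res)
{
    if (res > 0)
        start_read(ring, f);   /* drive the FSM: queue the next step */
    else
        free(f);               /* EOF or error: state machine is done */
}

static void start_read(struct io_uring *ring, struct fsm *f)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return;                /* SQ full; real code would handle this */
    f->step = on_read_done;
    io_uring_prep_read(sqe, f->fd, f->buf, sizeof(f->buf), 0);
    io_uring_sqe_set_data(sqe, f);   /* user_data = pointer to FSM state */
    /* caller issues io_uring_submit() once it has batched its SQEs */
}

/* Per-thread event loop: whichever thread reaps the CQE drives the FSM. */
static void reap_one(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    if (io_uring_wait_cqe(ring, &cqe) == 0) {
        struct fsm *f = io_uring_cqe_get_data(cqe);
        int res = cqe->res;

        io_uring_cqe_seen(ring, cqe);
        f->step(ring, f, res);       /* drive the FSM from user_data */
    }
}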