OK, so what you're asking is to be able to submit an sqe to ring1, but
have the completion show up in ring2? With the idea being that the rings
are set up so that you're basing the choice on which thread should
ultimately process the request when it completes, which is why you want
it to target another ring?
Yes, to both questions.
1) It's a fast-path code addition to every request: we'd need to check
some new field (sqe->completion_ring_fd) and then also grab a
reference to that file for use at completion time.
Since migration of tasks will be relatively rare, the relevant branch
could be marked as cold, and such a branch should be easy for the CPU's
branch predictor to handle. So I don't think we would see a measurable
performance regression in the common case.
2) Completions are protected by the completion lock, and it isn't
trivial to nest these. What happens if ring1 submits an sqe with
ring2 as the cqe target, and ring2 submits an sqe with ring1 as the
cqe target? We can't safely nest these, as we could easily introduce
deadlocks that way.
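To illustrate the ordering problem, here's a minimal AB-BA sketch with
plain pthread mutexes standing in for the two rings' completion locks.
This is not kernel code, just an untested illustration of why posting to
the other ring's completion side while holding your own lock can hang:

/*
 * Illustration only: two threads taking two locks in opposite order.
 * If completing on ring1 required ring1's lock and then ring2's lock
 * (and vice versa for ring2), this is the pattern that can deadlock.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ring1_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ring2_lock = PTHREAD_MUTEX_INITIALIZER;

static void *complete_on_ring2(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ring1_lock);   /* ring1 completion path...   */
    pthread_mutex_lock(&ring2_lock);   /* ...posting a CQE to ring2  */
    puts("ring1 -> ring2 done");
    pthread_mutex_unlock(&ring2_lock);
    pthread_mutex_unlock(&ring1_lock);
    return NULL;
}

static void *complete_on_ring1(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ring2_lock);   /* ring2 completion path...   */
    pthread_mutex_lock(&ring1_lock);   /* ...posting a CQE to ring1  */
    puts("ring2 -> ring1 done");
    pthread_mutex_unlock(&ring1_lock);
    pthread_mutex_unlock(&ring2_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, complete_on_ring2, NULL);
    pthread_create(&t2, NULL, complete_on_ring1, NULL);
    pthread_join(t1, NULL);  /* may never return once both threads hold their first lock */
    pthread_join(t2, NULL);
    return 0;
}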
I thought a better approach would be to copy the SQE from ring1 into
ring2's internal buffer and execute it as usual (IIUC the kernel copies
SQEs before processing them). I am not familiar with the internals of
the io-uring implementation, so I can't give any practical proposals.
My knee-jerk reaction is that it'd be both simpler and cheaper to
implement this in userspace... Unless there's an elegant solution to it,
which I don't immediately see.
Yes, as I said in the initial post, it's certainly possible to do it in
user-space. But I think it's quite a common problem, so it could warrant
including a built-in solution in the io-uring API. It could also be a bit
more efficient to do in kernel space, e.g. you would not need mutexes,
which in the worst case may involve parking and unparking threads, thus
stalling the event loop.
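For reference, the kind of userspace handoff I have in mind is sketched
below: the thread owning ring1 reaps a CQE meant for the thread owning
ring2, pushes the saved user_data onto a mutex-protected queue, and
signals an eventfd that ring2's loop polls or reads on. It's an untested
sketch, all names (handoff_queue etc.) are made up, error handling is
mostly elided, and a real version would probably forward the CQE's
res/flags along with user_data:

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct handoff_queue {
    pthread_mutex_t lock;   /* the mutex mentioned above */
    int             efd;    /* wakes the target ring's event loop */
    uint64_t       *items;  /* forwarded user_data values */
    size_t          len, cap;
};

static int handoff_init(struct handoff_queue *q)
{
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    q->efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    return q->efd < 0 ? -1 : 0;
}

/* Called by the thread that reaped the CQE (e.g. ring1's thread). */
static void handoff_push(struct handoff_queue *q, uint64_t user_data)
{
    uint64_t one = 1;

    pthread_mutex_lock(&q->lock);    /* may park/unpark under contention */
    if (q->len == q->cap) {
        q->cap = q->cap ? q->cap * 2 : 16;
        q->items = realloc(q->items, q->cap * sizeof(*q->items));
    }
    q->items[q->len++] = user_data;
    pthread_mutex_unlock(&q->lock);

    if (write(q->efd, &one, sizeof(one)) < 0) {
        /* ignored in this sketch */
    }
}

/*
 * Called by the target thread (e.g. ring2's loop) after its poll/read
 * on q->efd completes; it drains the queue and drives each FSM as if
 * the CQE had arrived on its own ring.
 */
static size_t handoff_drain(struct handoff_queue *q, uint64_t *out, size_t max)
{
    uint64_t cnt;
    size_t n;

    if (read(q->efd, &cnt, sizeof(cnt)) < 0) {
        /* nothing pending or spurious wakeup; ignored in this sketch */
    }

    pthread_mutex_lock(&q->lock);
    n = q->len < max ? q->len : max;
    memcpy(out, q->items, n * sizeof(*out));
    memmove(q->items, q->items + n, (q->len - n) * sizeof(*out));
    q->len -= n;
    pthread_mutex_unlock(&q->lock);
    return n;
}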
> The submitting task is the owner of the request, and will ultimately
> be the one that ends up running eg task_work associated with the
> request. It's not really a good way to shift work from one ring to
> another, if the setup is such that the rings are tied to a thread and
> the threads are in turn mostly tied to a CPU or group of CPUs.
I am not sure I understand your point here. In my understanding, the
common approach when using io-uring is to keep in user_data a pointer to
the FSM (finite state machine) state, together with a pointer to a
function used to drive the FSM further once a CQE is received
(alternatively, a jump table could be used instead of the function
pointer). Usually it does not matter much on which thread the FSM is
driven, since the FSM state is kept on the heap. Yes, it may not be
great from a CPU cache point of view, but it's better than having an
unbalanced thread load.
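For example, a minimal sketch of that pattern with liburing (untested;
struct fsm, fsm_step and the handler names are just illustrative):

#include <liburing.h>
#include <stdlib.h>

struct fsm;
typedef void (*fsm_step)(struct io_uring *ring, struct fsm *f, int res);

struct fsm {
    fsm_step step;     /* handler used to drive the FSM after a CQE */
    int      fd;       /* example per-request state, kept on the heap */
    char     buf[4096];
};

static void start_read(struct io_uring *ring, struct fsm *f);

static void on_read_done(struct io_uring *ring, struct fsm *f, int res)
{
    if (res > 0)
        start_read(ring, f);   /* drive the FSM: queue the next step */
    else
        free(f);               /* EOF or error: state machine is done */
}

static void start_read(struct io_uring *ring, struct fsm *f)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return;                /* SQ full; real code would handle this */
    f->step = on_read_done;
    io_uring_prep_read(sqe, f->fd, f->buf, sizeof(f->buf), 0);
    io_uring_sqe_set_data(sqe, f);   /* user_data = pointer to FSM state */
    /* caller issues io_uring_submit() once it has batched its SQEs */
}

/* Per-thread event loop: whichever thread reaps the CQE drives the FSM. */
static void reap_one(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    if (io_uring_wait_cqe(ring, &cqe) == 0) {
        struct fsm *f = io_uring_cqe_get_data(cqe);
        int res = cqe->res;

        io_uring_cqe_seen(ring, cqe);
        f->step(ring, f, res);       /* drive the FSM from user_data */
    }
}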