It's not the branches I'm worried about, it's growing the request
to accommodate it, and the need to bring in another fd for this.
Maybe it's worth pre-registering the fds of rings to which we can send CQEs,
similarly to pre-registering file fds? That would allow us to use a u8 or
u16 instead of a u64 for identifying the recipient ring.
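For reference, this would mirror how file fds are pre-registered today with
liburing (the ring-fd variant itself is hypothetical; only the file version
below exists, and sock_fd/file_fd/ring are placeholders):

    int fds[2] = { sock_fd, file_fd };

    /* Existing mechanism: after this call, SQEs can refer to these files by
     * their small table index (with IOSQE_FIXED_FILE) instead of carrying
     * the raw fd. A registered table of ring fds could work the same way. */
    int ret = io_uring_register_files(&ring, fds, 2);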
But I guess I'm still a bit confused about what this will buy us. The
request is still being executed on the first ring (and hence the thread
associated with it), with the suggested approach here the only thing
you'd gain is the completion going somewhere else. Is this purely about
the post-processing that happens when a completion is posted to a given
ring?
As I wrote earlier, I am not familiar with the internals of the io_uring
implementation, so I am speaking purely from a user's point of view. I will
trust your judgment regarding implementation complexity.
I guess, from the user's PoV, it does not matter on which ring the SQE gets
executed. It can have certain performance implications, but otherwise, for
the user, it's simply an implementation detail.
How did the original thread end up with the work to begin with? Was the
workload evenly distributed at that point, but later conditions (before
it gets issued) mean that the situation has now changed and we'd prefer
to execute it somewhere else?
Let's talk about a concrete, simplified example. Imagine a server which
accepts commands over the network to compute the hash of a file at a given
path. The server executes the following algorithm (a rough sketch of the
file-hashing loop follows the list):
1) Accept connection
2) Read command
3) Open the file and create a hasher state
4) Read a chunk of data from the file
5) If the read data is not empty, update the hasher state and go to step 4,
else finalize the hasher
6) Return the resulting hash and go to step 2
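For concreteness, a rough liburing sketch of steps 4-6 for a single file
might look like this (the hasher is just a byte-sum stand-in for a real hash
state, and a real server would of course not block on every read like this):

    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Stand-in for a real incremental hasher (e.g. SHA-256 state). */
    struct hasher { unsigned long sum; };

    static void hash_update(struct hasher *h, const unsigned char *buf, int len)
    {
        for (int i = 0; i < len; i++)
            h->sum += buf[i];
    }

    static int hash_file(struct io_uring *ring, const char *path,
                         unsigned long *out)
    {
        unsigned char buf[64 * 1024];
        struct hasher h = { 0 };
        off_t off = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;

        for (;;) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            struct io_uring_cqe *cqe;
            int res;

            /* Step 4: queue a read for the next chunk and wait for it. */
            io_uring_prep_read(sqe, fd, buf, sizeof(buf), off);
            io_uring_submit(ring);
            io_uring_wait_cqe(ring, &cqe);
            res = cqe->res;
            io_uring_cqe_seen(ring, cqe);

            if (res < 0) {
                close(fd);
                return -1;
            }
            if (res == 0)                  /* Step 5: empty read -> finalize. */
                break;
            hash_update(&h, buf, res);     /* Step 5: non-empty -> update, loop. */
            off += res;
        }

        close(fd);
        *out = h.sum;                      /* Step 6: return the resulting hash. */
        return 0;
    }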
We have two places where we can balance load. First, after we accept a
connection, we must decide which ring will process it. Second, during
creation of the SQE for step 4, if the current thread is overloaded, we can
transfer the task to a different thread.
The problem is that we cannot predict how the kernel will return the read
chunks. Even if we distribute SQEs evenly across rings, it's possible that
the kernel will return CQEs for a single ring in a burst, thus overloading
it, while the other threads starve for events.
On second thought, it looks like your solution with
IORING_OP_WAKEUP_RING will have the following advantage: it will allow
us to migrate a task before execution of step 5 has started, while with my
proposal we would be able to migrate tasks only on SQE creation (i.e. at
step 4).
One idea... You issue the request as you normally would for ring1, and
you mark that request A with IOSQE_CQE_SKIP_SUCCESS. Then you link an
IORING_OP_WAKEUP_RING to request A, with the fd for it set to ring2, and
also mark that with IOSQE_CQE_SKIP_SUCCESS.
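To make that concrete, assuming ring1 is already set up, ring2_fd is the fd
of the second ring, and IORING_OP_WAKEUP_RING is the opcode proposed here
(so this is only a sketch against hypothetical headers; file_fd/buf/off are
whatever request A operates on), the chain would look roughly like:

    struct io_uring_sqe *sqe;

    /* Request A on ring1: suppress its CQE on success and link the next SQE. */
    sqe = io_uring_get_sqe(&ring1);
    io_uring_prep_read(sqe, file_fd, buf, sizeof(buf), off);
    sqe->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK;

    /* Linked wakeup: runs only once A completes, targets ring2 via its fd,
     * and is itself silent on ring1 thanks to IOSQE_CQE_SKIP_SUCCESS. */
    sqe = io_uring_get_sqe(&ring1);
    io_uring_prep_rw(IORING_OP_WAKEUP_RING, sqe, ring2_fd, NULL, 0, 0);
    sqe->flags |= IOSQE_CQE_SKIP_SUCCESS;

    io_uring_submit(&ring1);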
Looks interesting! I had forgotten about linking and IOSQE_CQE_SKIP_SUCCESS.