On 4/30/24 16:00, Ming Lei wrote:
On Tue, Apr 30, 2024 at 01:27:10PM +0100, Pavel Begunkov wrote:
...
And what does it achieve? The infra has matured since early days,
it saves user-kernel transitions at best but not context switching
overhead, and not even that if you do wait(1) and happen to catch
middle CQEs. And it disables LAZY_WAKE, so CQ side batching with
timers and what not is effectively useless with links.
Not only the context switch, it supports 1:N or N:M dependency which
I completely missed that; how is N:M supported? That's starting to sound
terrifying.
N:M is actually from Kevin's idea.
sqe group can be made to be more flexible by:
Inside the group, all SQEs are submitted in parallel, so there isn't any
dependency among SQEs in one group.
The 1st SQE is the group leader, and the other SQEs are group members. The
whole group shares a single IOSQE_IO_LINK and IOSQE_IO_DRAIN from the group
leader, and the two flags can't be set for group members.
When the group is in one link chain, this group isn't submitted until
the previous SQE or group is completed. And the following SQE or group
can't be started if this group isn't completed.
When IOSQE_IO_DRAIN is set for the group leader, all requests in this group
and previously submitted requests are drained. Given IOSQE_IO_DRAIN can
be set for the group leader only, we respect IO_DRAIN for an SQE group by
always completing the group leader as the last one in the group.
SQE group provides a flexible way to support N:M dependency, such as:
- group A is chained with group B together by IOSQE_IO_LINK
- group A has N SQEs
- group B has M SQEs
then M SQEs in group B depend on N SQEs in group A.
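
The N:M ordering described above can be sketched as a small userspace model. This is only an illustration of the dependency rule, not the kernel or liburing implementation: the struct and every function name below (sqe_group_model, group_init, group_start, group_complete_one, group_is_complete) are hypothetical and exist nowhere in io_uring. It assumes a group counts as completed once all of its requests, leader included, have completed, and that a linked group may start only after the previous group has completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one SQE group; not an io_uring structure. */
struct sqe_group_model {
    unsigned nr_reqs;   /* leader plus members */
    unsigned nr_done;   /* requests completed so far */
    bool started;       /* true once the group has been submitted */
};

static void group_init(struct sqe_group_model *g, unsigned nr_reqs)
{
    assert(nr_reqs > 0);    /* a group always has at least its leader */
    g->nr_reqs = nr_reqs;
    g->nr_done = 0;
    g->started = false;
}

/* A group is complete only when every request in it has completed. */
static bool group_is_complete(const struct sqe_group_model *g)
{
    return g->started && g->nr_done == g->nr_reqs;
}

/*
 * Start a group. 'prev' is the group linked in front of it (NULL if none).
 * Models the rule that a linked group isn't submitted until the previous
 * group is completed. Returns false if the group must keep waiting.
 */
static bool group_start(struct sqe_group_model *g,
                        const struct sqe_group_model *prev)
{
    if (prev != NULL && !group_is_complete(prev))
        return false;
    g->started = true;
    return true;
}

/*
 * Record one completion. Requests inside a group run in parallel, so the
 * model does not care which one finished. Returns false if the group has
 * not started or has nothing left to complete.
 */
static bool group_complete_one(struct sqe_group_model *g)
{
    if (!g->started || g->nr_done == g->nr_reqs)
        return false;
    g->nr_done++;
    return true;
}
```

With group A initialised to N requests and group B to M requests, group_start(&b, &a) keeps returning false until group_complete_one(&a) has been called N times, which is the sense in which the M SQEs of B depend on the N SQEs of A.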
is missing in io_uring, but also makes async application easier to write by
saving extra context switches, which just adds extra intermediate states for
application.
You're still executing requests (i.e. ->issue) primarily from the
submitter task context, they would still fly back to the task and
wake it up. You may save something by completing all of them
together via that refcounting, but you might just as well try to
batch CQ, which is a more generic issue. It's not clear what
context switches you save then.
Wrt. the above N:M example, one io_uring_enter() is enough, and
it can't be done in a single context switch without sqe group, please
see the liburing test code:
Do you mean doing all that in a single system call? The main
performance problem for io_uring is waiting, i.e. schedule()ing
the task out and in, that's what I meant by context switching.
--
Pavel Begunkov