On 12/18/21 06:57, Hao Xu wrote:
On 2021/12/18 3:33 AM, Pavel Begunkov wrote:
On 12/16/21 16:55, Hao Xu wrote:
On 2021/12/15 2:16 AM, Pavel Begunkov wrote:
On 12/14/21 16:53, Hao Xu wrote:
On 2021/12/14 11:21 PM, Pavel Begunkov wrote:
On 12/14/21 05:57, Hao Xu wrote:
This is just an incomplete proof of concept; I'm sending it early for
thoughts and suggestions.
We already have IOSQE_IO_LINK to describe a linear dependency
relationship between sqes. This patchset provides a new feature to
support DAG dependencies. For instance, 4 sqes have the following
relationship:
     --> 2 --
    /        \
1 ---         ---> 4
    \        /
     --> 3 --
IOSQE_IO_LINK serializes them as 1-->2-->3-->4, which unnecessarily
serializes 2 and 3. A DAG can describe the relationship exactly.
For detailed usage, see the messages of the following patches.
I tested it with 100 direct-read sqes, each reading one BS=4k block of
the same file; the blocks do not overlap. These sqes form a graph:
      2
      3
1 --> 4 --> 100
     ...
      99
This is an extreme case, just to show the idea.
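To make the difference concrete, here is a minimal userspace sketch (not part of the patchset and not an io_uring API) that counts how many dependent submission rounds a graph needs when every request whose dependencies have completed is issued together. It uses level-by-level Kahn topological ordering; the helper name count_waves, the MAX_NODES limit and the edge-array encoding are hypothetical choices for illustration, and it assumes each round fully completes before the next one starts, which is a simplification of a real CQE-driven scheduler.

```c
#include <assert.h>

#define MAX_NODES 128 /* hypothetical limit for this sketch */

/*
 * Level-by-level (Kahn-style) scheduling of a dependency DAG.
 * Nodes are numbered 1..n; edges[e] = {from, to} means "to" may only be
 * issued after "from" has completed. wave_of[v] receives the 0-based
 * round in which node v would be submitted (wave_of must hold n + 1 ints).
 * Returns the number of rounds, or -1 if the graph contains a cycle.
 */
static int count_waves(int n, int edges[][2], int m, int wave_of[])
{
    int indeg[MAX_NODES + 1] = {0};
    int ready[MAX_NODES], next[MAX_NODES];
    int nready = 0, done = 0, waves = 0;

    assert(n >= 1 && n <= MAX_NODES);
    for (int e = 0; e < m; e++)
        indeg[edges[e][1]]++;
    for (int v = 1; v <= n; v++)
        if (indeg[v] == 0)
            ready[nready++] = v;

    while (nready > 0) {
        int nnext = 0;
        /* Everything in ready[] could be submitted in one batch. */
        for (int i = 0; i < nready; i++) {
            int v = ready[i];
            wave_of[v] = waves;
            done++;
            /* "Completing" v releases dependents whose last dependency was v. */
            for (int e = 0; e < m; e++)
                if (edges[e][0] == v && --indeg[edges[e][1]] == 0)
                    next[nnext++] = edges[e][1];
        }
        for (int i = 0; i < nnext; i++)
            ready[i] = next[i];
        nready = nnext;
        waves++;
    }
    return done == n ? waves : -1;
}
```

For the diamond above, count_waves returns 3 (2 and 3 share round 1), while the same four requests expressed as a 1-->2-->3-->4 chain need 4 rounds. For the 100-request test graph (1 fanning out to 2..99, all of which feed 100) it returns 3, versus 100 rounds for a single link chain.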
Results below:
io_link:    IOPS: 15898251
graph_link: IOPS: 29325513
io_link:    IOPS: 16420361
graph_link: IOPS: 29585798
io_link:    IOPS: 18148820
graph_link: IOPS: 27932960
Hmm, what are we comparing here? IIUC,
"io_link" is a huge link of 100 requests. Around 15898251 IOPS
"graph_link" is a graph of diameter 3. Around 29585798 IOPS
Diameter 2 graph, my bad.
Is that right? If so, it'd be fairer to compare with a
similar graph-like scheduling on the userspace side.
The above test is meant more to show the disadvantage of LINK.
Oh yeah, links can be slow, especially when they kill potential
parallelism or need extra allocations for keeping state, like
READV and WRITEV.
But yes, it's better to test similar userspace scheduling: since
LINK is definitely not a good choice, I have to prove the graph stuff
beats userspace scheduling. Will test that soon. Thanks.
It would also be great if you could post the benchmark once
it's done.
I wrote a new test of nop sqes forming a full binary tree with (2^10)-1 nodes,
which I think is a more general case. It turns out the results are still not
stable and the kernel-side graph link is much slower. I'll try to optimize it.
That's expected, unfortunately. And without reacting to the results
of previous requests, it's hard to imagine it being useful. BPF may
have helped, e.g. not keeping an explicit graph but just generating
new requests from the kernel... But apparently even with that it's
hard to compete with just leaving it in userspace.
I tried excluding the memory allocation overhead; then it seems a bit better than the userspace graph.
For result delivery, I was thinking of attaching a BPF program to an sqe rather than creating
a standalone BPF-type sqe. Then we could have data flowing through the graph or link chain. But I
don't have a clear draft for it yet.
Oh, I dismissed this idea before. Even if it can be done in place without any
additional tw (consider recursion, and submit_state not being prepared for that),
it'll be a horror to maintain. And I also don't see it being flexible enough.
There is one idea from some folks that I have to implement, i.e. having a per-CQ
callback. It might be interesting to experiment with, but I don't see it being
viable in the long run.
Btw, is there any comparison data between the current io link feature and
userspace scheduling?
Don't remember. I'd try to look up the cover letter of the patches
implementing it; I believe there should be some numbers there and,
hopefully, a test description.
FWIW, before the io_uring mailing list was established, patches etc.
mostly went through the linux-block mailing list. The links are old, so
the patches might be there.
--
Pavel Begunkov