On Wed, Oct 30, 2024 at 07:20:48AM -0600, Jens Axboe wrote: > On 10/29/24 10:11 PM, Ming Lei wrote: > > On Wed, Oct 30, 2024 at 11:08:16AM +0800, Ming Lei wrote: > >> On Tue, Oct 29, 2024 at 08:43:39PM -0600, Jens Axboe wrote: > > > > ... > > > >>> You could avoid the OP dependency with just a flag, if you really wanted > >>> to. But I'm not sure it makes a lot of sense. And it's a hell of a lot > >> > >> Yes, IO_LINK won't work for submitting multiple IOs concurrently, extra > >> syscall makes application too complicated, and IO latency is increased. > >> > >>> simpler than the sqe group scheme, which I'm a bit worried about as it's > >>> a bit complicated in how deep it needs to go in the code. This one > >>> stands alone, so I'd strongly encourage we pursue this a bit further and > >>> iron out the kinks. Maybe it won't work in the end, I don't know, but it > >>> seems pretty promising and it's soooo much simpler. > >> > >> If buffer register and lookup are always done in ->prep(), OP dependency > >> may be avoided. > > > > Even all buffer register and lookup are done in ->prep(), OP dependency > > still can't be avoided completely, such as: > > > > 1) two local buffers for sending to two sockets > > > > 2) group 1: IORING_OP_LOCAL_KBUF1 & [send(sock1), send(sock2)] > > > > 3) group 2: IORING_OP_LOCAL_KBUF2 & [send(sock1), send(sock2)] > > > > group 1 and group 2 needs to be linked, but inside each group, the two > > sends may be submitted in parallel. > > That is where groups of course work, in that you can submit 2 groups and > have each member inside each group run independently. But I do think we > need to decouple the local buffer and group concepts entirely. For the > first step, getting local buffers working with zero copy would be ideal, > and then just live with the fact that group 1 needs to be submitted > first and group 2 once the first ones are done. IMHO, it is one _kernel_ zero copy(_performance_) feature, which often imply: - better performance expectation - no big change on existed application for using this feature Application developer is less interested in sort of crippled or immature feature, especially need big change on existed code logic(then two code paths need to maintain), with potential performance regression. With sqe group and REQ_F_GROUP_KBUF, application just needs lines of code change for using the feature, and it is pretty easy to evaluate the feature since no any extra logic change & no extra syscall/wait introduced. The whole patchset has been mature enough, unfortunately blocked without obvious reasons. > > Once local buffers are done, we can look at doing the sqe grouping in a > nice way. I do think it's a potentially powerful concept, but we're > going to make a lot more progress on this issue if we carefully separate > dependencies and get each of them done separately. One fundamental difference between local buffer and REQ_F_GROUP_KBUF is - local buffer has to be provided and used in ->prep() - REQ_F_GROUP_KBUF needs to be provided in ->issue() instead of ->prep() The only common code could be one buffer abstraction for OP to use, but still used differently, ->prep() vs. ->issue(). So it is hard to call it decouple, especially REQ_F_GROUP_KBUF has been simple enough, and the main change is to import it in OP code. Local buffer is one smart idea, but I hope the following things may be settled first: 1) is it generic enough to just allow to provide local buffer during ->prep()? - this way actually becomes sync & nowait IO, instead AIO, and has been one strong constraint from UAPI viewpoint. - Driver may need to wait until some data comes, then return & provide the buffer with data, and local buffer can't cover this case 2) is it allowed to call ->uring_cmd() from io_uring_cmd_prep()? If not, any idea to call into driver for leasing the kernel buffer to io_uring? 3) in OP code, how to differentiate normal userspace buffer select with local buffer? And how does OP know if normal buffer select or local kernel buffer should be used? Some OP may want to use normal buffer select instead of local buffer, others may want to use local buffer. 4) arbitrary numbers of local buffer needs to be supported, since IO often comes at batch, it shouldn't be hard to support it by adding xarray to submission state, what do you think of this added complexity? Without supporting arbitrary number of local buffers, performance can be just bad, it doesn't make sense as zc viewpoint. Meantime as number of local buffer is increased, more rsrc_node & imu allocation is introduced, this still may degrade perf a bit. 5) io_rsrc_node becomes part of interface between io_uring and driver for releasing the leased buffer, so extra data has to be added to `io_rsrc_node` for driver use. However, from above, the following can be concluded at least: - it isn't generic enough, #1, #3 - it still need sqe group - it is much more complicated than REQ_F_GROUP_KBUF only - it can't be more efficient Thanks, Ming