On Wed, Mar 30, 2022 at 02:22:20PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>
> > On Tue, Mar 29, 2022 at 01:20:57PM -0400, Gabriel Krisman Bertazi wrote:
> >> Ming Lei <ming.lei@xxxxxxxxxx> writes:
> >>
> >> >> I was thinking of something like this, or having a way for the server to
> >> >> only operate on the fds and do splice/sendfile. But, I don't know if it
> >> >> would be useful for many use cases. We also want to be able to send the
> >> >> data to userspace, for instance, for userspace networking.
> >> >
> >> > I understand the big point is that how to pass the io data to ubd driver's
> >> > request/bio pages. But splice/sendfile just transfers data between two FDs,
> >> > then how can the block request/bio's pages get filled with expected data?
> >> > Can you explain a bit in detail?
> >>
> >> Hi Ming,
> >>
> >> My idea was to split the control and dataplanes in different file
> >> descriptors.
> >>
> >> A queue has a fd that is mapped to a shared memory area where the
> >> request descriptors are. Submission/completion are done by read/writing
> >> the index of the request on the shared memory area.
> >>
> >> For the data plane, each request descriptor in the queue has an
> >> associated file descriptor to be used for data transfer, which is
> >> preallocated at queue creation time. I'm mapping the bio linearly, from
> >> offset 0, on these descriptors on .queue_rq(). Userspace operates on
> >> these data file descriptors with regular RW syscalls, direct splice to
> >> another fd or pipe, or mmap it to move data around. The data is
> >> available on that fd until IO is completed through the queue fd. After
> >> an operation is completed, the fds are reused for the next IO on that
> >> queue position.
> >>
> >> Hannes has pointed out the issues with fd limits. :)
> >
> > OK, thanks for the detailed explanation!
> >
> > Also you may switch to map each request queue/disk into a FD, and every
> > request is mapped to one fixed extent of the 'file' via rq->tag since we
> > have max sectors limit for each request, then fd limits can be avoided.
> >
> > But I am wondering if this way is friendly to userspace side implementation,
> > since there isn't buffer, only FDs visible to userspace.
>
> The advantages would be not mapping the request data in userspace if we
> could avoid it, since it would be possible to just forward the data
> inside the kernel. But my latest understanding is that most use cases
> will want to directly manipulate the data anyway, maybe to checksum, or
> even for sending through userspace networking. It is not clear to me
> anymore that we'd benefit from not always mapping the requests to
> userspace.

Yeah, I think it is more flexible and usable to allow userspace to
operate on the data directly as one generic solution, such as
implementing a disk that reads/writes a qcow2 image, or one that reads
from/writes to the network by parsing the protocol, or whatever else.

> I've been looking at your implementation and I really like how simple it
> is. I think it's the most promising approach for this feature I've
> reviewed so far. I'd like to send you a few patches for bugs I found
> when testing it and keep working on making it upstreamable. How can I
> send you those patches? Is it fine to just email you or should I also
> cc linux-block, even though this is yet out-of-tree code?

The topic has been discussed for quite a while now, and it looks like
people are still interested in it, so I prefer to send the patches out
on linux-block if no one objects.
Then we can continue the discussion when reviewing the patches.

Thanks,
Ming
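
---

For illustration only, here is a minimal userspace-side sketch of the
per-request data fd scheme Gabriel describes above (a queue fd backed by a
shared-memory descriptor ring, plus one preallocated data fd per queue slot
with the bio mapped linearly from offset 0). The descriptor layout, the op
encoding, and the names (struct ubd_desc, data_fd[], serve_one) are
assumptions made up for this sketch, not the interface of the out-of-tree
driver; error handling is omitted.

	/*
	 * Hypothetical per-request descriptor living in the shared-memory
	 * area that the queue fd maps; submission/completion indexes are
	 * exchanged through that same queue fd.
	 */
	#include <stdint.h>
	#include <unistd.h>

	struct ubd_desc {
		uint8_t  op;		/* assumed encoding: 0 = read, 1 = write */
		uint32_t len;		/* bytes to transfer */
		uint64_t sector;	/* starting sector on the virtual disk */
	};

	/*
	 * Serve the request sitting at queue slot 'idx'.  Its payload is
	 * exposed through a preallocated per-slot data fd, mapped linearly
	 * from offset 0, so the server can simply pread()/pwrite() it.
	 */
	void serve_one(struct ubd_desc *ring, const int *data_fd,
		       int backing_fd, unsigned int idx)
	{
		struct ubd_desc *d = &ring[idx];
		static char buf[1 << 20];	/* assumes len <= 1 MiB */

		if (d->op == 1) {
			/* WRITE: the payload is already visible on the data fd */
			pread(data_fd[idx], buf, d->len, 0);
			pwrite(backing_fd, buf, d->len, d->sector << 9);
		} else {
			/* READ: fill the data fd; the driver moves it to the bio */
			pread(backing_fd, buf, d->len, d->sector << 9);
			pwrite(data_fd[idx], buf, d->len, 0);
		}
		/* Completion would then be signalled by writing 'idx' back
		 * through the queue fd; that step is not spelled out in the
		 * thread, so it is left out here. */
	}

Because the data fd is a regular file descriptor, the pread()/pwrite() pair
above could in principle be replaced by splice()/sendfile() to forward the
data without copying it through userspace, which is the trade-off discussed
in the thread.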
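And a correspondingly small sketch of the single-fd variant mentioned above:
one fd per queue/disk, with each request pinned to a fixed, disjoint extent
of that "file" derived from its tag and the per-request max sectors limit,
so no per-request fds (and no fd-limit problem) are needed. The constant
and the function names are again assumptions for illustration only.

	#include <stdint.h>
	#include <unistd.h>

	#define MAX_IO_BYTES	(1U << 20)	/* assumed per-request size limit */

	/* Each tag owns the byte range [tag * MAX_IO_BYTES, (tag + 1) * MAX_IO_BYTES). */
	static inline off_t tag_to_offset(unsigned int tag)
	{
		return (off_t)tag * MAX_IO_BYTES;
	}

	/* Userspace side: fetch the payload of the request identified by 'tag'. */
	ssize_t read_request_payload(int disk_fd, unsigned int tag,
				     void *buf, size_t len)
	{
		return pread(disk_fd, buf, len, tag_to_offset(tag));
	}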