On Wed, Mar 30, 2022 at 02:22:20PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>
> > On Tue, Mar 29, 2022 at 01:20:57PM -0400, Gabriel Krisman Bertazi wrote:
> >> Ming Lei <ming.lei@xxxxxxxxxx> writes:
> >>
> >> >> I was thinking of something like this, or having a way for the server to
> >> >> only operate on the fds and do splice/sendfile. But, I don't know if it
> >> >> would be useful for many use cases. We also want to be able to send the
> >> >> data to userspace, for instance, for userspace networking.
> >> >
> >> > I understand the big point is that how to pass the io data to ubd driver's
> >> > request/bio pages. But splice/sendfile just transfers data between two FDs,
> >> > then how can the block request/bio's pages get filled with expected data?
> >> > Can you explain a bit in detail?
> >>
> >> Hi Ming,
> >>
> >> My idea was to split the control and dataplanes in different file
> >> descriptors.
> >>
> >> A queue has a fd that is mapped to a shared memory area where the
> >> request descriptors are. Submission/completion are done by read/writing
> >> the index of the request on the shared memory area.
> >>
> >> For the data plane, each request descriptor in the queue has an
> >> associated file descriptor to be used for data transfer, which is
> >> preallocated at queue creation time. I'm mapping the bio linearly, from
> >> offset 0, on these descriptors on .queue_rq(). Userspace operates on
> >> these data file descriptors with regular RW syscalls, direct splice to
> >> another fd or pipe, or mmap it to move data around. The data is
> >> available on that fd until IO is completed through the queue fd. After
> >> an operation is completed, the fds are reused for the next IO on that
> >> queue position.
> >>
> >> Hannes has pointed out the issues with fd limits. :)
> >
> > OK, thanks for the detailed explanation!
> >
> > Also you may switch to map each request queue/disk into a FD, and every
> > request is mapped to one fixed extent of the 'file' via rq->tag since we
> > have max sectors limit for each request, then fd limits can be avoided.
> >
> > But I am wondering if this way is friendly to userspace side implementation,
> > since there isn't buffer, only FDs visible to userspace.
>
> The advantages would be not mapping the request data in userspace if we
> could avoid it, since it would be possible to just forward the data
> inside the kernel. But my latest understanding is that most use cases
> will want to directly manipulate the data anyway, maybe to checksum, or
> even for sending through userspace networking. It is not clear to me
> anymore that we'd benefit from not always mapping the requests to
> userspace.

Yeah, I think it is more flexible and usable to allow userspace to
operate on the data directly as one generic solution, such as
implementing a disk that reads/writes a qcow2 image, or one that reads
from/writes to the network by parsing the protocol, or whatever else.

> I've been looking at your implementation and I really like how simple it
> is. I think it's the most promising approach for this feature I've
> reviewed so far. I'd like to send you a few patches for bugs I found
> when testing it and keep working on making it upstreamable. How can I
> send you those patches? Is it fine to just email you or should I also
> cc linux-block, even though this is yet out-of-tree code?

The topic has been discussed for quite a while now, and it looks like
people are still interested in it, so I prefer to send the patches out
on linux-block if no one objects.
Then we can continue the discussion when reviewing the patches.

Thanks,
Ming
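
---

For illustration only, here is a minimal userspace-side sketch of the
per-request data fd scheme Gabriel describes above (a queue fd backed by a
shared-memory descriptor ring, plus one preallocated data fd per queue slot
with the bio mapped linearly from offset 0). The descriptor layout, the op
encoding, and the names (struct ubd_desc, data_fd[], serve_one) are
assumptions made up for this sketch, not the interface of the out-of-tree
driver; error handling is omitted.

	/*
	 * Hypothetical per-request descriptor living in the shared-memory
	 * area that the queue fd maps; submission/completion indexes are
	 * exchanged through that same queue fd.
	 */
	#include <stdint.h>
	#include <unistd.h>

	struct ubd_desc {
		uint8_t  op;		/* assumed encoding: 0 = read, 1 = write */
		uint32_t len;		/* bytes to transfer */
		uint64_t sector;	/* starting sector on the virtual disk */
	};

	/*
	 * Serve the request sitting at queue slot 'idx'.  Its payload is
	 * exposed through a preallocated per-slot data fd, mapped linearly
	 * from offset 0, so the server can simply pread()/pwrite() it.
	 */
	void serve_one(struct ubd_desc *ring, const int *data_fd,
		       int backing_fd, unsigned int idx)
	{
		struct ubd_desc *d = &ring[idx];
		static char buf[1 << 20];	/* assumes len <= 1 MiB */

		if (d->op == 1) {
			/* WRITE: the payload is already visible on the data fd */
			pread(data_fd[idx], buf, d->len, 0);
			pwrite(backing_fd, buf, d->len, d->sector << 9);
		} else {
			/* READ: fill the data fd; the driver moves it to the bio */
			pread(backing_fd, buf, d->len, d->sector << 9);
			pwrite(data_fd[idx], buf, d->len, 0);
		}
		/* Completion would then be signalled by writing 'idx' back
		 * through the queue fd; that step is not spelled out in the
		 * thread, so it is left out here. */
	}

Because the data fd is a regular file descriptor, the pread()/pwrite() pair
above could in principle be replaced by splice()/sendfile() to forward the
data without copying it through userspace, which is the trade-off discussed
in the thread.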
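And a correspondingly small sketch of the single-fd variant mentioned above:
one fd per queue/disk, with each request pinned to a fixed, disjoint extent
of that "file" derived from its tag and the per-request max sectors limit,
so no per-request fds (and no fd-limit problem) are needed. The constant
and the function names are again assumptions for illustration only.

	#include <stdint.h>
	#include <unistd.h>

	#define MAX_IO_BYTES	(1U << 20)	/* assumed per-request size limit */

	/* Each tag owns the byte range [tag * MAX_IO_BYTES, (tag + 1) * MAX_IO_BYTES). */
	static inline off_t tag_to_offset(unsigned int tag)
	{
		return (off_t)tag * MAX_IO_BYTES;
	}

	/* Userspace side: fetch the payload of the request identified by 'tag'. */
	ssize_t read_request_payload(int disk_fd, unsigned int tag,
				     void *buf, size_t len)
	{
		return pread(disk_fd, buf, len, tag_to_offset(tag));
	}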