Re: [RFC PATCH] ubd: add io_uring based userspace block driver

Stefan Hajnoczi <stefanha@xxxxxxxxxx> · Tue, 17 May 2022 07:17:40 +0100

On Tue, May 17, 2022 at 09:57:56AM +0800, Ming Lei wrote:
> Hello Stefan,
> 
> On Mon, May 16, 2022 at 08:29:25PM +0100, Stefan Hajnoczi wrote:
> > Hi,
> > This looks interesting! I have some questions:
> 
> Thanks for your comment!
> 
> > 
> > 1. What is the ubdsrv permission model?
> > 
> > A big usability challenge for *-in-userspace interfaces is the balance
> > between security and allowing unprivileged processes to use these
> > features.
> > 
> > - Does /dev/ubd-control need to be privileged? I guess the answer is
> >   yes since an evil ubdsrv can hang I/O and corrupt data in hopes of
> >   triggering file system bugs.
> 
> Yes, I think so.
> 
> UBD should be in same position with NBD which does require
> capable(CAP_SYS_ADMIN).
> 
> > - Can multiple processes that don't trust each other use UBD at the same
> >   time? I guess not since ubd_index_idr is global.
> 
> Only single process can open /dev/ubdcN for communicating with ubd
> driver, see ubd_ch_open().

I was thinking about /dev/ubd-control:
Commands like UBD_CMD_STOP_DEV use ubd_find_device() to look up an
arbitrary device. This means there is no isolation of UBD devices
between processes.

It makes me wonder whether a design that isolates devices would be
better in the long-term. For example, UBD_CMD_ADD_DEV returns a new UBD
device control file descriptor that responds to control commands like
UBD_CMD_START_DEV/UBD_CMD_STOP_DEV. info->dev_id isn't necessary anymore
and doesn't provide access to arbitrary devices because the per-device
control file descriptor provides that information instead. You can still
grant another process access to the control file descriptor (e.g. UNIX
domain socket fd passing) but other processes cannot access your UBD
devices unless you allow it. I think it's worth thinking about this
approach because access is much more controlled.

> > - What about containers and namespaces? They currently have (write)
> >   access to the same global ubd_index_idr.
> 
> I understand contrainers/namespaces only need to see /dev/ubdbN, and
> the usage model should be same with kernel loop: the global ubd_index_idr
> is same with loop's loop_index_idr too.
> 
> Or can you explain a bit in detail if I misunderstood your point.

If containers only have access to /dev/ubdbN, how can a program running
inside a container create new UBD devices? I think it's worth supporting
use cases where containers create new UBD devices because more and more
software is being deployed in containers and it's not always possible to
rely on a non-containerized tool that sets up the environment for the
container.

> > - Maybe there should be a struct ubd_device "owner" (struct
> >   task_struct *) so only devices created by the current process can be
> >   modified?
> 
> I guess it isn't needed since /dev/ubdcN is opened by single process.

Yes, but the control device has the same problem as /dev/ubdcN. It
should be possible for multiple processes that are not cooperating to
control their own UBD devices.

> 
> > 
> > 2. io_uring_cmd design
> > 
> > The rationale for the io_uring_cmd design is not explained in the cover
> > letter. I think it's worth explaining the design. Here are my guesses:
> > 
> > The same thing can be achieved with just file_operations and io_uring.
> > ubdsrv could read I/O submissions with IORING_OP_READ and write I/O
> > completions with IORING_OP_WRITE. That would require 2 sqes per
> > roundtrip instead of 1, but the same number of io_uring_enter(2) calls
> > since multiple sqes/cqes can be batched per syscall:
> > 
> > - IORING_OP_READ, addr=(struct ubdsrv_io_desc*) (for submission)
> > - IORING_OP_WRITE, addr=(struct ubdsrv_io_cmd*) (for completion)
> > 
> > Both operations require a copy_to/from_user() to access the command
> > metadata.
> 
> Yes, but it can't be efficient as io_uring command.
> 
> Two OPs require two long code path for read and write which are supposed
> for handling fs io, so reusing complicated FS IO interface for sending
> command via cha dev is really overkill, and nvme passthrough has shown
> better IOPS than read/write interface with io_uring command, and extra
> copy_to/from_user() may fault with extra meta copy, which can slow down
> the ubd server.
> 
> Also for IORING_OP_READ, copy_to_user() has to be done in the ubq daemon
> context, even though that isn't a big deal, but with extra cost(cpu utilization)
> in the ubq deamon context or sleep for handling page fault, that is
> really what should be avoided, we need to save more CPU for handling user
> space IO logic in that context.
> 
> > 
> > The io_uring_cmd approach works differently. The IORING_OP_URING_CMD sqe
> > carries a 40-byte payload so it's possible to embed struct ubdsrv_io_cmd
> > inside it. The struct ubdsrv_io_desc mmap gets around the fact that
> > io_uring cqes contain no payload. The driver therefore needs a
> > side-channel to transfer the request submission details to ubdsrv. I
> > don't see much of a difference between IORING_OP_READ and the mmap
> > approach though.
> 
> At least the performance difference, ->uring_cmd() requires much less
> code path(single simple o_uring command) than read/write, without any copy
> on command data, without fault in copy_to/from_user(), without two long/
> complicated FS IO code path.
> 
> Single command of UBD_IO_COMMIT_AND_FETCH_REQ can handle both fetching
> io request desc and commit command result in one trip.
> 
> > 
> > It's not obvious to me how much more efficient the io_uring_cmd approach
> > is, but taking fewer trips around the io_uring submission/completion
> > code path is likely to be faster. Something similar can be done with
> > file_operations ->ioctl(), but I guess the point of using io_uring is
> > that is composes. If ubdsrv itself wants to use io_uring for other I/O
> > activity (e.g. networking, disk I/O, etc) then it can do so and won't be
> > stuck in a blocking ioctl() syscall.
> 
> ioctl can't be a choice, since we will lose the benefit of batching
> handling.
> 
> > 
> > It would be nice if you could write 2 or 3 paragraphs explaining why the
> > io_uring_cmd design and the struct ubdsrv_io_desc mmap was chosen.
> 
> Fine, I guess most are the above inline comment?

Yes, thanks for posting explanation! It would be great to have it in the
commit description or cover letter.

> > 
> > 3. Miscellaneous stuff
> > 
> > - There isn't much in the way of memory ordering in the code. I worry a
> >   little that changes to the struct ubdsrv_io_desc mmap may not be
> >   visible at the expected time with respect to the io_uring cq ring.
> 
> I believe io_uring_cmd_done() with userspace cqe helper implies enough memory
> barrier, once the cqe is observed in userspace, any memory OP done before
> io_uring_cmd_done() should be observed by user side cqe handling code,
> otherwise it can be thought as io_uring bug.
> 
> If it isn't this way, we still can avoid any barrier by moving
> setting io desc into ubq daemon context(ubd_rq_task_work_fn), but I
> really want to save cpu in that context, and don't think it is needed.

It might be worth a comment that ubd_setup_iod() stores are implicitly
ordered by io_uring_cmd_done() so userspace will see the stores when it
sees the cqe.

By the way, does ubd_ch_mmap() need to check vma->vm_flags to prevent
ubdsrv from mapping with PROT_WRITE?

Stefan
Attachment:
signature.asc

Description: PGP signature