Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware

Stefan Hajnoczi <stefanha@xxxxxxxxxx> · Mon, 6 Feb 2023 15:27:09 -0500

On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> Hello,
> 
> So far UBLK is only used for implementing virtual block device from
> userspace, such as loop, nbd, qcow2, ...[1].

I won't be at LSF/MM so here are my thoughts:

> 
> It could be useful for UBLK to cover real storage hardware too:
> 
> - for fast prototype or performance evaluation
> 
> - some network storages are attached to host, such as iscsi and nvme-tcp,
> the current UBLK interface doesn't support such devices, since it needs
> all LUNs/Namespaces to share host resources(such as tag)

Can you explain this in more detail? It seems like an iSCSI or
NVMe-over-TCP initiator could be implemented as a ublk server today.
What am I missing?

> 
> - SPDK has supported user space driver for real hardware

I think this could already be implemented today. There will be extra
memory copies because SPDK won't have access to the application's memory
pages.

> 
> So propose to extend UBLK for supporting real hardware device:
> 
> 1) extend UBLK ABI interface to support disks attached to host, such
> as SCSI Luns/NVME Namespaces
> 
> 2) the followings are related with operating hardware from userspace,
> so userspace driver has to be trusted, and root is required, and
> can't support unprivileged UBLK device

Linux VFIO provides a safe userspace API for userspace device drivers.
That means memory and interrupts are isolated. Neither userspace nor the
hardware device can access memory or interrupts that the userspace
process is not allowed to access.

I think there are still limitations like all memory pages exposed to the
device need to be pinned. So effectively you might still need privileges
to raise the mlock resource limit.
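
To make the resource limit point concrete, here is a minimal sketch of how
an unprivileged ublk server could check up front whether its I/O buffers
fit under RLIMIT_MEMLOCK. The helper name memlock_limit_allows() is made up
for illustration, and the check ignores memory the process has already
locked, so treat it as a rough pre-flight test only:

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: returns true if `bytes` fits under the soft
 * RLIMIT_MEMLOCK limit of the calling process. It does not account for
 * pages that are already locked or pinned.
 */
static bool memlock_limit_allows(size_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return false;               /* cannot tell, be conservative */
    if (rl.rlim_cur == RLIM_INFINITY)
        return true;                /* no limit configured */
    return (rlim_t)bytes <= rl.rlim_cur;
}
```

If this returns false for the buffer pool size, the admin would have to
raise the limit (or grant CAP_IPC_LOCK) before pinning can succeed.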

But overall I think what you're saying about root and unprivileged ublk
devices is not true. Hardware support should be developed with the goal
of supporting unprivileged userspace ublk servers.

Those unprivileged userspace ublk servers cannot claim any PCI device
they want. The user/admin will need to give them permission to open a
network card, SCSI HBA, etc.

> 
> 3) how to operating hardware memory space
> - unbind kernel driver and rebind with uio/vfio
> - map PCI BAR into userspace[2], then userspace can operate hardware
> with mapped user address via MMIO
>
> 4) DMA
> - DMA requires physical memory address, UBLK driver actually has
> block request pages, so can we export request SG list(each segment
> physical address, offset, len) into userspace? If the max_segments
> limit is not too big(<=64), the needed buffer for holding SG list
> can be small enough.

DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical
address. The IOVA space is defined by the IOMMU page tables. Userspace
controls the IOMMU page tables via Linux VFIO ioctls.

For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the
IOMMU mapping that makes a range of userspace virtual addresses
available at a given IOVA.
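
A minimal sketch of what that looks like in code, assuming a VFIO
container fd that has already been set up with the type1 IOMMU backend.
The helper names make_dma_map() and dma_map_buffer() are mine, not part of
any API; only the struct, the flags and the ioctl number come from
<linux/vfio.h>:

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical helper: describe a read/write mapping of [vaddr, vaddr+size)
 * at the given IOVA. vaddr, iova and size are expected to be page aligned. */
static struct vfio_iommu_type1_dma_map
make_dma_map(void *vaddr, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (uint64_t)(uintptr_t)vaddr;
    map.iova = iova;
    map.size = size;
    return map;
}

/* Hypothetical helper: install the mapping in the IOMMU page tables.
 * Returns 0 on success, -1 with errno set on failure. */
static int dma_map_buffer(int container_fd, void *vaddr,
                          uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = make_dma_map(vaddr, iova, size);

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

The device is then programmed with the IOVA, never with a CPU physical
address.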

Mapping and unmapping operations are not free. Similar to mmap(2), the
program will be slow if it does this frequently.

I think it's effectively the same problem as ublk zero-copy. We want to
give the ublk server access to just the I/O buffers that it currently
needs, but doing so would be expensive :(.

I think Linux has strategies for avoiding the expense like
iommu.strict=0 and swiotlb. The drawback is that in our case userspace
and/or the hardware device controlled by userspace would still have
access to the memory pages after I/O has completed. This reduces memory
isolation :(.

DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.

What I'm trying to get at is that either memory isolation is compromised
or performance is reduced. It's hard to have good performance together
with memory isolation.

I think ublk should follow the VFIO philosophy of being a safe
kernel/userspace interface. If userspace is malicious or buggy, the
kernel's and other processes' memory should not be corrupted.

> 
> - small amount of physical memory for using as DMA descriptor can be
> pre-allocated from userspace, and ask kernel to pin pages, then still
> return physical address to userspace for programming DMA

I think this is possible today. The ublk server owns the I/O buffers. It
can mlock them and DMA map them via VFIO. ublk doesn't need to know
anything about this.

> - this way is still zero copy

True zero-copy would be when an application does O_DIRECT I/O and the
hardware device DMAs to/from the application's memory pages. ublk
doesn't do that today and when combined with VFIO it doesn't get any
easier. I don't think it's possible because you cannot allow userspace
to control a hardware device and grant DMA access to pages that
userspace isn't allowed to access. A malicious userspace will program
the device to access those pages :).

> 
> 5) notification from hardware: interrupt or polling
> - SPDK applies userspace polling, this way is doable, but
> eat CPU, so it is only one choice
> 
> - io_uring command has been proved as very efficient, if io_uring
> command is applied(similar way with UBLK for forwarding blk io
> command from kernel to userspace) to uio/vfio for delivering interrupt,
> which should be efficient too, given batching processes are done after
> the io_uring command is completed

I wonder how much difference there is between the new io_uring command
for receiving VFIO irqs that you are suggesting and the existing approach
of using IORING_OP_READ on an eventfd.

> - or it could be flexible by hybrid interrupt & polling, given
> userspace single pthread/queue implementation can retrieve all
> kinds of inflight IO info in very cheap way, and maybe it is likely
> to apply some ML model to learn & predict when IO will be completed

Stefano Garzarella and I have discussed but not yet attempted to add a
userspace memory polling command to io_uring. IORING_OP_POLL_MEMORY
would be useful together with IORING_SETUP_IOPOLL. That way kernel
polling can be combined with userspace polling on a single CPU.

I'm not sure it's useful for ublk because you may not have any reason to
use IORING_SETUP_IOPOLL. But applications that have a Linux NVMe block
device open with IORING_SETUP_IOPOLL could use the new
IORING_OP_POLL_MEMORY command to also watch for activity on a VIRTIO or
VFIO PCI device or maybe just to get kicked by another userspace thread.
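
To be clear, IORING_OP_POLL_MEMORY does not exist. As a rough sketch of
the userspace side it would replace, this is the kind of bounded
busy-wait on a memory word that an application does today. The function
name poll_memory() and the spin budget are assumptions for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: spin until *addr differs from `old` or the spin
 * budget runs out. Returns true if a change was observed. The writer
 * could be a device doing DMA or another thread kicking us.
 */
static bool poll_memory(_Atomic uint32_t *addr, uint32_t old,
                        unsigned int max_spins)
{
    for (unsigned int i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(addr, memory_order_acquire) != old)
            return true;
    }
    return false;
}
```

The point of doing this inside io_uring would be that the kernel could
interleave this check with its own IOPOLL work instead of userspace
burning a second CPU on the loop.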

> 6) others?
> 
> 
> 
> [1] https://github.com/ming1/ubdsrv
> [2] https://spdk.io/doc/userspace.html
>  
> 
> Thanks, 
> Ming
> 