On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > Hello,
> > > >
> > > > So far UBLK is only used for implementing virtual block devices from userspace, such as loop, nbd, qcow2, ...[1].
> > >
> > > I won't be at LSF/MM so here are my thoughts:
> >
> > Thanks for the thoughts, :-)
> >
> > > > It could be useful for UBLK to cover real storage hardware too:
> > > >
> > > > - for fast prototyping or performance evaluation
> > > >
> > > > - some network storages are attached to the host, such as iscsi and nvme-tcp; the current UBLK interface doesn't support such devices, since it needs all LUNs/Namespaces to share host resources (such as tags)
> > >
> > > Can you explain this in more detail? It seems like an iSCSI or NVMe-over-TCP initiator could be implemented as a ublk server today. What am I missing?
> >
> > The current ublk can't do that yet, because the interface doesn't support multiple ublk disks sharing a single host, which is exactly the case of scsi and nvme.
>
> Can you give an example that shows exactly where a problem is hit?
>
> I took a quick look at the ublk source code and didn't spot a place where it prevents a single ublk server process from handling multiple devices.
>
> Regarding "host resources (such as tag)", can the ublk server deal with that in userspace? The Linux block layer doesn't have the concept of a "host", that would come in at the SCSI/NVMe level that's implemented in userspace.
>
> I don't understand yet...

blk_mq_tag_set is embedded into the driver's host structure and referenced by each queue via q->tag_set. Both scsi and nvme allocate tags host-wide, that is, all LUNs/NSs share the host/queue tags. Currently every ublk device is independent and can't share tags.

> > > > - SPDK has supported user space drivers for real hardware
> > >
> > > I think this could already be implemented today. There will be extra memory copies because SPDK won't have access to the application's memory pages.
> >
> > Here I proposed zero copy, and the current SPDK nvme-pci implementation doesn't have such an extra copy, per my understanding.
> >
> > > > So I propose to extend UBLK to support real hardware devices:
> > > >
> > > > 1) extend the UBLK ABI interface to support disks attached to a host, such as SCSI LUNs/NVMe Namespaces
> > > >
> > > > 2) the following items are related to operating hardware from userspace, so the userspace driver has to be trusted, root is required, and unprivileged UBLK devices can't be supported
> > >
> > > Linux VFIO provides a safe userspace API for userspace device drivers. That means memory and interrupts are isolated. Neither userspace nor the hardware device can access memory or interrupts that the userspace process is not allowed to access.
> > >
> > > I think there are still limitations like all memory pages exposed to the device need to be pinned. So effectively you might still need privileges to get the mlock resource limits.
> > >
> > > But overall I think what you're saying about root and unprivileged ublk devices is not true. Hardware support should be developed with the goal of supporting unprivileged userspace ublk servers.
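For reference, the VFIO open sequence being described looks roughly like the sketch below. This is not ublk or SPDK code; the IOMMU group path and PCI address are hypothetical placeholders and error handling is mostly omitted. The only privilege required is access to the /dev/vfio/<group> file, which the admin grants once:

```
/*
 * Minimal sketch of opening a device through VFIO, assuming the device has
 * been bound to vfio-pci and a type1 IOMMU is in use.  Group path and PCI
 * address are hypothetical examples.
 */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int open_vfio_device(const char *group_path, const char *bdf)
{
	int container, group, device;
	struct vfio_group_status status = { .argsz = sizeof(status) };

	container = open("/dev/vfio/vfio", O_RDWR);	/* IOMMU container */
	group = open(group_path, O_RDWR);		/* e.g. "/dev/vfio/42" */

	ioctl(group, VFIO_GROUP_GET_STATUS, &status);
	if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
		return -1;	/* not all devices in the group are bound to vfio */

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);	/* attach group to container */
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);	/* select the type1 IOMMU backend */

	/* e.g. bdf = "0000:01:00.0"; the returned fd is used for BAR mmap and irq setup */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, bdf);
	return device;
}
```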
> > > Those unprivileged userspace ublk servers cannot claim any PCI device they want. The user/admin will need to give them permission to open a network card, SCSI HBA, etc.
> >
> > It depends on the implementation, please see
> >
> > https://spdk.io/doc/userspace.html
> >
> > ```
> > The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
> > then follows along with the NVMe Specification to initialize the device,
> > create queue pairs, and ultimately send I/O.
> > ```
> >
> > The above way needs userspace to operate the hardware through the mapped BAR, which can't be allowed for an unprivileged user.
>
> From https://spdk.io/doc/system_configuration.html:
>
> Running SPDK as non-privileged user
>
> One of the benefits of using the VFIO Linux kernel driver is the ability to perform DMA operations with peripheral devices as unprivileged user. The permissions to access particular devices still need to be granted by the system administrator, but only on a one-time basis. Note that this functionality is supported with DPDK starting from version 18.11.
>
> This is what I had described in my previous reply.

My references on SPDK were mostly from the spdk/nvme docs. Taking a quick look at the SPDK code, it looks like both vfio and direct hardware programming are supported:

1) lib/nvme/nvme_vfio_user.c

const struct spdk_nvme_transport_ops vfio_ops {
	.qpair_submit_request = nvme_pcie_qpair_submit_request,

2) lib/nvme/nvme_pcie.c

const struct spdk_nvme_transport_ops pcie_ops = {
	.qpair_submit_request = nvme_pcie_qpair_submit_request
		nvme_pcie_qpair_submit_tracker
		nvme_pcie_qpair_submit_tracker
		nvme_pcie_qpair_ring_sq_doorbell

but vfio dma isn't used in nvme_pcie_qpair_submit_request, which simply writes/reads the mmapped MMIO.

> > > > 3) how to operate the hardware memory space
> > > > - unbind the kernel driver and rebind with uio/vfio
> > > > - map the PCI BAR into userspace[2], then userspace can operate the hardware with the mapped user address via MMIO
> > > >
> > > > 4) DMA
> > > > - DMA requires physical memory addresses; the UBLK driver actually has the block request pages, so can we export the request SG list (each segment's physical address, offset, len) into userspace? If the max_segments limit is not too big (<=64), the buffer needed for holding the SG list can be small enough.
> > >
> > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical address. The IOVA space is defined by the IOMMU page tables. Userspace controls the IOMMU page tables via Linux VFIO ioctls.
> > >
> > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the IOMMU mapping that makes a range of userspace virtual addresses available at a given IOVA.
> > >
> > > Mapping and unmapping operations are not free. Similar to mmap(2), the program will be slow if it does this frequently.
> >
> > Yeah, but SPDK doesn't seem to use the vfio DMA interface, see:
> >
> > https://spdk.io/doc/memory.html
> >
> > they just program DMA directly with the physical addresses of pinned hugepages.
>
> From the page you linked:
>
> IOMMU Support
>
> ...
>
> This is a future-proof, hardware-accelerated solution for performing DMA operations into and out of a user space process and forms the long-term foundation for SPDK and DPDK's memory management strategy. We highly recommend that applications are deployed using vfio and the IOMMU enabled, which is fully supported today.
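To make the struct vfio_iommu_type1_dma_map usage mentioned above concrete, here is a rough sketch of a single long-lived mapping; the container fd, buffer, and IOVA value are placeholders, not ublk or SPDK code:

```
/*
 * Rough sketch of one long-lived VFIO DMA mapping.  'container' is an open
 * /dev/vfio/vfio fd with VFIO_TYPE1_IOMMU already selected; buffer and IOVA
 * are hypothetical.
 */
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int map_io_buffer(int container, void *buf, size_t len, uint64_t iova)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)buf,	/* userspace virtual address */
		.iova  = iova,			/* address the device will use */
		.size  = len,
	};

	/* pins the pages and installs IOMMU translations; done once, not per IO */
	return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}
```

The matching VFIO_IOMMU_UNMAP_DMA at teardown is the other half of the cost Stefan mentions, which is why such mappings are normally kept for the lifetime of the process rather than created per IO.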
> Yes, SPDK supports running without the IOMMU, but they recommend running with the IOMMU.
>
> > > I think it's effectively the same problem as ublk zero-copy. We want to give the ublk server access to just the I/O buffers that it currently needs, but doing so would be expensive :(.
> > >
> > > I think Linux has strategies for avoiding the expense like iommu.strict=0 and swiotlb. The drawback is that in our case userspace and/or the hardware device controlled by userspace would still have access to the memory pages after I/O has completed. This reduces memory isolation :(.
> > >
> > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.
> >
> > Per the above SPDK links, the nvme-pci implementation doesn't use vfio dma mapping.
>
> When using VFIO (recommended by the docs), SPDK uses long-lived DMA mappings. Here are places in the SPDK/DPDK source code where VFIO DMA mapping is used:
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371
> https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164

I meant the SPDK nvme driver implementation.

> > > What I'm trying to get at is that either memory isolation is compromised or performance is reduced. It's hard to have good performance together with memory isolation.
> > >
> > > I think ublk should follow the VFIO philosophy of being a safe kernel/userspace interface. If userspace is malicious or buggy, the kernel's and other processes' memory should not be corrupted.
> >
> > It is a tradeoff between performance and isolation; that is why I mentioned that directly programming hardware in userspace can be done by root only.
>
> Yes, there is a trade-off. Over the years the use of unsafe approaches has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As secure boot, integrity architecture, and stuff like that become more widely used, it's harder to include features that break memory isolation in software in mainstream distros. There can be an option to sacrifice memory isolation for performance and some users may be willing to accept the trade-off. I think it should be an optional feature though.
>
> I did want to point out that the statement that "direct programming hardware in userspace can be done by root only" is false (see VFIO).

Unfortunately I don't see vfio being used when spdk/nvme operates the hardware MMIO.

> > > > - a small amount of physical memory for use as DMA descriptors can be pre-allocated from userspace, with the kernel asked to pin the pages and still return the physical addresses to userspace for programming DMA
> > >
> > > I think this is possible today. The ublk server owns the I/O buffers. It can mlock them and DMA map them via VFIO. ublk doesn't need to know anything about this.
> >
> > It depends on whether such VFIO DMA mapping is required for each IO. If it is required, that won't help a high performance driver.
>
> It is not necessary to perform a DMA mapping for each IO. ublk's existing model is sufficient:
> 1. ublk server allocates I/O buffers and VFIO DMA maps them on startup.
> 2. At runtime the ublk server provides these I/O buffers to the kernel, no further DMA mapping is required.
>
> Unfortunately there's still the kernel<->userspace copy that existing ublk applications have, but there's no new overhead related to VFIO.

We are working on ublk zero copy to avoid that copy.
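To illustrate the two-step model above (map at startup, reuse at runtime), here is a rough sketch of how a ublk server might lay out its per-(queue, tag) I/O buffers; the pool geometry, the IOVA, and the vfio_dma_map() helper are hypothetical stand-ins, assuming something like the mapping sketch earlier:

```
/*
 * Sketch of the "map once at startup" model with a hypothetical buffer
 * layout: one contiguous region holds a fixed-size buffer per (queue, tag),
 * it is VFIO-mapped once at startup, and at runtime the server only does
 * pointer arithmetic -- no per-IO map/unmap ioctl.
 */
#include <stdint.h>
#include <stdlib.h>

#define NR_QUEUES	4
#define QUEUE_DEPTH	128
#define BUF_SIZE	(256 * 1024)		/* hypothetical max I/O size per tag */
#define POOL_SIZE	((size_t)NR_QUEUES * QUEUE_DEPTH * BUF_SIZE)
#define POOL_IOVA	0x100000000ULL		/* arbitrary IOVA chosen by the server */

/* assumed helper, e.g. a wrapper around VFIO_IOMMU_MAP_DMA as sketched above */
int vfio_dma_map(int container, void *vaddr, size_t len, uint64_t iova);

static void *pool;

static int init_buffers(int container)
{
	if (posix_memalign(&pool, 4096, POOL_SIZE))
		return -1;
	/* single long-lived mapping covering every per-tag buffer */
	return vfio_dma_map(container, pool, POOL_SIZE, POOL_IOVA);
}

/* runtime lookup: no syscalls, the IOVA layout mirrors the virtual layout */
static inline void *buf_vaddr(int q, int tag)
{
	return (char *)pool + ((size_t)q * QUEUE_DEPTH + tag) * BUF_SIZE;
}

static inline uint64_t buf_iova(int q, int tag)
{
	return POOL_IOVA + ((uint64_t)q * QUEUE_DEPTH + tag) * BUF_SIZE;
}
```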
> > > > - this way is still zero copy
> > >
> > > True zero-copy would be when an application does O_DIRECT I/O and the hardware device DMAs to/from the application's memory pages. ublk doesn't do that today and when combined with VFIO it doesn't get any easier. I don't think it's possible because you cannot allow userspace to control a hardware device and grant DMA access to pages that userspace isn't allowed to access. A malicious userspace will program the device to access those pages :).
> >
> > But that should be what SPDK nvme/pci is doing per the above links, :-)
>
> Sure, it's possible to break memory isolation. Breaking memory isolation isn't specific to ublk servers that access hardware. The same unsafe zero-copy approach would probably also work for regular ublk servers. This is basically bringing back /dev/kmem :).
>
> > > > 5) notification from hardware: interrupt or polling
> > > > - SPDK applies userspace polling; this way is doable, but it eats CPU, so it is only one of the choices
> > > >
> > > > - the io_uring command has been proven to be very efficient; if an io_uring command is applied to uio/vfio for delivering interrupts (in a similar way to how UBLK forwards blk io commands from kernel to userspace), it should be efficient too, given that batch processing is done after the io_uring command is completed
> > >
> > > I wonder how much difference there is between the new io_uring command for receiving VFIO irqs that you are suggesting compared to the existing io_uring approach IORING_OP_READ eventfd.
> >
> > eventfd needs extra read/write on the event fd, so more syscalls are required.
>
> No extra syscall is required because IORING_OP_READ is used to read the eventfd, but maybe you were referring to bypassing the file->f_op->read() code path?

OK, I missed that. It is usually done in the following way:

	sqe = io_uring_get_sqe(&ring);
	/* wait for the eventfd to become readable */
	io_uring_prep_poll_add(sqe, evfd, POLLIN);
	sqe->flags |= IOSQE_IO_LINK;
	...
	sqe = io_uring_get_sqe(&ring);
	/* linked read: consume the eventfd counter once the poll completes */
	io_uring_prep_readv(sqe, evfd, &vec, 1, 0);
	sqe->flags |= IOSQE_IO_LINK;

When I get time, I will compare the two and see which one performs better; a sketch of the plain IORING_OP_READ variant is appended below for reference.

thanks,
Ming
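For reference, the plain IORING_OP_READ-on-eventfd variant mentioned above would look roughly like the sketch below, using liburing; the eventfd and ring setup are assumed, and this is only an illustration, not measured code:

```
/*
 * Sketch of the IORING_OP_READ-on-eventfd approach: submit a read of the
 * 8-byte eventfd counter, and re-arm it after each completion.  'evfd' and
 * the ring are assumed to exist already.
 */
#include <liburing.h>
#include <stdint.h>

static void submit_evfd_read(struct io_uring *ring, int evfd, uint64_t *count)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* IORING_OP_READ on the eventfd: completes when the counter is non-zero */
	io_uring_prep_read(sqe, evfd, count, sizeof(*count), 0);
	io_uring_submit(ring);
}

static void irq_loop(struct io_uring *ring, int evfd)
{
	uint64_t count;
	struct io_uring_cqe *cqe;

	submit_evfd_read(ring, evfd, &count);
	for (;;) {
		io_uring_wait_cqe(ring, &cqe);
		io_uring_cqe_seen(ring, cqe);
		/* ... handle 'count' interrupt events, then re-arm ... */
		submit_evfd_read(ring, evfd, &count);
	}
}
```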