On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> Hello,
>
> So far UBLK is only used for implementing virtual block device from
> userspace, such as loop, nbd, qcow2, ...[1].

I won't be at LSF/MM so here are my thoughts:

>
> It could be useful for UBLK to cover real storage hardware too:
>
> - for fast prototype or performance evaluation
>
> - some network storages are attached to host, such as iscsi and nvme-tcp,
>   the current UBLK interface doesn't support such devices, since it needs
>   all LUNs/Namespaces to share host resources(such as tag)

Can you explain this in more detail? It seems like an iSCSI or
NVMe-over-TCP initiator could be implemented as a ublk server today.
What am I missing?

>
> - SPDK has supported user space driver for real hardware

I think this could already be implemented today. There will be extra
memory copies because SPDK won't have access to the application's memory
pages.

>
> So propose to extend UBLK for supporting real hardware device:
>
> 1) extend UBLK ABI interface to support disks attached to host, such
>    as SCSI Luns/NVME Namespaces
>
> 2) the followings are related with operating hardware from userspace,
>    so userspace driver has to be trusted, and root is required, and
>    can't support unprivileged UBLK device

Linux VFIO provides a safe userspace API for userspace device drivers.
That means memory and interrupts are isolated. Neither userspace nor the
hardware device can access memory or interrupts that the userspace
process is not allowed to access.

I think there are still limitations like all memory pages exposed to the
device need to be pinned. So effectively you might still need privileges
to get the mlock resource limits.

But overall I think what you're saying about root and unprivileged ublk
devices is not true. Hardware support should be developed with the goal
of supporting unprivileged userspace ublk servers.

Those unprivileged userspace ublk servers cannot claim any PCI device
they want. The user/admin will need to give them permission to open a
network card, SCSI HBA, etc.

>
> 3) how to operating hardware memory space
>    - unbind kernel driver and rebind with uio/vfio
>    - map PCI BAR into userspace[2], then userspace can operate hardware
>      with mapped user address via MMIO
>
> 4) DMA
>    - DMA requires physical memory address, UBLK driver actually has
>      block request pages, so can we export request SG list(each segment
>      physical address, offset, len) into userspace? If the max_segments
>      limit is not too big(<=64), the needed buffer for holding SG list
>      can be small enough.

DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical
address. The IOVA space is defined by the IOMMU page tables. Userspace
controls the IOMMU page tables via Linux VFIO ioctls.

For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the
IOMMU mapping that makes a range of userspace virtual addresses
available at a given IOVA.

Mapping and unmapping operations are not free. Similar to mmap(2), the
program will be slow if it does this frequently.

I think it's effectively the same problem as ublk zero-copy. We want to
give the ublk server access to just the I/O buffers that it currently
needs, but doing so would be expensive :(.

I think Linux has strategies for avoiding the expense like
iommu.strict=0 and swiotlb. The drawback is that in our case userspace
and/or the hardware device controlled by userspace would still have
access to the memory pages after I/O has completed. This reduces memory
isolation :(.

DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.

What I'm trying to get at is that either memory isolation is compromised
or performance is reduced. It's hard to have good performance together
with memory isolation.

I think ublk should follow the VFIO philosophy of being a safe
kernel/userspace interface. If userspace is malicious or buggy, the
kernel's and other processes' memory should not be corrupted.
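
To make the VFIO DMA mapping mentioned above concrete, here is a minimal
sketch of how a userspace driver might fill in struct
vfio_iommu_type1_dma_map. It is an illustration under assumptions, not
code from ublk or SPDK: the helper names build_dma_map and map_dma are
hypothetical, the ioctl number is computed with the x86-64/arm64 _IO()
encoding, and the alignment check assumes the IOMMU's minimum page size
equals the CPU page size. The ioctl itself needs an already configured
VFIO container fd, so it is only wrapped, not called.

```python
import ctypes
import fcntl
import mmap

# Constants mirrored from <linux/vfio.h>.
VFIO_TYPE = ord(";")
VFIO_BASE = 100
# _IO(VFIO_TYPE, VFIO_BASE + 13); this encoding assumes x86-64/arm64.
VFIO_IOMMU_MAP_DMA = (VFIO_TYPE << 8) | (VFIO_BASE + 13)
VFIO_DMA_MAP_FLAG_READ = 1 << 0   # device may read from the buffer
VFIO_DMA_MAP_FLAG_WRITE = 1 << 1  # device may write to the buffer


class VfioIommuType1DmaMap(ctypes.Structure):
    """Python mirror of struct vfio_iommu_type1_dma_map."""
    _fields_ = [
        ("argsz", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
        ("vaddr", ctypes.c_uint64),  # process virtual address of the buffer
        ("iova", ctypes.c_uint64),   # address the device will use for DMA
        ("size", ctypes.c_uint64),   # length of the mapping in bytes
    ]


def build_dma_map(vaddr: int, iova: int, size: int) -> VfioIommuType1DmaMap:
    """Hypothetical helper: describe one read/write DMA mapping."""
    page = mmap.PAGESIZE  # assumption: IOMMU page size == CPU page size
    if size == 0 or vaddr % page or iova % page or size % page:
        raise ValueError("vaddr, iova and size must be page-aligned")
    return VfioIommuType1DmaMap(
        argsz=ctypes.sizeof(VfioIommuType1DmaMap),
        flags=VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        vaddr=vaddr,
        iova=iova,
        size=size,
    )


def map_dma(container_fd: int, req: VfioIommuType1DmaMap) -> None:
    """Hypothetical helper: install the mapping in the IOMMU page tables.

    container_fd must be a VFIO container with a type1 IOMMU already set.
    """
    fcntl.ioctl(container_fd, VFIO_IOMMU_MAP_DMA, bytes(req))
```

A long-lived mapping in the DPDK/SPDK style would call map_dma once per
buffer pool at startup. Mapping per I/O would instead pay the cost of
this ioctl (and the matching unmap) on every request, which is the
trade-off described above.
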
>
>    - small amount of physical memory for using as DMA descriptor can be
>      pre-allocated from userspace, and ask kernel to pin pages, then still
>      return physical address to userspace for programming DMA

I think this is possible today. The ublk server owns the I/O buffers. It
can mlock them and DMA map them via VFIO. ublk doesn't need to know
anything about this.

>    - this way is still zero copy

True zero-copy would be when an application does O_DIRECT I/O and the
hardware device DMAs to/from the application's memory pages. ublk
doesn't do that today and when combined with VFIO it doesn't get any
easier. I don't think it's possible because you cannot allow userspace
to control a hardware device and grant DMA access to pages that
userspace isn't allowed to access. A malicious userspace will program
the device to access those pages :).

>
> 5) notification from hardware: interrupt or polling
>    - SPDK applies userspace polling, this way is doable, but
>      eat CPU, so it is only one choice
>
>    - io_uring command has been proved as very efficient, if io_uring
>      command is applied(similar way with UBLK for forwarding blk io
>      command from kernel to userspace) to uio/vfio for delivering interrupt,
>      which should be efficient too, given batching processes are done after
>      the io_uring command is completed

I wonder how much difference there is between the new io_uring command
for receiving VFIO irqs that you are suggesting compared to the existing
io_uring approach IORING_OP_READ eventfd.

>    - or it could be flexible by hybrid interrupt & polling, given
>      userspace single pthread/queue implementation can retrieve all
>      kinds of inflight IO info in very cheap way, and maybe it is likely
>      to apply some ML model to learn & predict when IO will be completed

Stefano Garzarella and I have discussed but not yet attempted to add a
userspace memory polling command to io_uring. IORING_OP_POLL_MEMORY
would be useful together with IORING_SETUP_IOPOLL. That way kernel
polling can be combined with userspace polling on a single CPU.

I'm not sure it's useful for ublk because you may not have any reason to
use IORING_SETUP_IOPOLL. But applications that have a Linux NVMe block
device open with IORING_SETUP_IOPOLL could use the new
IORING_OP_POLL_MEMORY command to also watch for activity on a VIRTIO or
VFIO PCI device or maybe just to get kicked by another userspace thread.

> 6) others?
>
>
>
> [1] https://github.com/ming1/ubdsrv
> [2] https://spdk.io/doc/userspace.html
>
>
> Thanks,
> Ming
>