On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > Hello,
> > > >
> > > > So far UBLK is only used for implementing virtual block devices from userspace, such as loop, nbd, qcow2, ...[1].
> > >
> > > I won't be at LSF/MM so here are my thoughts:
> >
> > Thanks for the thoughts, :-)
> >
> > > > It could be useful for UBLK to cover real storage hardware too:
> > > >
> > > > - for fast prototyping or performance evaluation
> > > >
> > > > - some network storages are attached to the host, such as iscsi and nvme-tcp; the current UBLK interface doesn't support such devices, since it needs all LUNs/Namespaces to share host resources (such as tags)
> > >
> > > Can you explain this in more detail? It seems like an iSCSI or NVMe-over-TCP initiator could be implemented as a ublk server today. What am I missing?
> >
> > The current ublk can't do that yet, because the interface doesn't support multiple ublk disks sharing a single host, which is exactly the case of scsi and nvme.
>
> Can you give an example that shows exactly where a problem is hit?
>
> I took a quick look at the ublk source code and didn't spot a place where it prevents a single ublk server process from handling multiple devices.
>
> Regarding "host resources (such as tag)", can the ublk server deal with that in userspace? The Linux block layer doesn't have the concept of a "host", that would come in at the SCSI/NVMe level that's implemented in userspace.
>
> I don't understand yet...

blk_mq_tag_set is embedded into the driver's host structure and referenced by each queue via q->tag_set. Both scsi and nvme allocate tags host-wide, that is, all LUNs/NSs share the host/queue tags. Currently every ublk device is independent and can't share tags.

> > > > - SPDK has supported user space drivers for real hardware
> > >
> > > I think this could already be implemented today. There will be extra memory copies because SPDK won't have access to the application's memory pages.
> >
> > Here I proposed zero copy, and the current SPDK nvme-pci implementation doesn't have such an extra copy, per my understanding.
> >
> > > > So I propose to extend UBLK to support real hardware devices:
> > > >
> > > > 1) extend the UBLK ABI interface to support disks attached to a host, such as SCSI LUNs/NVMe Namespaces
> > > >
> > > > 2) the following items are related to operating hardware from userspace, so the userspace driver has to be trusted, root is required, and unprivileged UBLK devices can't be supported
> > >
> > > Linux VFIO provides a safe userspace API for userspace device drivers. That means memory and interrupts are isolated. Neither userspace nor the hardware device can access memory or interrupts that the userspace process is not allowed to access.
> > >
> > > I think there are still limitations like all memory pages exposed to the device need to be pinned. So effectively you might still need privileges to get the mlock resource limits.
> > >
> > > But overall I think what you're saying about root and unprivileged ublk devices is not true. Hardware support should be developed with the goal of supporting unprivileged userspace ublk servers.
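For reference, the VFIO open sequence being described looks roughly like the sketch below. This is not ublk or SPDK code; the IOMMU group path and PCI address are hypothetical placeholders and error handling is mostly omitted. The only privilege required is access to the /dev/vfio/<group> file, which the admin grants once:

```
/*
 * Minimal sketch of opening a device through VFIO, assuming the device has
 * been bound to vfio-pci and a type1 IOMMU is in use.  Group path and PCI
 * address are hypothetical examples.
 */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int open_vfio_device(const char *group_path, const char *bdf)
{
	int container, group, device;
	struct vfio_group_status status = { .argsz = sizeof(status) };

	container = open("/dev/vfio/vfio", O_RDWR);	/* IOMMU container */
	group = open(group_path, O_RDWR);		/* e.g. "/dev/vfio/42" */

	ioctl(group, VFIO_GROUP_GET_STATUS, &status);
	if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
		return -1;	/* not all devices in the group are bound to vfio */

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);	/* attach group to container */
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);	/* select the type1 IOMMU backend */

	/* e.g. bdf = "0000:01:00.0"; the returned fd is used for BAR mmap and irq setup */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, bdf);
	return device;
}
```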
> > > Those unprivileged userspace ublk servers cannot claim any PCI device they want. The user/admin will need to give them permission to open a network card, SCSI HBA, etc.
> >
> > It depends on the implementation, please see
> >
> > https://spdk.io/doc/userspace.html
> >
> > ```
> > The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
> > then follows along with the NVMe Specification to initialize the device,
> > create queue pairs, and ultimately send I/O.
> > ```
> >
> > The above way needs userspace to operate the hardware through the mapped BAR, which can't be allowed for an unprivileged user.
>
> From https://spdk.io/doc/system_configuration.html:
>
> Running SPDK as non-privileged user
>
> One of the benefits of using the VFIO Linux kernel driver is the ability to perform DMA operations with peripheral devices as unprivileged user. The permissions to access particular devices still need to be granted by the system administrator, but only on a one-time basis. Note that this functionality is supported with DPDK starting from version 18.11.
>
> This is what I had described in my previous reply.

My references on SPDK were mostly from the spdk/nvme docs. Taking a quick look at the SPDK code, it looks like both vfio and direct hardware programming are supported:

1) lib/nvme/nvme_vfio_user.c

const struct spdk_nvme_transport_ops vfio_ops {
	.qpair_submit_request = nvme_pcie_qpair_submit_request,

2) lib/nvme/nvme_pcie.c

const struct spdk_nvme_transport_ops pcie_ops = {
	.qpair_submit_request = nvme_pcie_qpair_submit_request
		nvme_pcie_qpair_submit_tracker
		nvme_pcie_qpair_submit_tracker
		nvme_pcie_qpair_ring_sq_doorbell

but vfio dma isn't used in nvme_pcie_qpair_submit_request, which simply writes/reads the mmapped MMIO.

> > > > 3) how to operate the hardware memory space
> > > > - unbind the kernel driver and rebind with uio/vfio
> > > > - map the PCI BAR into userspace[2], then userspace can operate the hardware with the mapped user address via MMIO
> > > >
> > > > 4) DMA
> > > > - DMA requires physical memory addresses; the UBLK driver actually has the block request pages, so can we export the request SG list (each segment's physical address, offset, len) into userspace? If the max_segments limit is not too big (<=64), the buffer needed for holding the SG list can be small enough.
> > >
> > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical address. The IOVA space is defined by the IOMMU page tables. Userspace controls the IOMMU page tables via Linux VFIO ioctls.
> > >
> > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the IOMMU mapping that makes a range of userspace virtual addresses available at a given IOVA.
> > >
> > > Mapping and unmapping operations are not free. Similar to mmap(2), the program will be slow if it does this frequently.
> >
> > Yeah, but SPDK doesn't seem to use the vfio DMA interface, see:
> >
> > https://spdk.io/doc/memory.html
> >
> > they just program DMA directly with the physical addresses of pinned hugepages.
>
> From the page you linked:
>
> IOMMU Support
>
> ...
>
> This is a future-proof, hardware-accelerated solution for performing DMA operations into and out of a user space process and forms the long-term foundation for SPDK and DPDK's memory management strategy. We highly recommend that applications are deployed using vfio and the IOMMU enabled, which is fully supported today.
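To make the struct vfio_iommu_type1_dma_map usage mentioned above concrete, here is a rough sketch of a single long-lived mapping; the container fd, buffer, and IOVA value are placeholders, not ublk or SPDK code:

```
/*
 * Rough sketch of one long-lived VFIO DMA mapping.  'container' is an open
 * /dev/vfio/vfio fd with VFIO_TYPE1_IOMMU already selected; buffer and IOVA
 * are hypothetical.
 */
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int map_io_buffer(int container, void *buf, size_t len, uint64_t iova)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)buf,	/* userspace virtual address */
		.iova  = iova,			/* address the device will use */
		.size  = len,
	};

	/* pins the pages and installs IOMMU translations; done once, not per IO */
	return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}
```

The matching VFIO_IOMMU_UNMAP_DMA at teardown is the other half of the cost Stefan mentions, which is why such mappings are normally kept for the lifetime of the process rather than created per IO.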
> Yes, SPDK supports running without the IOMMU, but they recommend running with the IOMMU.
>
> > > I think it's effectively the same problem as ublk zero-copy. We want to give the ublk server access to just the I/O buffers that it currently needs, but doing so would be expensive :(.
> > >
> > > I think Linux has strategies for avoiding the expense like iommu.strict=0 and swiotlb. The drawback is that in our case userspace and/or the hardware device controlled by userspace would still have access to the memory pages after I/O has completed. This reduces memory isolation :(.
> > >
> > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.
> >
> > Per the above SPDK links, the nvme-pci implementation doesn't use vfio dma mapping.
>
> When using VFIO (recommended by the docs), SPDK uses long-lived DMA mappings. Here are places in the SPDK/DPDK source code where VFIO DMA mapping is used:
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371
> https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164

I meant the SPDK nvme driver implementation.

> > > What I'm trying to get at is that either memory isolation is compromised or performance is reduced. It's hard to have good performance together with memory isolation.
> > >
> > > I think ublk should follow the VFIO philosophy of being a safe kernel/userspace interface. If userspace is malicious or buggy, the kernel's and other processes' memory should not be corrupted.
> >
> > It is a tradeoff between performance and isolation; that is why I mentioned that directly programming hardware in userspace can be done by root only.
>
> Yes, there is a trade-off. Over the years the use of unsafe approaches has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As secure boot, integrity architecture, and stuff like that become more widely used, it's harder to include features that break memory isolation in software in mainstream distros. There can be an option to sacrifice memory isolation for performance and some users may be willing to accept the trade-off. I think it should be an optional feature though.
>
> I did want to point out that the statement that "direct programming hardware in userspace can be done by root only" is false (see VFIO).

Unfortunately I don't see vfio being used when spdk/nvme operates the hardware MMIO.

> > > > - a small amount of physical memory for use as DMA descriptors can be pre-allocated from userspace, with the kernel asked to pin the pages and still return the physical addresses to userspace for programming DMA
> > >
> > > I think this is possible today. The ublk server owns the I/O buffers. It can mlock them and DMA map them via VFIO. ublk doesn't need to know anything about this.
> >
> > It depends on whether such VFIO DMA mapping is required for each IO. If it is required, that won't help a high performance driver.
>
> It is not necessary to perform a DMA mapping for each IO. ublk's existing model is sufficient:
> 1. ublk server allocates I/O buffers and VFIO DMA maps them on startup.
> 2. At runtime the ublk server provides these I/O buffers to the kernel, no further DMA mapping is required.
>
> Unfortunately there's still the kernel<->userspace copy that existing ublk applications have, but there's no new overhead related to VFIO.

We are working on ublk zero copy to avoid that copy.
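To illustrate the two-step model above (map at startup, reuse at runtime), here is a rough sketch of how a ublk server might lay out its per-(queue, tag) I/O buffers; the pool geometry, the IOVA, and the vfio_dma_map() helper are hypothetical stand-ins, assuming something like the mapping sketch earlier:

```
/*
 * Sketch of the "map once at startup" model with a hypothetical buffer
 * layout: one contiguous region holds a fixed-size buffer per (queue, tag),
 * it is VFIO-mapped once at startup, and at runtime the server only does
 * pointer arithmetic -- no per-IO map/unmap ioctl.
 */
#include <stdint.h>
#include <stdlib.h>

#define NR_QUEUES	4
#define QUEUE_DEPTH	128
#define BUF_SIZE	(256 * 1024)		/* hypothetical max I/O size per tag */
#define POOL_SIZE	((size_t)NR_QUEUES * QUEUE_DEPTH * BUF_SIZE)
#define POOL_IOVA	0x100000000ULL		/* arbitrary IOVA chosen by the server */

/* assumed helper, e.g. a wrapper around VFIO_IOMMU_MAP_DMA as sketched above */
int vfio_dma_map(int container, void *vaddr, size_t len, uint64_t iova);

static void *pool;

static int init_buffers(int container)
{
	if (posix_memalign(&pool, 4096, POOL_SIZE))
		return -1;
	/* single long-lived mapping covering every per-tag buffer */
	return vfio_dma_map(container, pool, POOL_SIZE, POOL_IOVA);
}

/* runtime lookup: no syscalls, the IOVA layout mirrors the virtual layout */
static inline void *buf_vaddr(int q, int tag)
{
	return (char *)pool + ((size_t)q * QUEUE_DEPTH + tag) * BUF_SIZE;
}

static inline uint64_t buf_iova(int q, int tag)
{
	return POOL_IOVA + ((uint64_t)q * QUEUE_DEPTH + tag) * BUF_SIZE;
}
```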
> > > > - this way is still zero copy
> > >
> > > True zero-copy would be when an application does O_DIRECT I/O and the hardware device DMAs to/from the application's memory pages. ublk doesn't do that today and when combined with VFIO it doesn't get any easier. I don't think it's possible because you cannot allow userspace to control a hardware device and grant DMA access to pages that userspace isn't allowed to access. A malicious userspace will program the device to access those pages :).
> >
> > But that should be what SPDK nvme/pci is doing per the above links, :-)
>
> Sure, it's possible to break memory isolation. Breaking memory isolation isn't specific to ublk servers that access hardware. The same unsafe zero-copy approach would probably also work for regular ublk servers. This is basically bringing back /dev/kmem :).
>
> > > > 5) notification from hardware: interrupt or polling
> > > > - SPDK applies userspace polling; this way is doable, but it eats CPU, so it is only one of the choices
> > > >
> > > > - the io_uring command has been proven to be very efficient; if an io_uring command is applied to uio/vfio for delivering interrupts (in a similar way to how UBLK forwards blk io commands from kernel to userspace), it should be efficient too, given that batch processing is done after the io_uring command is completed
> > >
> > > I wonder how much difference there is between the new io_uring command for receiving VFIO irqs that you are suggesting compared to the existing io_uring approach IORING_OP_READ eventfd.
> >
> > eventfd needs extra read/write on the event fd, so more syscalls are required.
>
> No extra syscall is required because IORING_OP_READ is used to read the eventfd, but maybe you were referring to bypassing the file->f_op->read() code path?

OK, I missed that. It is usually done in the following way:

	sqe = io_uring_get_sqe(&ring);
	/* wait for the eventfd to become readable */
	io_uring_prep_poll_add(sqe, evfd, POLLIN);
	sqe->flags |= IOSQE_IO_LINK;
	...
	sqe = io_uring_get_sqe(&ring);
	/* linked read: consume the eventfd counter once the poll completes */
	io_uring_prep_readv(sqe, evfd, &vec, 1, 0);
	sqe->flags |= IOSQE_IO_LINK;

When I get time, I will compare the two and see which one performs better; a sketch of the plain IORING_OP_READ variant is appended below for reference.

thanks,
Ming
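For reference, the plain IORING_OP_READ-on-eventfd variant mentioned above would look roughly like the sketch below, using liburing; the eventfd and ring setup are assumed, and this is only an illustration, not measured code:

```
/*
 * Sketch of the IORING_OP_READ-on-eventfd approach: submit a read of the
 * 8-byte eventfd counter, and re-arm it after each completion.  'evfd' and
 * the ring are assumed to exist already.
 */
#include <liburing.h>
#include <stdint.h>

static void submit_evfd_read(struct io_uring *ring, int evfd, uint64_t *count)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* IORING_OP_READ on the eventfd: completes when the counter is non-zero */
	io_uring_prep_read(sqe, evfd, count, sizeof(*count), 0);
	io_uring_submit(ring);
}

static void irq_loop(struct io_uring *ring, int evfd)
{
	uint64_t count;
	struct io_uring_cqe *cqe;

	submit_evfd_read(ring, evfd, &count);
	for (;;) {
		io_uring_wait_cqe(ring, &cqe);
		io_uring_cqe_seen(ring, cqe);
		/* ... handle 'count' interrupt events, then re-arm ... */
		submit_evfd_read(ring, evfd, &count);
	}
}
```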