On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > Hello,
> > > > > >
> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > >
> > > > > I won't be at LSF/MM so here are my thoughts:
> > > >
> > > > Thanks for the thoughts, :-)
> > > >
> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > >
> > > > > > - for fast prototype or performance evaluation
> > > > > >
> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > >   the current UBLK interface doesn't support such devices, since it needs
> > > > > >   all LUNs/Namespaces to share host resources(such as tag)
> > > > >
> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > What am I missing?
> > > >
> > > > The current ublk can't do that yet, because the interface doesn't
> > > > support multiple ublk disks sharing single host, which is exactly
> > > > the case of scsi and nvme.
> > >
> > > Can you give an example that shows exactly where a problem is hit?
> > >
> > > I took a quick look at the ublk source code and didn't spot a place
> > > where it prevents a single ublk server process from handling multiple
> > > devices.
> > >
> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > userspace.
> > >
> > > I don't understand yet...
> >
> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > that said all LUNs/NSs share host/queue tags, current every ublk
> > device is independent, and can't shard tags.
>
> Does this actually prevent ublk servers with multiple ublk devices or is
> it just sub-optimal?

It is the former: ublk can't support multiple devices that share a single
host, because duplicated tags can be seen on the host side and I/O then
fails.

> Also, is this specific to real storage hardware? I guess userspace
> NVMe-over-TCP or iSCSI initiators would be affected regardless of
> whether they simply use the Sockets API (software) or userspace device
> drivers (hardware).
>
> Sorry for all these questions, I think I'm a little confused because you
> said "doesn't support such devices" and I thought this discussion was
> about real storage hardware. Neither of these seem to apply to the
> tag_set issue.

The reality is that both SCSI and NVMe (either virtual or real hardware)
support multiple LUNs/NSs, so the tag_set issue has to be solved, or
multiple LUNs/NSs have to be supported.
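To make the tag_set point concrete, here is a rough sketch (illustrative
only; "myhost" is a hypothetical driver, not ublk or actual SCSI/NVMe code)
of how a single host-wide blk_mq_tag_set is shared by every LUN/namespace,
which is why tags are a host resource rather than a per-disk one:

```
/*
 * Illustrative sketch only: a hypothetical "myhost" driver that hangs
 * several disks (LUNs/namespaces) off one host-wide blk_mq_tag_set,
 * the way SCSI and NVMe do.
 */
#include <linux/blk-mq.h>
#include <linux/blkdev.h>

struct myhost {
	struct blk_mq_tag_set tag_set;	/* one tag_set per host */
};

static int myhost_init(struct myhost *h, const struct blk_mq_ops *ops)
{
	h->tag_set.ops = ops;
	h->tag_set.nr_hw_queues = 4;
	h->tag_set.queue_depth = 128;	/* shared by every disk below */
	h->tag_set.numa_node = NUMA_NO_NODE;
	return blk_mq_alloc_tag_set(&h->tag_set);
}

/*
 * Every LUN/namespace gets its own gendisk/request_queue, but all of
 * them allocate tags from the same host-wide set via q->tag_set.
 */
static int myhost_add_disk(struct myhost *h, int idx)
{
	struct gendisk *disk = blk_mq_alloc_disk(&h->tag_set, NULL);

	if (IS_ERR(disk))
		return PTR_ERR(disk);
	snprintf(disk->disk_name, sizeof(disk->disk_name), "myhost0n%d", idx);
	return add_disk(disk);
}
```

Today each ublk device brings up its own independent tag_set, so there is
no equivalent of the shared h->tag_set above.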
> > > > > > - SPDK has supported user space driver for real hardware
> > > > >
> > > > > I think this could already be implemented today. There will be extra
> > > > > memory copies because SPDK won't have access to the application's
> > > > > memory pages.
> > > >
> > > > Here I proposed zero copy, and current SPDK nvme-pci implementation
> > > > haven't such extra copy per my understanding.
> > > >
> > > > > > So propose to extend UBLK for supporting real hardware device:
> > > > > >
> > > > > > 1) extend UBLK ABI interface to support disks attached to host, such
> > > > > >    as SCSI Luns/NVME Namespaces
> > > > > >
> > > > > > 2) the followings are related with operating hardware from userspace,
> > > > > >    so userspace driver has to be trusted, and root is required, and
> > > > > >    can't support unprivileged UBLK device
> > > > >
> > > > > Linux VFIO provides a safe userspace API for userspace device drivers.
> > > > > That means memory and interrupts are isolated. Neither userspace nor the
> > > > > hardware device can access memory or interrupts that the userspace
> > > > > process is not allowed to access.
> > > > >
> > > > > I think there are still limitations like all memory pages exposed to the
> > > > > device need to be pinned. So effectively you might still need privileges
> > > > > to get the mlock resource limits.
> > > > >
> > > > > But overall I think what you're saying about root and unprivileged ublk
> > > > > devices is not true. Hardware support should be developed with the goal
> > > > > of supporting unprivileged userspace ublk servers.
> > > > >
> > > > > Those unprivileged userspace ublk servers cannot claim any PCI device
> > > > > they want. The user/admin will need to give them permission to open a
> > > > > network card, SCSI HBA, etc.
> > > >
> > > > It depends on implementation, please see
> > > >
> > > > https://spdk.io/doc/userspace.html
> > > >
> > > > ```
> > > > The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
> > > > then follows along with the NVMe Specification to initialize the device,
> > > > create queue pairs, and ultimately send I/O.
> > > > ```
> > > >
> > > > The above way needs userspace to operating hardware by the mapped BAR,
> > > > which can't be allowed for unprivileged user.
> > >
> > > From https://spdk.io/doc/system_configuration.html:
> > >
> > > Running SPDK as non-privileged user
> > >
> > > One of the benefits of using the VFIO Linux kernel driver is the
> > > ability to perform DMA operations with peripheral devices as
> > > unprivileged user. The permissions to access particular devices still
> > > need to be granted by the system administrator, but only on a one-time
> > > basis. Note that this functionality is supported with DPDK starting
> > > from version 18.11.
> > >
> > > This is what I had described in my previous reply.
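For reference, the one-time grant described above looks roughly like this
from the server side (an illustrative sketch only, not SPDK code; error
handling is omitted, and the VFIO group number and PCI address are made
up). Once the admin has bound the device to vfio-pci and granted the user
access to the /dev/vfio/<group> node, the unprivileged process can map a
BAR and drive the device with plain loads/stores:

```
/* Illustrative sketch of unprivileged device access via VFIO. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/42", O_RDWR);	/* access granted by admin */
	struct vfio_group_status status = { .argsz = sizeof(status) };
	struct vfio_region_info bar0 = {
		.argsz = sizeof(bar0),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};
	int device;
	void *mmio;

	ioctl(group, VFIO_GROUP_GET_STATUS, &status);
	if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
		return 1;				/* group not viable */

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:01:00.0");
	ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &bar0);

	/* MMIO is now plain load/store on the mapping, no further ioctls */
	mmio = mmap(NULL, bar0.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    device, bar0.offset);
	printf("BAR0 mapped at %p (%llu bytes)\n", mmio,
	       (unsigned long long)bar0.size);
	return 0;
}
```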
> > My reference on spdk were mostly from spdk/nvme doc.
> > Just take quick look at spdk code, looks both vfio and direct
> > programming hardware are supported:
> >
> > 1) lib/nvme/nvme_vfio_user.c
> > const struct spdk_nvme_transport_ops vfio_ops {
> >         .qpair_submit_request = nvme_pcie_qpair_submit_request,
>
> Ignore this, it's the userspace vfio-user UNIX domain socket protocol
> support. It's not kernel VFIO and is unrelated to what we're discussing.
> More info on vfio-user: https://spdk.io/news/2021/05/04/vfio-user/

I'm not sure; why does .qpair_submit_request point to
nvme_pcie_qpair_submit_request, then?

> > 2) lib/nvme/nvme_pcie.c
> > const struct spdk_nvme_transport_ops pcie_ops = {
> >         .qpair_submit_request = nvme_pcie_qpair_submit_request
> >                 nvme_pcie_qpair_submit_tracker
> >                 nvme_pcie_qpair_submit_tracker
> >                         nvme_pcie_qpair_ring_sq_doorbell
> >
> > but vfio dma isn't used in nvme_pcie_qpair_submit_request, and simply
> > write/read mmaped mmio.
>
> I have only a small amount of SPDK code experienced, so this might be

Me too.

> wrong, but I think the NVMe PCI driver code does not need to directly
> call VFIO APIs. That is handled by DPDK/SPDK's EAL operating system
> abstractions and device driver APIs.
>
> DMA memory is mapped permanently so the device driver doesn't need to
> perform individual map/unmap operations in the data path. NVMe PCI
> request submission builds the NVMe command structures containing device
> addresses (i.e. IOVAs when IOMMU is enabled).

If the IOMMU isn't used, it is the physical address of memory. Then I
guess you can see why I said this way can't be done by an unprivileged
user: the driver writes physical memory addresses directly to device
registers. But other drivers could follow this approach if the way is
accepted.
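To illustrate the IOVA point, here is a minimal sketch of the long-lived
VFIO DMA mapping being discussed (illustrative only; it assumes a VFIO
container fd set up as in the earlier sketch, and map_io_buffer() is a
made-up helper). The buffer is mapped once at startup at an IOVA chosen by
the application, and that IOVA, not a physical address, is what gets
programmed into the device:

```
/* Illustrative sketch of a long-lived VFIO type1 DMA mapping. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

static void *map_io_buffer(int container, __u64 iova, size_t len)
{
	/* len should be a multiple of the page size */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;

	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(uintptr_t)buf,
		.iova  = iova,	/* what the device will be programmed with */
		.size  = len,
	};

	/*
	 * Done once at startup; the pages stay pinned and mapped, so the
	 * I/O path needs no further map/unmap ioctls.
	 */
	if (ioctl(container, VFIO_IOMMU_MAP_DMA, &map) < 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}
```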
> This code probably supports both IOMMU (VFIO) and non-IOMMU operation.
>
> > > > > > 3) how to operating hardware memory space
> > > > > > - unbind kernel driver and rebind with uio/vfio
> > > > > > - map PCI BAR into userspace[2], then userspace can operate hardware
> > > > > >   with mapped user address via MMIO
> > > > > >
> > > > > > 4) DMA
> > > > > > - DMA requires physical memory address, UBLK driver actually has
> > > > > >   block request pages, so can we export request SG list(each segment
> > > > > >   physical address, offset, len) into userspace? If the max_segments
> > > > > >   limit is not too big(<=64), the needed buffer for holding SG list
> > > > > >   can be small enough.
> > > > >
> > > > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical
> > > > > address. The IOVA space is defined by the IOMMU page tables. Userspace
> > > > > controls the IOMMU page tables via Linux VFIO ioctls.
> > > > >
> > > > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the
> > > > > IOMMU mapping that makes a range of userspace virtual addresses
> > > > > available at a given IOVA.
> > > > >
> > > > > Mapping and unmapping operations are not free. Similar to mmap(2), the
> > > > > program will be slow if it does this frequently.
> > > >
> > > > Yeah, but SPDK shouldn't use vfio DMA interface, see:
> > > >
> > > > https://spdk.io/doc/memory.html
> > > >
> > > > they just programs DMA directly with physical address of pinned hugepages.
> > >
> > > From the page you linked:
> > >
> > > IOMMU Support
> > >
> > > ...
> > >
> > > This is a future-proof, hardware-accelerated solution for performing
> > > DMA operations into and out of a user space process and forms the
> > > long-term foundation for SPDK and DPDK's memory management strategy.
> > > We highly recommend that applications are deployed using vfio and the
> > > IOMMU enabled, which is fully supported today.
> > >
> > > Yes, SPDK supports running without IOMMU, but they recommend running
> > > with the IOMMU.
> > >
> > > > > I think it's effectively the same problem as ublk zero-copy. We want to
> > > > > give the ublk server access to just the I/O buffers that it currently
> > > > > needs, but doing so would be expensive :(.
> > > > >
> > > > > I think Linux has strategies for avoiding the expense like
> > > > > iommu.strict=0 and swiotlb. The drawback is that in our case userspace
> > > > > and/or the hardware device controller by userspace would still have
> > > > > access to the memory pages after I/O has completed. This reduces memory
> > > > > isolation :(.
> > > > >
> > > > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.
> > > >
> > > > Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping.
> > >
> > > When using VFIO (recommended by the docs), SPDK uses long-lived DMA
> > > mappings. Here are places in the SPDK/DPDK source code where VFIO DMA
> > > mapping is used:
> > > https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371
> > > https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164
> >
> > I meant spdk nvme implementation.
>
> I did too. The NVMe PCI driver will use the PCI driver APIs and the EAL
> (operating system abstraction) will deal with IOMMU APIs (VFIO)
> transparently.
>
> > > > > What I'm trying to get at is that either memory isolation is compromised
> > > > > or performance is reduced. It's hard to have good performance together
> > > > > with memory isolation.
> > > > >
> > > > > I think ublk should follow the VFIO philosophy of being a safe
> > > > > kernel/userspace interface. If userspace is malicious or buggy, the
> > > > > kernel's and other process' memory should not be corrupted.
> > > >
> > > > It is tradeoff between performance and isolation, that is why I mention
> > > > that directing programming hardware in userspace can be done by root
> > > > only.
> > >
> > > Yes, there is a trade-off. Over the years the use of unsafe approaches
> > > has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As
> > > secure boot, integrity architecture, and stuff like that becomes more
> > > widely used, it's harder to include features that break memory isolation
> > > in software in mainstream distros. There can be an option to sacrifice
> > > memory isolation for performance and some users may be willing to accept
> > > the trade-off. I think it should be an option feature though.
> > >
> > > I did want to point out that the statement that "direct programming
> > > hardware in userspace can be done by root only" is false (see VFIO).
> >
> > Unfortunately not see vfio is used when spdk/nvme is operating hardware
> > mmio.
>
> I think my responses above answered this, but just to be clear: with
> VFIO PCI userspace mmaps the BARs and performs direct accesses to them
> (load/store instructions). No VFIO API wrappers are necessary for MMIO
> accesses, so the code you posted works fine with VFIO.
>
> > > > > > - small amount of physical memory for using as DMA descriptor can be
> > > > > >   pre-allocated from userspace, and ask kernel to pin pages, then still
> > > > > >   return physical address to userspace for programming DMA
> > > > >
> > > > > I think this is possible today. The ublk server owns the I/O buffers. It
> > > > > can mlock them and DMA map them via VFIO. ublk doesn't need to know
> > > > > anything about this.
> > > >
> > > > It depends on if such VFIO DMA mapping is required for each IO. If it
> > > > is required, that won't help one high performance driver.
> > >
> > > It is not necessary to perform a DMA mapping for each IO. ublk's
> > > existing model is sufficient:
> > > 1. ublk server allocates I/O buffers and VFIO DMA maps them on startup.
> > > 2. At runtime the ublk server provides these I/O buffers to the kernel,
> > >    no further DMA mapping is required.
> > >
> > > Unfortunately there's still the kernel<->userspace copy that existing
> > > ublk applications have, but there's no new overhead related to VFIO.
> >
> > We are working on ublk zero copy for avoiding the copy.
>
> I'm curious if it's possible to come up with a solution that doesn't
> break memory isolation. Userspace controls the IOMMU with Linux VFIO, so
> if kernel pages are exposed to the device, then userspace will also be
> able to access them (e.g. by submitting a request that gets the device
> to DMA those pages).

SPDK NVMe already exposes the physical address of memory and uses it to
program the hardware directly, and I think that can't be done by an
untrusted user. But I agree with you that this approach should be avoided
as far as possible.

> > > > > > - this way is still zero copy
> > > > >
> > > > > True zero-copy would be when an application does O_DIRECT I/O and the
> > > > > hardware device DMAs to/from the application's memory pages. ublk
> > > > > doesn't do that today and when combined with VFIO it doesn't get any
> > > > > easier. I don't think it's possible because you cannot allow userspace
> > > > > to control a hardware device and grant DMA access to pages that
> > > > > userspace isn't allowed to access. A malicious userspace will program
> > > > > the device to access those pages :).
> > > >
> > > > But that should be what SPDK nvme/pci is doing per the above links, :-)
> > >
> > > Sure, it's possible to break memory isolation. Breaking memory isolation
> > > isn't specific to ublk servers that access hardware. The same unsafe
> > > zero-copy approach would probably also work for regular ublk servers.
> > > This is basically bringing back /dev/kmem :).
> > >
> > > > > > 5) notification from hardware: interrupt or polling
> > > > > > - SPDK applies userspace polling, this way is doable, but
> > > > > >   eat CPU, so it is only one choice
> > > > > >
> > > > > > - io_uring command has been proved as very efficient, if io_uring
> > > > > >   command is applied(similar way with UBLK for forwarding blk io
> > > > > >   command from kernel to userspace) to uio/vfio for delivering
> > > > > >   interrupt, which should be efficient too, given batching processes
> > > > > >   are done after the io_uring command is completed
> > > > >
> > > > > I wonder how much difference there is between the new io_uring command
> > > > > for receiving VFIO irqs that you are suggesting compared to the existing
> > > > > io_uring approach IORING_OP_READ eventfd.
> > > >
> > > > eventfd needs extra read/write on the event fd, so more syscalls are
> > > > required.
> > >
> > > No extra syscall is required because IORING_OP_READ is used to read the
> > > eventfd, but maybe you were referring to bypassing the
> > > file->f_op->read() code path?
> >
> > OK, missed that, it is usually done in the following way:
> >
> > io_uring_prep_poll_add(sqe, evfd, POLLIN)
> > sqe->flags |= IOSQE_IO_LINK;
> > ...
> > sqe = io_uring_get_sqe(&ring);
> > io_uring_prep_readv(sqe, evfd, &vec, 1, 0);
> > sqe->flags |= IOSQE_IO_LINK;
> >
> > When I get time, will compare the two and see which one performs better.
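For completeness, here is a slightly fuller sketch of the two variants
being compared (illustrative liburing code, not a benchmark;
io_uring_prep_read is used instead of readv for brevity):

```
/* Illustrative sketch: two ways to consume an irq eventfd with io_uring. */
#include <liburing.h>
#include <poll.h>
#include <stdint.h>

/* variant 1: a single READ SQE; io_uring polls the eventfd internally */
static void wait_irq_read(struct io_uring *ring, int evfd, uint64_t *cnt)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;

	io_uring_prep_read(sqe, evfd, cnt, sizeof(*cnt), 0);
	io_uring_submit(ring);
	io_uring_wait_cqe(ring, &cqe);
	io_uring_cqe_seen(ring, cqe);
}

/* variant 2: POLL_ADD linked to READ, similar to the snippet quoted above */
static void wait_irq_poll_link(struct io_uring *ring, int evfd, uint64_t *cnt)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int i;

	io_uring_prep_poll_add(sqe, evfd, POLLIN);
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, evfd, cnt, sizeof(*cnt), 0);

	io_uring_submit(ring);
	/* two completions: one for the poll, one for the read */
	for (i = 0; i < 2; i++) {
		io_uring_wait_cqe(ring, &cqe);
		io_uring_cqe_seen(ring, cqe);
	}
}
```

The linked variant posts two CQEs per interrupt (one for the poll, one for
the read), which is part of what such a comparison would need to account
for.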
> That would be really interesting.

Anyway, interrupt notification doesn't look like a big deal.

Thanks,
Ming