Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware

On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > Hello,
> > > > > 
> > > > > So far UBLK is only used for implementing virtual block device from
> > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > 
> > > > I won't be at LSF/MM so here are my thoughts:
> > > 
> > > Thanks for the thoughts, :-)
> > > 
> > > > 
> > > > > 
> > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > 
> > > > > - for fast prototype or performance evaluation
> > > > > 
> > > > > - some network storage is attached to the host, such as iscsi and nvme-tcp;
> > > > > the current UBLK interface doesn't support such devices, since they need
> > > > > all LUNs/Namespaces to share host resources (such as tags)
> > > > 
> > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > What am I missing?
> > > 
> > > The current ublk can't do that yet, because the interface doesn't
> > > support multiple ublk disks sharing a single host, which is exactly
> > > the case for scsi and nvme.
> > 
> > Can you give an example that shows exactly where a problem is hit?
> > 
> > I took a quick look at the ublk source code and didn't spot a place
> > where it prevents a single ublk server process from handling multiple
> > devices.
> > 
> > Regarding "host resources(such as tag)", can the ublk server deal with
> > that in userspace? The Linux block layer doesn't have the concept of a
> > "host", that would come in at the SCSI/NVMe level that's implemented in
> > userspace.
> > 
> > I don't understand yet...
> 
> blk_mq_tag_set is embedded into the driver's host structure and referred to
> by the queue via q->tag_set. Both scsi and nvme allocate tags host/queue
> wide, that is, all LUNs/NSs share the host/queue tags. Currently every ublk
> device is independent and can't share tags.

Does this actually prevent a single ublk server from handling multiple ublk
devices, or is it just sub-optimal?

Also, is this specific to real storage hardware? I guess userspace
NVMe-over-TCP or iSCSI initiators would be affected regardless of
whether they simply use the Sockets API (software) or userspace device
drivers (hardware).

Sorry for all these questions, I think I'm a little confused because you
said "doesn't support such devices" and I thought this discussion was
about real storage hardware. Neither of these seem to apply to the
tag_set issue.
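
To make sure we're talking about the same thing, here is a simplified sketch
of where that shared tag space lives. The example_* structures are just
placeholders for illustration, not the real kernel definitions:

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

struct example_host {			/* e.g. Scsi_Host or nvme_ctrl */
	struct blk_mq_tag_set tag_set;	/* one tag space for the whole host */
};

struct example_lun {			/* e.g. scsi_device or nvme_ns */
	struct request_queue *queue;	/* queue->tag_set == &host->tag_set */
};

struct example_ublk_dev {		/* current ublk model */
	struct blk_mq_tag_set tag_set;	/* private tag space per device */
	struct request_queue *queue;
};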

> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > - SPDK has supported user space driver for real hardware
> > > > 
> > > > I think this could already be implemented today. There will be extra
> > > > memory copies because SPDK won't have access to the application's memory
> > > > pages.
> > > 
> > > Here I proposed zero copy, and the current SPDK nvme-pci implementation
> > > doesn't have such an extra copy, per my understanding.
> > > 
> > > > 
> > > > > 
> > > > > So propose to extend UBLK for supporting real hardware device:
> > > > > 
> > > > > 1) extend UBLK ABI interface to support disks attached to host, such
> > > > > as SCSI Luns/NVME Namespaces
> > > > > 
> > > > > 2) the following items are related to operating hardware from userspace,
> > > > > so the userspace driver has to be trusted, root is required, and
> > > > > unprivileged UBLK devices can't be supported
> > > > 
> > > > Linux VFIO provides a safe userspace API for userspace device drivers.
> > > > That means memory and interrupts are isolated. Neither userspace nor the
> > > > hardware device can access memory or interrupts that the userspace
> > > > process is not allowed to access.
> > > > 
> > > > I think there are still limitations like all memory pages exposed to the
> > > > device need to be pinned. So effectively you might still need privileges
> > > > to get the mlock resource limits.
> > > > 
> > > > But overall I think what you're saying about root and unprivileged ublk
> > > > devices is not true. Hardware support should be developed with the goal
> > > > of supporting unprivileged userspace ublk servers.
> > > > 
> > > > Those unprivileged userspace ublk servers cannot claim any PCI device
> > > > they want. The user/admin will need to give them permission to open a
> > > > network card, SCSI HBA, etc.
> > > 
> > > It depends on implementation, please see
> > > 
> > > 	https://spdk.io/doc/userspace.html
> > > 
> > > 	```
> > > 	The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
> > > 	then follows along with the NVMe Specification to initialize the device,
> > > 	create queue pairs, and ultimately send I/O.
> > > 	```
> > > 
> > > The above way needs userspace to operate the hardware via the mapped BAR,
> > > which can't be allowed for an unprivileged user.
> > 
> > From https://spdk.io/doc/system_configuration.html:
> > 
> >   Running SPDK as non-privileged user
> > 
> >   One of the benefits of using the VFIO Linux kernel driver is the
> >   ability to perform DMA operations with peripheral devices as
> >   unprivileged user. The permissions to access particular devices still
> >   need to be granted by the system administrator, but only on a one-time
> >   basis. Note that this functionality is supported with DPDK starting
> >   from version 18.11.
> > 
> > This is what I had described in my previous reply.
> 
> My references on spdk were mostly from the spdk/nvme doc.
> Just took a quick look at the spdk code; it looks like both vfio and
> directly programming the hardware are supported:
> 
> 1) lib/nvme/nvme_vfio_user.c
> const struct spdk_nvme_transport_ops vfio_ops = {
> 	.qpair_submit_request = nvme_pcie_qpair_submit_request,

Ignore this, it's the userspace vfio-user UNIX domain socket protocol
support. It's not kernel VFIO and is unrelated to what we're discussing.
More info on vfio-user: https://spdk.io/news/2021/05/04/vfio-user/

> 
> 
> 2) lib/nvme/nvme_pcie.c
> const struct spdk_nvme_transport_ops pcie_ops = {
> 	.qpair_submit_request = nvme_pcie_qpair_submit_request
> 		nvme_pcie_qpair_submit_tracker
> 			nvme_pcie_qpair_ring_sq_doorbell
> 
> but vfio dma isn't used in nvme_pcie_qpair_submit_request; it simply
> writes/reads the mmapped mmio.

I have only a small amount of experience with the SPDK code, so this might be
wrong, but I think the NVMe PCI driver code does not need to directly
call VFIO APIs. That is handled by DPDK/SPDK's EAL operating system
abstractions and device driver APIs.

DMA memory is mapped permanently so the device driver doesn't need to
perform individual map/unmap operations in the data path. NVMe PCI
request submission builds the NVMe command structures containing device
addresses (i.e. IOVAs when IOMMU is enabled).

This code probably supports both IOMMU (VFIO) and non-IOMMU operation.
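
To make the long-lived mapping concrete, here is a minimal sketch against the
kernel VFIO type1 API (error handling omitted; it assumes a container fd that
already has a group attached and VFIO_SET_IOMMU done, and the map_io_buffer()
helper name is just for illustration):

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * One-time DMA mapping of an I/O buffer, done at startup. Afterwards the
 * data path only places IOVAs into NVMe commands and never touches VFIO.
 */
static int map_io_buffer(int container_fd, void *buf, size_t len, uint64_t iova)
{
	struct vfio_iommu_type1_dma_map dma_map = {
		.argsz = sizeof(dma_map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)buf,	/* userspace virtual address */
		.iova  = iova,			/* address the device will use */
		.size  = len,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
}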

> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > 3) how to operating hardware memory space
> > > > > - unbind kernel driver and rebind with uio/vfio
> > > > > - map PCI BAR into userspace[2], then userspace can operate hardware
> > > > > with mapped user address via MMIO
> > > > >
> > > > > 4) DMA
> > > > > - DMA requires physical memory addresses, and the UBLK driver actually has
> > > > > the block request pages, so can we export the request SG list (each segment's
> > > > > physical address, offset, len) to userspace? If the max_segments
> > > > > limit is not too big (<=64), the buffer needed for holding the SG list
> > > > > can be small enough.
> > > > 
> > > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical
> > > > address. The IOVA space is defined by the IOMMU page tables. Userspace
> > > > controls the IOMMU page tables via Linux VFIO ioctls.
> > > > 
> > > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the
> > > > IOMMU mapping that makes a range of userspace virtual addresses
> > > > available at a given IOVA.
> > > > 
> > > > Mapping and unmapping operations are not free. Similar to mmap(2), the
> > > > program will be slow if it does this frequently.
> > > 
> > > Yeah, but SPDK shouldn't use vfio DMA interface, see:
> > > 
> > > https://spdk.io/doc/memory.html
> > > 
> > > they just program DMA directly with the physical addresses of pinned hugepages.
> > 
> > From the page you linked:
> > 
> >   IOMMU Support
> > 
> >   ...
> > 
> >   This is a future-proof, hardware-accelerated solution for performing
> >   DMA operations into and out of a user space process and forms the
> >   long-term foundation for SPDK and DPDK's memory management strategy.
> >   We highly recommend that applications are deployed using vfio and the
> >   IOMMU enabled, which is fully supported today.
> > 
> > Yes, SPDK supports running without IOMMU, but they recommend running
> > with the IOMMU.
> > 
> > > 
> > > > 
> > > > I think it's effectively the same problem as ublk zero-copy. We want to
> > > > give the ublk server access to just the I/O buffers that it currently
> > > > needs, but doing so would be expensive :(.
> > > > 
> > > > I think Linux has strategies for avoiding the expense like
> > > > iommu.strict=0 and swiotlb. The drawback is that in our case userspace
> > > > and/or the hardware device controlled by userspace would still have
> > > > access to the memory pages after I/O has completed. This reduces memory
> > > > isolation :(.
> > > > 
> > > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.
> > > 
> > > Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping.
> > 
> > When using VFIO (recommended by the docs), SPDK uses long-lived DMA
> > mappings. Here are places in the SPDK/DPDK source code where VFIO DMA
> > mapping is used:
> > https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371
> > https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164
> 
> I meant spdk nvme implementation.

I did too. The NVMe PCI driver will use the PCI driver APIs and the EAL
(operating system abstraction) will deal with IOMMU APIs (VFIO)
transparently.
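
If I remember the env API correctly (so treat the exact calls below as
assumptions rather than a reference), the driver side looks roughly like this:
it allocates DMA-able memory through the env layer and asks the env layer for
the device address. With the IOMMU on, that address is an IOVA the EAL already
mapped through VFIO; without it, it is the physical address of a pinned
hugepage.

#include "spdk/env.h"

/* Allocate a 4 KiB DMA-able buffer and return its device address. */
void *alloc_dma_buffer(uint64_t *dev_addr)
{
	void *buf = spdk_zmalloc(0x1000, 0x1000, NULL,
				 SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);

	if (buf != NULL)
		*dev_addr = spdk_vtophys(buf, NULL);	/* IOVA or physical address */
	return buf;
}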

> 
> > 
> > > 
> > > > 
> > > > What I'm trying to get at is that either memory isolation is compromised
> > > > or performance is reduced. It's hard to have good performance together
> > > > with memory isolation.
> > > > 
> > > > I think ublk should follow the VFIO philosophy of being a safe
> > > > kernel/userspace interface. If userspace is malicious or buggy, the
> > > > kernel's and other process' memory should not be corrupted.
> > > 
> > > It is a tradeoff between performance and isolation; that is why I mention
> > > that directly programming hardware in userspace can be done by root
> > > only.
> > 
> > Yes, there is a trade-off. Over the years the use of unsafe approaches
> > has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As
> > secure boot, integrity architecture, and stuff like that becomes more
> > widely used, it's harder to include features that break memory isolation
> > in software in mainstream distros. There can be an option to sacrifice
> > memory isolation for performance and some users may be willing to accept
> > the trade-off. I think it should be an optional feature though.
> > 
> > I did want to point out that the statement that "direct programming
> > hardware in userspace can be done by root only" is false (see VFIO).
> 
> Unfortunately I don't see vfio being used when spdk/nvme is operating the
> hardware mmio.

I think my responses above answered this, but just to be clear: with
VFIO PCI userspace mmaps the BARs and performs direct accesses to them
(load/store instructions). No VFIO API wrappers are necessary for MMIO
accesses, so the code you posted works fine with VFIO.
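
To illustrate (a minimal sketch with error handling omitted; the helper names
and the 0x1000 doorbell offset are just for the example):

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

/* Map BAR0 of a vfio-pci device; after this, MMIO is plain loads/stores. */
static volatile uint32_t *map_bar0(int device_fd)
{
	struct vfio_region_info reg = {
		.argsz = sizeof(reg),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};

	ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);

	return mmap(NULL, reg.size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, device_fd, reg.offset);
}

/* Ring a submission queue doorbell: a single MMIO store, no VFIO call. */
static void ring_sq_doorbell(volatile uint32_t *bar0, uint32_t sq_tail)
{
	bar0[0x1000 / sizeof(uint32_t)] = sq_tail;
}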

> 
> > 
> > > > 
> > > > > 
> > > > > - small amount of physical memory for using as DMA descriptor can be
> > > > > pre-allocated from userspace, and ask kernel to pin pages, then still
> > > > > return physical address to userspace for programming DMA
> > > > 
> > > > I think this is possible today. The ublk server owns the I/O buffers. It
> > > > can mlock them and DMA map them via VFIO. ublk doesn't need to know
> > > > anything about this.
> > > 
> > > It depends on whether such a VFIO DMA mapping is required for each IO. If it
> > > is required, that won't help a high performance driver.
> > 
> > It is not necessary to perform a DMA mapping for each IO. ublk's
> > existing model is sufficient:
> > 1. ublk server allocates I/O buffers and VFIO DMA maps them on startup.
> > 2. At runtime the ublk server provides these I/O buffers to the kernel,
> >    no further DMA mapping is required.
> > 
> > Unfortunately there's still the kernel<->userspace copy that existing
> > ublk applications have, but there's no new overhead related to VFIO.
> 
> We are working on ublk zero copy for avoiding the copy.

I'm curious if it's possible to come up with a solution that doesn't
break memory isolation. Userspace controls the IOMMU with Linux VFIO, so
if kernel pages are exposed to the device, then userspace will also be
able to access them (e.g. by submitting a request that gets the device
to DMA those pages).

> 
> > 
> > > > 
> > > > > - this way is still zero copy
> > > > 
> > > > True zero-copy would be when an application does O_DIRECT I/O and the
> > > > hardware device DMAs to/from the application's memory pages. ublk
> > > > doesn't do that today and when combined with VFIO it doesn't get any
> > > > easier. I don't think it's possible because you cannot allow userspace
> > > > to control a hardware device and grant DMA access to pages that
> > > > userspace isn't allowed to access. A malicious userspace will program
> > > > the device to access those pages :).
> > > 
> > > But that should be what SPDK nvme/pci is doing per the above links, :-)
> > 
> > Sure, it's possible to break memory isolation. Breaking memory isolation
> > isn't specific to ublk servers that access hardware. The same unsafe
> > zero-copy approach would probably also work for regular ublk servers.
> > This is basically bringing back /dev/kmem :).
> > 
> > > 
> > > > 
> > > > > 
> > > > > 5) notification from hardware: interrupt or polling
> > > > > - SPDK applies userspace polling; this way is doable, but
> > > > > eats CPU, so it is only one choice
> > > > > 
> > > > > - io_uring command has been proven very efficient; if io_uring
> > > > > command is applied (in a similar way to how UBLK forwards blk io
> > > > > commands from kernel to userspace) to uio/vfio for delivering interrupts,
> > > > > that should be efficient too, given that batch processing is done after
> > > > > the io_uring command is completed
> > > > 
> > > > I wonder how much difference there is between the new io_uring command
> > > > for receiving VFIO irqs that you are suggesting compared to the existing
> > > > io_uring approach IORING_OP_READ eventfd.
> > > 
> > > eventfd needs extra read/write on the event fd, so more syscalls are
> > > required.
> > 
> > No extra syscall is required because IORING_OP_READ is used to read the
> > eventfd, but maybe you were referring to bypassing the
> > file->f_op->read() code path?
> 
> OK, missed that, it is usually done in the following way:
> 
> 	io_uring_prep_poll_add(sqe, evfd, POLLIN);
> 	sqe->flags |= IOSQE_IO_LINK;
> 	...
> 	sqe = io_uring_get_sqe(&ring);
> 	io_uring_prep_readv(sqe, evfd, &vec, 1, 0);
> 	sqe->flags |= IOSQE_IO_LINK;
> 
> When I get time, I will compare the two and see which one performs better.

That would be really interesting.
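
For reference, the IORING_OP_READ variant I had in mind looks roughly like
this (a minimal, untested liburing sketch; queue_evfd_read() is just an
illustrative helper name):

#include <stdint.h>
#include <liburing.h>

/*
 * A single read SQE on the eventfd, no linked POLL_ADD: the read completes
 * once the eventfd counter becomes non-zero (the eventfd is not opened
 * O_NONBLOCK here). Error handling omitted.
 */
static void queue_evfd_read(struct io_uring *ring, int evfd, uint64_t *val)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_read(sqe, evfd, val, sizeof(*val), 0);
	io_uring_submit(ring);
	/* io_uring_wait_cqe() then returns when the interrupt eventfd fires. */
}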

Stefan
