Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware

On Wed, Feb 15, 2023 at 08:51:27AM +0800, Ming Lei wrote:
> On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > So far UBLK is only used for implementing virtual block devices from
> > > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > 
> > > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > 
> > > > > Thanks for the thoughts, :-)
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > 
> > > > > > > - for fast prototyping or performance evaluation
> > > > > > > 
> > > > > > > - some network storage is attached to the host, such as iscsi and nvme-tcp;
> > > > > > > the current UBLK interface doesn't support such devices, since they need
> > > > > > > all LUNs/Namespaces to share host resources (such as tags)
> > > > > > 
> > > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > What am I missing?
> > > > > 
> > > > > The current ublk can't do that yet, because the interface doesn't
> > > > > support multiple ublk disks sharing a single host, which is exactly
> > > > > the case for scsi and nvme.
> > > > 
> > > > Can you give an example that shows exactly where a problem is hit?
> > > > 
> > > > I took a quick look at the ublk source code and didn't spot a place
> > > > where it prevents a single ublk server process from handling multiple
> > > > devices.
> > > > 
> > > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > userspace.
> > > > 
> > > > I don't understand yet...
> > > 
> > > blk_mq_tag_set is embedded in the driver's host structure and referred to
> > > by each queue via q->tag_set. Both scsi and nvme allocate tags host-wide,
> > > that is, all LUNs/NSs share the host/queue tags. Currently every ublk
> > > device is independent and can't share tags.
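
For reference, a rough sketch of the tag-sharing model described above, with
hypothetical driver names (this is what scsi/nvme-style drivers do, not
something ublk offers today): one blk_mq_tag_set lives in the host structure
and every LUN/namespace queue is created from it, so tags are allocated
host-wide.

```
#include <linux/blk-mq.h>

/* hypothetical host-wide state, mirroring what scsi/nvme drivers do */
struct my_host {
	struct blk_mq_tag_set tag_set;	/* shared by all LUNs/namespaces */
};

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	/* bd->rq->tag came from the host-wide set; a real driver would
	 * issue bd->rq to hardware here and complete it later */
	return BLK_STS_OK;
}

static const struct blk_mq_ops my_mq_ops = {
	.queue_rq = my_queue_rq,
};

static int my_host_init(struct my_host *h)
{
	h->tag_set.ops = &my_mq_ops;
	h->tag_set.nr_hw_queues = 1;
	h->tag_set.queue_depth = 128;		/* host-wide tag space */
	h->tag_set.numa_node = NUMA_NO_NODE;
	return blk_mq_alloc_tag_set(&h->tag_set);
}

/* each LUN/namespace gets its own request_queue, but the tags all come
 * from the single h->tag_set */
static struct request_queue *my_add_lun(struct my_host *h)
{
	return blk_mq_init_queue(&h->tag_set);
}
```
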
> > 
> > Does this actually prevent ublk servers with multiple ublk devices or is
> > it just sub-optimal?
> 
> It is the former: ublk can't support multiple devices which share a single
> host, because duplicated tags can be seen on the host side and I/O then fails.

The kernel sees two independent block devices so there is no issue
within the kernel.

Userspace can do its own hw tag allocation if there are shared storage
controller resources (e.g. NVMe CIDs) to avoid duplicating tags.
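
As a purely hypothetical sketch of that idea (every name below is invented
for illustration): a ublk server whose devices sit behind one controller
could keep a shared bitmap of in-flight identifiers, e.g. NVMe CIDs within a
queue pair, so that two ublk devices never reuse the same identifier
concurrently. A real server would need atomics or per-queue ranges; this is
single-threaded for brevity.

```
#include <stdint.h>

#define SHARED_DEPTH 1024	/* hypothetical shared identifier space */

struct id_allocator {
	uint64_t used[SHARED_DEPTH / 64];
};

/* return a free identifier, or -1 if everything is in flight */
static int id_alloc(struct id_allocator *a)
{
	for (int i = 0; i < SHARED_DEPTH; i++) {
		uint64_t bit = 1ULL << (i % 64);
		if (!(a->used[i / 64] & bit)) {
			a->used[i / 64] |= bit;
			return i;
		}
	}
	return -1;
}

static void id_free(struct id_allocator *a, int id)
{
	a->used[id / 64] &= ~(1ULL << (id % 64));
}
```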

Have I missed something?

> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > - SPDK has supported user space drivers for real hardware
> > > > > > 
> > > > > > I think this could already be implemented today. There will be extra
> > > > > > memory copies because SPDK won't have access to the application's memory
> > > > > > pages.
> > > > > 
> > > > > Here I proposed zero copy, and the current SPDK nvme-pci implementation
> > > > > doesn't have such an extra copy, per my understanding.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > So I propose to extend UBLK to support real hardware devices:
> > > > > > > 
> > > > > > > 1) extend the UBLK ABI interface to support disks attached to a host, such
> > > > > > > as SCSI LUNs/NVMe Namespaces
> > > > > > > 
> > > > > > > 2) the following is related to operating hardware from userspace, so the
> > > > > > > userspace driver has to be trusted, root is required, and unprivileged
> > > > > > > UBLK devices can't be supported
> > > > > > 
> > > > > > Linux VFIO provides a safe userspace API for userspace device drivers.
> > > > > > That means memory and interrupts are isolated. Neither userspace nor the
> > > > > > hardware device can access memory or interrupts that the userspace
> > > > > > process is not allowed to access.
> > > > > > 
> > > > > > I think there are still limitations like all memory pages exposed to the
> > > > > > device need to be pinned. So effectively you might still need privileges
> > > > > > to get the mlock resource limits.
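
For illustration, a minimal sketch of the kernel VFIO (type1 IOMMU) flow
being referred to here. The IOMMU group number, PCI address, and sizes are
placeholders and error handling is omitted. The key point: the device can
only DMA into regions userspace explicitly maps, and mapping pins those
pages, which is where the mlock/RLIMIT_MEMLOCK concern comes from.

```
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int vfio_map_example(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/42", O_RDWR);	/* hypothetical IOMMU group */

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* allocate a DMA buffer; VFIO_IOMMU_MAP_DMA below pins its pages */
	void *buf = mmap(0, 2 << 20, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)buf,
		.iova  = 0,			/* device-visible address */
		.size  = 2 << 20,
	};
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

	/* BARs and interrupts are then reached through this fd, not /dev/mem */
	return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:01:00.0");
}
```
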
> > > > > > 
> > > > > > But overall I think what you're saying about root and unprivileged ublk
> > > > > > devices is not true. Hardware support should be developed with the goal
> > > > > > of supporting unprivileged userspace ublk servers.
> > > > > > 
> > > > > > Those unprivileged userspace ublk servers cannot claim any PCI device
> > > > > > they want. The user/admin will need to give them permission to open a
> > > > > > network card, SCSI HBA, etc.
> > > > > 
> > > > > It depends on the implementation; please see
> > > > > 
> > > > > 	https://spdk.io/doc/userspace.html
> > > > > 
> > > > > 	```
> > > > > 	The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
> > > > > 	then follows along with the NVMe Specification to initialize the device,
> > > > > 	create queue pairs, and ultimately send I/O.
> > > > > 	```
> > > > > 
> > > > > The above way needs userspace to operate the hardware through the mapped
> > > > > BAR, which can't be allowed for an unprivileged user.
> > > > 
> > > > From https://spdk.io/doc/system_configuration.html:
> > > > 
> > > >   Running SPDK as non-privileged user
> > > > 
> > > >   One of the benefits of using the VFIO Linux kernel driver is the
> > > >   ability to perform DMA operations with peripheral devices as
> > > >   unprivileged user. The permissions to access particular devices still
> > > >   need to be granted by the system administrator, but only on a one-time
> > > >   basis. Note that this functionality is supported with DPDK starting
> > > >   from version 18.11.
> > > > 
> > > > This is what I had described in my previous reply.
> > > 
> > > My references on spdk were mostly from the spdk/nvme doc.
> > > I just took a quick look at the spdk code; it looks like both vfio and
> > > directly programming the hardware are supported:
> > > 
> > > 1) lib/nvme/nvme_vfio_user.c
> > > const struct spdk_nvme_transport_ops vfio_ops = {
> > > 	.qpair_submit_request = nvme_pcie_qpair_submit_request,
> > 
> > Ignore this, it's the userspace vfio-user UNIX domain socket protocol
> > support. It's not kernel VFIO and is unrelated to what we're discussing.
> > More info on vfio-user: https://spdk.io/news/2021/05/04/vfio-user/
> 
> Not sure, why does .qpair_submit_request point to
> nvme_pcie_qpair_submit_request?

The lib/nvme/nvme_vfio_user.c code is for when SPDK connects to a
vfio-user NVMe PCI device. The vfio-user protocol support is not handled
by the regular DPDK/SPDK PCI driver APIs, so the lib/nvme/nvme_pcie.c code
doesn't work with vfio-user devices.

However, a lot of the code can be shared with the regular NVMe PCI
driver and that's why .qpair_submit_request points to
nvme_pcie_qpair_submit_request instead of a special version for
vfio-user.

If the vfio-user protocol becomes more widely used for other devices
besides NVMe PCI, then I guess the DPDK/SPDK developers will figure out
a way to move the vfio-user code into the core PCI driver API so that a
single lib/nvme/nvme_pcie.c file works with all PCI APIs (kernel VFIO,
vfio-user, etc). The code was probably structured like this because it's
hard to make those changes and they wanted to get vfio-user NVMe PCI
working quickly.
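
As a stripped-down illustration of that structure (this is not the actual
SPDK source, just the shape of it, with invented names): two transport
definitions can point .qpair_submit_request at the same fast-path routine
while differing only in how the controller is set up.

```
/* not the real SPDK definitions, just the shape of them */
struct transport_ops_sketch {
	void (*ctrlr_construct)(void);
	void (*qpair_submit_request)(void);
};

static void nvme_pcie_qpair_submit_request_sketch(void)
{
	/* builds the NVMe command and rings the submission queue doorbell */
}

static void pcie_ctrlr_construct_sketch(void)
{
	/* maps BARs through the kernel VFIO / DPDK EAL PCI layer */
}

static void vfio_user_ctrlr_construct_sketch(void)
{
	/* sets the device up over the vfio-user UNIX domain socket protocol */
}

/* regular NVMe PCI transport */
static const struct transport_ops_sketch pcie_ops = {
	.ctrlr_construct      = pcie_ctrlr_construct_sketch,
	.qpair_submit_request = nvme_pcie_qpair_submit_request_sketch,
};

/* vfio-user transport: different setup, same submission routine */
static const struct transport_ops_sketch vfio_ops = {
	.ctrlr_construct      = vfio_user_ctrlr_construct_sketch,
	.qpair_submit_request = nvme_pcie_qpair_submit_request_sketch,
};
```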

> 
> > 
> > > 
> > > 
> > > 2) lib/nvme/nvme_pcie.c
> > > const struct spdk_nvme_transport_ops pcie_ops = {
> > > 	.qpair_submit_request = nvme_pcie_qpair_submit_request
> > > 		nvme_pcie_qpair_submit_tracker
> > > 			nvme_pcie_qpair_ring_sq_doorbell
> > > 
> > > but vfio dma isn't used in nvme_pcie_qpair_submit_request; it simply
> > > writes/reads the mmapped mmio.
> > 
> > I have only a small amount of SPDK code experience, so this might be
> 
> Me too.
> 
> > wrong, but I think the NVMe PCI driver code does not need to directly
> > call VFIO APIs. That is handled by DPDK/SPDK's EAL operating system
> > abstractions and device driver APIs.
> > 
> > DMA memory is mapped permanently so the device driver doesn't need to
> > perform individual map/unmap operations in the data path. NVMe PCI
> > request submission builds the NVMe command structures containing device
> > addresses (i.e. IOVAs when IOMMU is enabled).
> 
> If the IOMMU isn't used, it is the physical address of the memory.
> 
> Then I guess you may understand why I said this way can't be done by an
> unprivileged user, because the driver is writing the memory's physical
> address to the device register directly.
> 
> But other drivers can follow this approach if the way is accepted.

Okay, I understand now that you were thinking of non-IOMMU use cases.
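
For completeness, a small sketch of how the device-visible address ends up
in the command (assuming an already-initialized SPDK env; spdk_dma_zmalloc
and spdk_vtophys are real SPDK env APIs, the helper itself is hypothetical):
buffers come from pre-mapped, pinned hugepage memory, so there is no
per-I/O map/unmap, and spdk_vtophys() reports an IOVA when vfio/IOMMU is in
use or the raw physical address otherwise, which is the privileged case
described above.

```
#include "spdk/env.h"

uint64_t prep_prp1(void **bufp)
{
	/* DMA-able buffer from SPDK's pinned, pre-mapped memory */
	void *buf = spdk_dma_zmalloc(4096, 4096, NULL);
	uint64_t len = 4096;

	*bufp = buf;
	/* device-visible address to place in the command's PRP1 field;
	 * returns SPDK_VTOPHYS_ERROR if the buffer isn't DMA-mappable */
	return spdk_vtophys(buf, &len);
}
```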

Stefan
