RE: [PATCH RFC] Introduce verbs API for NVMe-oF target offload

> From: linux-rdma-owner@xxxxxxxxxxxxxxx [mailto:linux-rdma-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Oren Duer
> 
> This RFC patch introduces a libibverbs API for offloading NVMe-over-Fabrics
> target protocol to HCA.
> 
> Using this feature, the HCA will terminate the target side NVMe-oF protocol and
> submit the NVMe requests directly to NVMe devices over PCI using peer-to-peer
> communication. No CPU involvement is needed beyond first configuration,
> setting
> up connections with clients, and exceptions handling.
> 
> Userspace applications that implement NVMe-oF target, such as SPDK, are
> the target for this capability.
> 
> Full explanation is in Documentation/nvmf_offload.md, part of this patch.

Thanks for the RFC and sorry for the delayed response. We have some requests and comments.

We have two major requirements:
1) Allow an RNIC to not support non-offloaded operations on an offloaded QP.
Once the QP is modified to offloaded, only a single SQ WQE carrying the connect
response will be executed, and after that no additional SQ WQEs will be allowed
from the consumer.

2) Add an optimization for event notification of NVMe backend completions.
This can be done by binding the NVMe backend CQ to an interrupt vector dedicated
to the RNIC and programming that vector's MSI-X table entry with a device
specific address and value. Additional configuration parameters and management
of interrupt vectors should be added for this; see the strawman below.
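
As a strawman only (every name below is invented for the sake of discussion,
nothing here comes from the patch), the kind of configuration parameters I have
in mind would look roughly like:

struct ibv_nvmf_backend_irq_attr {          /* invented name */
	uint32_t irq_vector;    /* interrupt vector dedicated to the RNIC */
	uint64_t msix_addr;     /* device specific MSI-X message address  */
	uint32_t msix_data;     /* device specific MSI-X message data     */
};

Software would then program the NVMe backend CQ's MSI-X table entry with that
address/data pair so that backend completions notify the RNIC directly.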

See additional comments (some related to above requirements) inline.

> +### Creating a SRQ with NVMe-oF offload attributes
> +
> +A SRQ with NVMe-oF target offload represents a single NVMe-oF subsystem (a
> +storage target) to the fabric.  Software should call
> `ibv_create_srq_ex()`, with
> +`ibv_srq_init_attr_ex.srq_type` set to `nvmf_target_offload` and
> +`ibv_srq_init_attr_ex.nvmf_attr` set to the specific offload parameters
> +requested for this SRQ. Parameters should be within the boundaries of the
> +respected capabilities. Along with the parameters, a staging buffer is provided
> +for the device use during the offload. This is a piece of memory allocated,
> +registered and provided via {mr, addr, len}. Software should not modify this
> +memory after creating the SRQ, as the device manages it by itself, and stores
> +there data that is in transit between network and NVMe device.
> +
> +Note that this SRQ still has a receive queue. HCA will deliver to software
> +received commands that are not offloaded, and received commands from QPs
> +attached to the SRQ that are not configured with NVMF_OFFLOAD enabled.
> +

Who is responsible for mapping the staging buffer to all NVMe backend devices?
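
For context on why I'm asking, this is how I read the intended creation flow.
Only ibv_create_srq_ex(), the nvmf_target_offload SRQ type, the nvmf_attr member
and the {mr, addr, len} staging buffer come from the doc above; the IBV_SRQT_NVMF
enum name, the staging_buf field names and the comp_mask handling are my guesses
(usual verbs.h/stdlib includes assumed):

static struct ibv_srq *create_nvmf_srq(struct ibv_context *ctx,
				       struct ibv_pd *pd, struct ibv_cq *cq,
				       size_t staging_len)
{
	struct ibv_srq_init_attr_ex attr = {};
	void *buf = NULL;
	struct ibv_mr *mr;

	/* Staging buffer the device uses for data in transit between the
	 * network and the NVMe device; software must not touch it afterwards. */
	if (posix_memalign(&buf, 4096, staging_len))
		return NULL;
	mr = ibv_reg_mr(pd, buf, staging_len, IBV_ACCESS_LOCAL_WRITE);
	if (!mr)
		return NULL;

	attr.attr.max_wr = 256;          /* RQ still receives non-offloaded capsules */
	attr.attr.max_sge = 1;
	attr.pd = pd;
	attr.cq = cq;
	attr.srq_type = IBV_SRQT_NVMF;   /* guessed name for nvmf_target_offload */
	attr.comp_mask = IBV_SRQ_INIT_ATTR_TYPE | IBV_SRQ_INIT_ATTR_PD |
			 IBV_SRQ_INIT_ATTR_CQ;   /* plus an NVMF bit, presumably */
	/* Guessed member names for the staging buffer and offload parameters: */
	attr.nvmf_attr.staging_buf.mr   = mr;
	attr.nvmf_attr.staging_buf.addr = buf;
	attr.nvmf_attr.staging_buf.len  = staging_len;

	return ibv_create_srq_ex(ctx, &attr);
}

If the answer to the question above is "software", it would help if the doc
spelled out where in this flow the staging buffer gets mapped for the NVMe
backends (e.g. whether the MR registration above is enough).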

> +### Modifying QP to enable NVMe-oF offload
> +
> +Once a CONNECT command was received, software can modify the QP to
> enable its
> +NVMe-oF offload. `ibv_modify_qp_nvmf()` should be used to enable NVMe-oF
> +offload.  From this point on, the HCA is taking ownership over the QP,
> +inspecting each command capsule received by the SRQ, and if this should be an
> +offloaded command, the flow described above is followed.
> +
> +Note that enabling the NVMe-oF offload on the QP when created exposes the
> +solution to possible standard violation: if an IO command capsule will arrive
> +before the CONNECT request, the device will service it.

I didn't see any additional reference to 'ibv_modify_qp_nvmf'. Where is it defined?
Does 'ibv_qp_set_nvmf' replace it?
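
Assuming the ibv_modify_qp_nvmf() name from the doc text sticks (its signature
isn't shown anywhere, so the call below is a guess, and capsule_is_connect() /
post_connect_response() are placeholder helpers), the flow I expect software to
implement per QP is roughly:

	struct ibv_wc wc;

	while (ibv_poll_cq(cq, 1, &wc) > 0) {
		if (wc.status != IBV_WC_SUCCESS || wc.opcode != IBV_WC_RECV)
			continue;
		if (capsule_is_connect(wc.wr_id)) {     /* placeholder helper  */
			post_connect_response(qp);      /* placeholder helper  */
			if (ibv_modify_qp_nvmf(qp))     /* signature guessed   */
				break;                  /* handle failure      */
		}
	}

With requirement 1 above, the connect response would be the last SQ WQE the
consumer is allowed to post on this QP.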

> +++ b/libibverbs/man/ibv_map_nvmf_nsid.3
> +.B ibv_map_nvmf_nsid()
> +adds a new NVMe-oF namespace mapping to a given \fInvme_ctrl\fR.
> +The mapping is from the fabric facing frontend namespace ID
> +.I fe_nsid
> +to namespace
> +.I nvme_nsid
> +on this NVMe subsystem.
> +.I fe_nsid
> +must be unique within the SRQ that
> +.I nvme_ctrl
> +belongs to, all ibv_nvme_ctrl objects attached to the same SRQ share
> the same number space.
> +.PP
> +.I lba_data_size
> +defines the block size this namespace is formatted to, in bytes. Only
> specific block sizes are supported by the device.
> +Mapping several namespaces of the same NVMe subsystem will be done by
> calling this function several times with the same
> +.I nvme_ctrl
> +while assigning different
> +.I fe_nsid
> +and
> +.I nvme_nsid
> +with each call.

Many NVMe controllers require multiple QPs in order to get high bandwidth. Having
a single NVMe QP (nvme_ctrl object) for each NVMe-oF namespace might be a bottleneck.
Would it be worth accepting a list of nvme_ctrl objects to be used by the device when
accessing this namespace? Something like the strawman below, perhaps.
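
/* Strawman only: map one frontend namespace to a set of backend NVMe queues so
 * the device can spread IO across them. The _multi name, the parameter types
 * and the signature are invented; fe_nsid, nvme_nsid and lba_data_size are the
 * existing parameters from the man page. */
int ibv_map_nvmf_nsid_multi(struct ibv_nvme_ctrl **nvme_ctrls, int num_ctrls,
			    uint32_t fe_nsid, uint16_t lba_data_size,
			    uint32_t nvme_nsid);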

> +.SH "RETURN VALUE"
> +.B ibv_map_nvmf_nsid()
> +and
> +.B ibv_unmap_nvmf_nsid
> +returns 0 on success, or the value of errno on failure (which
> indicates the failure reason).
> +.PP
> +failure reasons may be:
> +.IP EEXIST
> +Trying to map an already existing front-facing
> +.I fe_nsid
> +.IP ENOENT
> +Trying to delete a non existing front-facing
> +.I fe_nsid
> +.IP ENOTSUP
> +Given
> +.I lba_data_size
> +is not supported on this device. Check device release notes for
> supported sizes. Format the NVMe namespace to a LBA Format where the
> data + metadata size is supported by the device.

Why shouldn't this be exposed as a capability instead of having to check the device release notes?

> diff --git a/libibverbs/man/ibv_query_device_ex.3
> b/libibverbs/man/ibv_query_device_ex.3
> index 1172523..0336583 100644
> --- a/libibverbs/man/ibv_query_device_ex.3
> +++ b/libibverbs/man/ibv_query_device_ex.3
> +enum ibv_nvmf_offload_type {
> +.in +8
> +IBV_NVMF_WRITE_OFFLOAD            = 1 << 0,
> +IBV_NVMF_READ_OFFLOAD             = 1 << 1,
> +IBV_NVMF_READ_WRITE_OFFLOAD       = 1 << 2,
> +IBV_NVMF_READ_WRITE_FLUSH_OFFLOAD = 1 << 3,
> +.in -8
> +};
> +

How about adding 'write_zeroes' and 'dataset_management' as well?
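
i.e. something like the following (bit positions and names are just placeholders):

	IBV_NVMF_WRITE_ZEROES_OFFLOAD       = 1 << 4,   /* proposed */
	IBV_NVMF_DATASET_MANAGEMENT_OFFLOAD = 1 << 5,   /* proposed */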

> +struct ibv_nvmf_caps {
> +.in +8
> +enum ibv_nvmf_offload_type offload_type;        /* Which NVMe-oF
> operations can be offloaded, 0 = no NVMf offload support */
> +uint32_t        max_backend_ctrls_total;    /* Max NVMe backend
> controllers total for the device */
> +uint32_t        max_srq_backend_ctrls;      /* Max NVMe backend
> controllers per SRQ */
> +uint32_t        max_srq_namespaces;         /* Max namespaces per SRQ */
> +uint32_t        min_staging_buf_size;       /* Min size of the
> staging buffer */
> +uint32_t        max_io_sz;                  /* Max IO transfer per
> NVMf transaction */
> +uint16_t        max_nvme_queue_depth;       /* Max queue depth for
> NVMe backend controller queues */
> +uint16_t        min_nvme_queue_depth;       /* Min queue depth for
> NVMe backend controller queues */
> +uint32_t        max_ioccsz;                 /* Max IO command capsule
> size, 16B units (NVMe-oF spec) */
> +uint32_t        min_ioccsz;                 /* Min IO command capsule
> size, 16B units (NVMe-oF spec) */
> +uint16_t        max_icdoff;                 /* Max in-capsule data
> offset, 16B units (NVMe-oF spec) */
> +.in -8
> +};
>  .fi

A few capabilities I think we should add (sketched below):
lba_data_size_supported - bitmap of LBA data sizes supported by the device
max_namespaces - maximum number of namespaces supported across all SRQs
supports_non_offloaded_ops - flag indicating whether an offloaded QP can
support non-offloaded operations. If it cannot, any operation that arrives and
that the device cannot handle will cause a QP error.
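
In struct ibv_nvmf_caps terms, roughly (the types and field names below are only
a suggestion, following the layout of the quoted struct):

	uint64_t        lba_data_size_supported;     /* bitmap of supported LBA data sizes */
	uint32_t        max_namespaces;              /* max namespaces across all SRQs */
	uint8_t         supports_non_offloaded_ops;  /* offloaded QP also accepts
							non-offloaded operations */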

> diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h
> index 0785c77..3d32cf4 100644
> --- a/libibverbs/verbs.h
> +++ b/libibverbs/verbs.h
> @@ -398,6 +420,8 @@ enum ibv_event_type {
>         IBV_EVENT_CLIENT_REREGISTER,
>         IBV_EVENT_GID_CHANGE,
>         IBV_EVENT_WQ_FATAL,
> +       IBV_EVENT_NVME_PCI_ERR,
> +       IBV_EVENT_NVME_TIMEOUT
>  };

We need a new event to report the case where a non-supported operation
arrives on an offloaded QP that doesn't support co-existence of
non-offloaded operations, e.g. IBV_EVENT_UNSUPPORTED_NVMF_OP.
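
For completeness, the async-event handling I'd expect on the application side
(IBV_EVENT_UNSUPPORTED_NVMF_OP is the proposed addition, the other two come from
the patch):

	struct ibv_async_event ev;

	if (ibv_get_async_event(ctx, &ev))
		return;
	switch (ev.event_type) {
	case IBV_EVENT_NVME_PCI_ERR:
	case IBV_EVENT_NVME_TIMEOUT:
		/* recover or tear down the affected NVMe backend controller */
		break;
	case IBV_EVENT_UNSUPPORTED_NVMF_OP:     /* proposed above */
		/* QP went to error: reconnect, or fall back to the
		 * non-offloaded path */
		break;
	default:
		break;
	}
	ibv_ack_async_event(&ev);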
