On Mon, Apr 30, 2018 at 3:37 PM, Radzi, Amit <Amit.Radzi@xxxxxxxxxx> wrote:
> 2) Adding optimization for event notification for NVMe backend completions.
> This can be done by configuring the NVMe backend CQ to an interrupt vector
> dedicated to the RNIC and configuring its MSIX table entry to a
> device-specific address and value. Additional configuration parameters and
> management of interrupt vectors should be added.

I think this is out of scope for this proposal. MSI-X related configuration is
part of the PCIe spec, so it should probably go into the PCI drivers (or vfio
in the userspace case).

>
> See additional comments (some related to the above requirements) inline.
>
>> +### Creating a SRQ with NVMe-oF offload attributes
>> +
>> +A SRQ with NVMe-oF target offload represents a single NVMe-oF subsystem (a
>> +storage target) to the fabric. Software should call `ibv_create_srq_ex()`,
>> +with `ibv_srq_init_attr_ex.srq_type` set to `nvmf_target_offload` and
>> +`ibv_srq_init_attr_ex.nvmf_attr` set to the specific offload parameters
>> +requested for this SRQ. Parameters should be within the boundaries of the
>> +respective capabilities. Along with the parameters, a staging buffer is
>> +provided for the device's use during the offload. This is a piece of memory
>> +allocated, registered and provided via {mr, addr, len}. Software should not
>> +modify this memory after creating the SRQ, as the device manages it by
>> +itself and stores there data that is in transit between the network and the
>> +NVMe device.
>> +
>> +Note that this SRQ still has a receive queue. The HCA will deliver to
>> +software received commands that are not offloaded, as well as commands
>> +received from QPs attached to the SRQ that do not have NVMF_OFFLOAD enabled.
>> +

> Who is responsible for mapping the staging buffer to all NVMe backend devices?

User software is. Good point, this should be mentioned explicitly.

>
>> +### Modifying QP to enable NVMe-oF offload
>> +
>> +Once a CONNECT command has been received, software can modify the QP to
>> +enable its NVMe-oF offload. `ibv_modify_qp_nvmf()` should be used to enable
>> +NVMe-oF offload. From this point on, the HCA takes ownership of the QP,
>> +inspecting each command capsule received by the SRQ, and if it is an
>> +offloaded command, the flow described above is followed.
>> +
>> +Note that enabling the NVMe-oF offload at QP creation exposes the solution
>> +to a possible standard violation: if an IO command capsule arrives before
>> +the CONNECT request, the device will service it.

> I didn't see any additional reference to 'ibv_modify_qp_nvmf'. Where is it defined?
> Does 'ibv_qp_set_nvmf' replace it?

Right, that's a typo; ibv_qp_set_nvmf() is the intended name.
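To make the intended flow concrete, here is a rough sketch of the two steps:
creating the offload SRQ with its staging buffer, then enabling the offload
once the CONNECT capsule has been validated in software. Only the names
spelled out in the proposal text are taken from it; the exact nvmf_attr
layout, the IBV_SRQT_NVMF enum value and the ibv_qp_set_nvmf() signature are
my assumptions, for illustration only:

#include <infiniband/verbs.h>

static struct ibv_srq *create_nvmf_srq(struct ibv_context *ctx,
				       struct ibv_pd *pd,
				       void *staging, size_t staging_len)
{
	/* Staging buffer: software allocates and registers it, but must not
	 * touch it after SRQ creation; the device owns it and keeps data
	 * that is in transit between network and NVMe device there. */
	struct ibv_mr *mr = ibv_reg_mr(pd, staging, staging_len,
				       IBV_ACCESS_LOCAL_WRITE);
	if (!mr)
		return NULL;

	struct ibv_srq_init_attr_ex attr = {
		.attr      = { .max_wr = 256, .max_sge = 1 },
		.comp_mask = IBV_SRQ_INIT_ATTR_TYPE | IBV_SRQ_INIT_ATTR_PD,
		.srq_type  = IBV_SRQT_NVMF,	/* "nvmf_target_offload" */
		.pd        = pd,
		.nvmf_attr = {			/* field layout assumed */
			.offload_type = IBV_NVMF_READ_WRITE_OFFLOAD,
			.staging_buf  = {	/* the {mr, addr, len} triple */
				.mr   = mr,
				.addr = (uint64_t)(uintptr_t)staging,
				.len  = staging_len,
			},
		},
	};
	return ibv_create_srq_ex(ctx, &attr);
}

/* Once software has received and validated the CONNECT capsule on this QP,
 * hand the QP over to the HCA; from here on it inspects every capsule. */
static int enable_nvmf_offload(struct ibv_qp *qp)
{
	return ibv_qp_set_nvmf(qp, 1 /* enable; full argument list assumed */);
}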
>
>> +++ b/libibverbs/man/ibv_map_nvmf_nsid.3
>> +.B ibv_map_nvmf_nsid()
>> +adds a new NVMe-oF namespace mapping to a given \fInvme_ctrl\fR.
>> +The mapping is from the fabric facing frontend namespace ID
>> +.I fe_nsid
>> +to namespace
>> +.I nvme_nsid
>> +on this NVMe subsystem.
>> +.I fe_nsid
>> +must be unique within the SRQ that
>> +.I nvme_ctrl
>> +belongs to; all ibv_nvme_ctrl objects attached to the same SRQ share the same number space.
>> +.PP
>> +.I lba_data_size
>> +defines the block size this namespace is formatted to, in bytes. Only specific block sizes are supported by the device.
>> +Mapping several namespaces of the same NVMe subsystem is done by calling this function several times with the same
>> +.I nvme_ctrl
>> +while assigning different
>> +.I fe_nsid
>> +and
>> +.I nvme_nsid
>> +with each call.

> Many NVMe controllers require multiple QPs in order to get high bandwidth.
> Having a single NVMe QP (nvme_ctrl object) for each NVMe-oF namespace might
> be a bottleneck. Would it be worth getting a list of nvme_ctrl to be used by
> the device when accessing this namespace?

You mean many NVMe controllers require multiple SQs in order to get high BW...
nvme_ctrl should represent a single NVMe device. A single front-facing NSID
will point to a single backend {nvme_ctrl, nsid}. If that is a requirement, we
should probably add multiple SQs to a single nvme_ctrl, so that hardware can
round-robin the requests between them.

>
>> +.SH "RETURN VALUE"
>> +.B ibv_map_nvmf_nsid()
>> +and
>> +.B ibv_unmap_nvmf_nsid()
>> +return 0 on success, or the value of errno on failure (which indicates the failure reason).
>> +.PP
>> +Failure reasons may be:
>> +.IP EEXIST
>> +Trying to map an already existing front-facing
>> +.I fe_nsid
>> +.IP ENOENT
>> +Trying to delete a non-existent front-facing
>> +.I fe_nsid
>> +.IP ENOTSUP
>> +Given
>> +.I lba_data_size
>> +is not supported on this device. Check device release notes for supported
>> sizes. Format the NVMe namespace to a LBA Format where the data + metadata
>> size is supported by the device.

> Why should this not be a capability, instead of checking the release notes of the device?

Yes, I thought about making this a capability bitmask, but there seemed to be
too many options… besides the regular power-of-two block sizes there are
combinations with metadata. So for instance a device could support 4096, 4160
(8x 512B blocks + 8B DIF each), 4104 (1x 4K block with one 8B DIF), or any
other metadata size…

>
>> diff --git a/libibverbs/man/ibv_query_device_ex.3 b/libibverbs/man/ibv_query_device_ex.3
>> index 1172523..0336583 100644
>> --- a/libibverbs/man/ibv_query_device_ex.3
>> +++ b/libibverbs/man/ibv_query_device_ex.3
>> +enum ibv_nvmf_offload_type {
>> +.in +8
>> +IBV_NVMF_WRITE_OFFLOAD            = 1 << 0,
>> +IBV_NVMF_READ_OFFLOAD             = 1 << 1,
>> +IBV_NVMF_READ_WRITE_OFFLOAD       = 1 << 2,
>> +IBV_NVMF_READ_WRITE_FLUSH_OFFLOAD = 1 << 3,
>> +.in -8
>> +};
>> +

> How about adding 'write_zeroes' and 'dataset_management' as well?

That's the purpose of this enum: as devices offer more offloaded operations,
it can grow.
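Going back to the ibv_map_nvmf_nsid() man page above, this is roughly the
usage I have in mind for mapping two namespaces of one subsystem and handling
the documented errno values. The parameter names come from the man page; their
order in the signature is my assumption:

#include <errno.h>
#include <infiniband/verbs.h>

static int map_two_namespaces(struct ibv_nvme_ctrl *nvme_ctrl)
{
	int err;

	/* Frontend NSID 1 -> backend NSID 1, plain 512B LBA format.
	 * Returns 0 or an errno value: EEXIST if fe_nsid is already mapped
	 * within this SRQ, ENOTSUP if the device cannot handle this
	 * lba_data_size. */
	err = ibv_map_nvmf_nsid(nvme_ctrl, 1 /* fe_nsid */,
				1 /* nvme_nsid */, 512 /* lba_data_size */);
	if (err)
		return err;

	/* Same nvme_ctrl, different {fe_nsid, nvme_nsid}; 4K data + 8B DIF */
	return ibv_map_nvmf_nsid(nvme_ctrl, 2 /* fe_nsid */,
				 2 /* nvme_nsid */, 4104 /* lba_data_size */);
}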
>
>> +struct ibv_nvmf_caps {
>> +.in +8
>> +enum ibv_nvmf_offload_type offload_type; /* Which NVMe-oF operations can be offloaded, 0 = no NVMf offload support */
>> +uint32_t max_backend_ctrls_total;        /* Max NVMe backend controllers total for the device */
>> +uint32_t max_srq_backend_ctrls;          /* Max NVMe backend controllers per SRQ */
>> +uint32_t max_srq_namespaces;             /* Max namespaces per SRQ */
>> +uint32_t min_staging_buf_size;           /* Min size of the staging buffer */
>> +uint32_t max_io_sz;                      /* Max IO transfer per NVMf transaction */
>> +uint16_t max_nvme_queue_depth;           /* Max queue depth for NVMe backend controller queues */
>> +uint16_t min_nvme_queue_depth;           /* Min queue depth for NVMe backend controller queues */
>> +uint32_t max_ioccsz;                     /* Max IO command capsule size, 16B units (NVMe-oF spec) */
>> +uint32_t min_ioccsz;                     /* Min IO command capsule size, 16B units (NVMe-oF spec) */
>> +uint16_t max_icdoff;                     /* Max in-capsule data offset, 16B units (NVMe-oF spec) */
>> +.in -8
>> +};
>> .fi

> A few capabilities I think we should add:
> lba_data_size_supported - Bitmap of the lba_data_size values supported by the device

Discussed above.

> max_namespaces - Maximum number of namespaces supported for all SRQs
> supports_non_offloaded_ops - Flag indicating whether an offloaded QP can
> support non-offloaded operations. In case it doesn't, any operation that
> arrives that the device cannot support will cause a QP error.

Sounds reasonable.

>
>> diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h
>> index 0785c77..3d32cf4 100644
>> --- a/libibverbs/verbs.h
>> +++ b/libibverbs/verbs.h
>> @@ -398,6 +420,8 @@ enum ibv_event_type {
>>  IBV_EVENT_CLIENT_REREGISTER,
>>  IBV_EVENT_GID_CHANGE,
>>  IBV_EVENT_WQ_FATAL,
>> + IBV_EVENT_NVME_PCI_ERR,
>> + IBV_EVENT_NVME_TIMEOUT
>>  };

> We need a new event to give in case a non-supported operation arrives on an
> offloaded QP that doesn't support coexistence of non-offloaded operations,
> e.g. IBV_EVENT_UNSUPPORTED_NVMF_OP.

You mean besides the QP going to error?

--
Oren
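P.S. For completeness, a sketch of how I picture software consuming
ibv_nvmf_caps before setting anything up. ibv_query_device_ex() is existing
API; where nvmf_caps sits inside ibv_device_attr_ex is my assumption:

#include <infiniband/verbs.h>

static int nvmf_offload_usable(struct ibv_context *ctx,
			       uint32_t staging_buf_size,
			       uint32_t nr_namespaces)
{
	struct ibv_device_attr_ex attr = {};
	struct ibv_nvmf_caps *caps;

	if (ibv_query_device_ex(ctx, NULL, &attr))
		return 0;

	caps = &attr.nvmf_caps;		/* member location assumed */

	/* offload_type == 0 means the device has no NVMe-oF offload */
	if (!(caps->offload_type & IBV_NVMF_READ_WRITE_OFFLOAD))
		return 0;

	/* Size resources within the reported device limits */
	return staging_buf_size >= caps->min_staging_buf_size &&
	       nr_namespaces <= caps->max_srq_namespaces;
}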