On Mon, Apr 30, 2018 at 3:37 PM, Radzi, Amit <Amit.Radzi@xxxxxxxxxx> wrote:
> 2) Adding optimization for event notification for NVMe backend completions.
> This can be done by configuring the NVMe backend CQ to an interrupt vector
> dedicated to the RNIC and configuring its MSIX table entry to a
> device-specific address and value. Additional configuration parameters and
> management of interrupt vectors should be added.

I think this is out of scope for this proposal. MSI-X related configuration is
part of the PCIe spec, so it should probably go into the PCI drivers (or vfio
in the userspace case).

>
> See additional comments (some related to the above requirements) inline.
>
>> +### Creating a SRQ with NVMe-oF offload attributes
>> +
>> +A SRQ with NVMe-oF target offload represents a single NVMe-oF subsystem (a
>> +storage target) to the fabric. Software should call `ibv_create_srq_ex()`,
>> +with `ibv_srq_init_attr_ex.srq_type` set to `nvmf_target_offload` and
>> +`ibv_srq_init_attr_ex.nvmf_attr` set to the specific offload parameters
>> +requested for this SRQ. Parameters should be within the boundaries of the
>> +respective capabilities. Along with the parameters, a staging buffer is
>> +provided for the device's use during the offload. This is a piece of memory
>> +allocated, registered and provided via {mr, addr, len}. Software should not
>> +modify this memory after creating the SRQ, as the device manages it by
>> +itself and stores there data that is in transit between the network and the
>> +NVMe device.
>> +
>> +Note that this SRQ still has a receive queue. The HCA will deliver to
>> +software received commands that are not offloaded, as well as commands
>> +received from QPs attached to the SRQ that do not have NVMF_OFFLOAD enabled.
>> +

> Who is responsible for mapping the staging buffer to all NVMe backend devices?

User software is. Good point, this should be mentioned explicitly.

>
>> +### Modifying QP to enable NVMe-oF offload
>> +
>> +Once a CONNECT command has been received, software can modify the QP to
>> +enable its NVMe-oF offload. `ibv_modify_qp_nvmf()` should be used to enable
>> +NVMe-oF offload. From this point on, the HCA takes ownership of the QP,
>> +inspecting each command capsule received by the SRQ, and if it is an
>> +offloaded command, the flow described above is followed.
>> +
>> +Note that enabling the NVMe-oF offload at QP creation exposes the solution
>> +to a possible standard violation: if an IO command capsule arrives before
>> +the CONNECT request, the device will service it.

> I didn't see any additional reference to 'ibv_modify_qp_nvmf'. Where is it defined?
> Does 'ibv_qp_set_nvmf' replace it?

Right, that's a typo; ibv_qp_set_nvmf() is the intended name.
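To make the intended flow concrete, here is a rough sketch of the two steps:
creating the offload SRQ with its staging buffer, then enabling the offload
once the CONNECT capsule has been validated in software. Only the names
spelled out in the proposal text are taken from it; the exact nvmf_attr
layout, the IBV_SRQT_NVMF enum value and the ibv_qp_set_nvmf() signature are
my assumptions, for illustration only:

#include <infiniband/verbs.h>

static struct ibv_srq *create_nvmf_srq(struct ibv_context *ctx,
				       struct ibv_pd *pd,
				       void *staging, size_t staging_len)
{
	/* Staging buffer: software allocates and registers it, but must not
	 * touch it after SRQ creation; the device owns it and keeps data
	 * that is in transit between network and NVMe device there. */
	struct ibv_mr *mr = ibv_reg_mr(pd, staging, staging_len,
				       IBV_ACCESS_LOCAL_WRITE);
	if (!mr)
		return NULL;

	struct ibv_srq_init_attr_ex attr = {
		.attr      = { .max_wr = 256, .max_sge = 1 },
		.comp_mask = IBV_SRQ_INIT_ATTR_TYPE | IBV_SRQ_INIT_ATTR_PD,
		.srq_type  = IBV_SRQT_NVMF,	/* "nvmf_target_offload" */
		.pd        = pd,
		.nvmf_attr = {			/* field layout assumed */
			.offload_type = IBV_NVMF_READ_WRITE_OFFLOAD,
			.staging_buf  = {	/* the {mr, addr, len} triple */
				.mr   = mr,
				.addr = (uint64_t)(uintptr_t)staging,
				.len  = staging_len,
			},
		},
	};
	return ibv_create_srq_ex(ctx, &attr);
}

/* Once software has received and validated the CONNECT capsule on this QP,
 * hand the QP over to the HCA; from here on it inspects every capsule. */
static int enable_nvmf_offload(struct ibv_qp *qp)
{
	return ibv_qp_set_nvmf(qp, 1 /* enable; full argument list assumed */);
}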
>
>> +++ b/libibverbs/man/ibv_map_nvmf_nsid.3
>> +.B ibv_map_nvmf_nsid()
>> +adds a new NVMe-oF namespace mapping to a given \fInvme_ctrl\fR.
>> +The mapping is from the fabric facing frontend namespace ID
>> +.I fe_nsid
>> +to namespace
>> +.I nvme_nsid
>> +on this NVMe subsystem.
>> +.I fe_nsid
>> +must be unique within the SRQ that
>> +.I nvme_ctrl
>> +belongs to; all ibv_nvme_ctrl objects attached to the same SRQ share the same number space.
>> +.PP
>> +.I lba_data_size
>> +defines the block size this namespace is formatted to, in bytes. Only specific block sizes are supported by the device.
>> +Mapping several namespaces of the same NVMe subsystem is done by calling this function several times with the same
>> +.I nvme_ctrl
>> +while assigning different
>> +.I fe_nsid
>> +and
>> +.I nvme_nsid
>> +with each call.

> Many NVMe controllers require multiple QPs in order to get high bandwidth.
> Having a single NVMe QP (nvme_ctrl object) for each NVMe-oF namespace might
> be a bottleneck. Would it be worth getting a list of nvme_ctrl to be used by
> the device when accessing this namespace?

You mean many NVMe controllers require multiple SQs in order to get high BW...
nvme_ctrl should represent a single NVMe device. A single front-facing NSID
will point to a single backend {nvme_ctrl, nsid}. If that is a requirement, we
should probably add multiple SQs to a single nvme_ctrl, so that hardware can
round-robin the requests between them.

>
>> +.SH "RETURN VALUE"
>> +.B ibv_map_nvmf_nsid()
>> +and
>> +.B ibv_unmap_nvmf_nsid()
>> +return 0 on success, or the value of errno on failure (which indicates the failure reason).
>> +.PP
>> +Failure reasons may be:
>> +.IP EEXIST
>> +Trying to map an already existing front-facing
>> +.I fe_nsid
>> +.IP ENOENT
>> +Trying to delete a non-existent front-facing
>> +.I fe_nsid
>> +.IP ENOTSUP
>> +Given
>> +.I lba_data_size
>> +is not supported on this device. Check device release notes for supported
>> sizes. Format the NVMe namespace to a LBA Format where the data + metadata
>> size is supported by the device.

> Why should this not be a capability, instead of checking the release notes of the device?

Yes, I thought about making this a capability bitmask, but there seemed to be
too many options… besides the regular power-of-two block sizes there are
combinations with metadata. So for instance a device could support 4096, 4160
(8x 512B blocks + 8B DIF each), 4104 (1x 4K block with one 8B DIF), or any
other metadata size…

>
>> diff --git a/libibverbs/man/ibv_query_device_ex.3 b/libibverbs/man/ibv_query_device_ex.3
>> index 1172523..0336583 100644
>> --- a/libibverbs/man/ibv_query_device_ex.3
>> +++ b/libibverbs/man/ibv_query_device_ex.3
>> +enum ibv_nvmf_offload_type {
>> +.in +8
>> +IBV_NVMF_WRITE_OFFLOAD            = 1 << 0,
>> +IBV_NVMF_READ_OFFLOAD             = 1 << 1,
>> +IBV_NVMF_READ_WRITE_OFFLOAD       = 1 << 2,
>> +IBV_NVMF_READ_WRITE_FLUSH_OFFLOAD = 1 << 3,
>> +.in -8
>> +};
>> +

> How about adding 'write_zeroes' and 'dataset_management' as well?

That's the purpose of this enum: as devices offer more offloaded operations,
it can grow.
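Going back to the ibv_map_nvmf_nsid() man page above, this is roughly the
usage I have in mind for mapping two namespaces of one subsystem and handling
the documented errno values. The parameter names come from the man page; their
order in the signature is my assumption:

#include <errno.h>
#include <infiniband/verbs.h>

static int map_two_namespaces(struct ibv_nvme_ctrl *nvme_ctrl)
{
	int err;

	/* Frontend NSID 1 -> backend NSID 1, plain 512B LBA format.
	 * Returns 0 or an errno value: EEXIST if fe_nsid is already mapped
	 * within this SRQ, ENOTSUP if the device cannot handle this
	 * lba_data_size. */
	err = ibv_map_nvmf_nsid(nvme_ctrl, 1 /* fe_nsid */,
				1 /* nvme_nsid */, 512 /* lba_data_size */);
	if (err)
		return err;

	/* Same nvme_ctrl, different {fe_nsid, nvme_nsid}; 4K data + 8B DIF */
	return ibv_map_nvmf_nsid(nvme_ctrl, 2 /* fe_nsid */,
				 2 /* nvme_nsid */, 4104 /* lba_data_size */);
}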
>
>> +struct ibv_nvmf_caps {
>> +.in +8
>> +enum ibv_nvmf_offload_type offload_type; /* Which NVMe-oF operations can be offloaded, 0 = no NVMf offload support */
>> +uint32_t max_backend_ctrls_total;        /* Max NVMe backend controllers total for the device */
>> +uint32_t max_srq_backend_ctrls;          /* Max NVMe backend controllers per SRQ */
>> +uint32_t max_srq_namespaces;             /* Max namespaces per SRQ */
>> +uint32_t min_staging_buf_size;           /* Min size of the staging buffer */
>> +uint32_t max_io_sz;                      /* Max IO transfer per NVMf transaction */
>> +uint16_t max_nvme_queue_depth;           /* Max queue depth for NVMe backend controller queues */
>> +uint16_t min_nvme_queue_depth;           /* Min queue depth for NVMe backend controller queues */
>> +uint32_t max_ioccsz;                     /* Max IO command capsule size, 16B units (NVMe-oF spec) */
>> +uint32_t min_ioccsz;                     /* Min IO command capsule size, 16B units (NVMe-oF spec) */
>> +uint16_t max_icdoff;                     /* Max in-capsule data offset, 16B units (NVMe-oF spec) */
>> +.in -8
>> +};
>> .fi

> A few capabilities I think we should add:
> lba_data_size_supported - Bitmap of the lba_data_size values supported by the device

Discussed above.

> max_namespaces - Maximum number of namespaces supported for all SRQs
> supports_non_offloaded_ops - Flag indicating whether an offloaded QP can
> support non-offloaded operations. In case it doesn't, any operation that
> arrives that the device cannot support will cause a QP error.

Sounds reasonable.

>
>> diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h
>> index 0785c77..3d32cf4 100644
>> --- a/libibverbs/verbs.h
>> +++ b/libibverbs/verbs.h
>> @@ -398,6 +420,8 @@ enum ibv_event_type {
>>  IBV_EVENT_CLIENT_REREGISTER,
>>  IBV_EVENT_GID_CHANGE,
>>  IBV_EVENT_WQ_FATAL,
>> + IBV_EVENT_NVME_PCI_ERR,
>> + IBV_EVENT_NVME_TIMEOUT
>>  };

> We need a new event to give in case a non-supported operation arrives on an
> offloaded QP that doesn't support coexistence of non-offloaded operations,
> e.g. IBV_EVENT_UNSUPPORTED_NVMF_OP.

You mean besides the QP going to error?

--
Oren
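P.S. For completeness, a sketch of how I picture software consuming
ibv_nvmf_caps before setting anything up. ibv_query_device_ex() is existing
API; where nvmf_caps sits inside ibv_device_attr_ex is my assumption:

#include <infiniband/verbs.h>

static int nvmf_offload_usable(struct ibv_context *ctx,
			       uint32_t staging_buf_size,
			       uint32_t nr_namespaces)
{
	struct ibv_device_attr_ex attr = {};
	struct ibv_nvmf_caps *caps;

	if (ibv_query_device_ex(ctx, NULL, &attr))
		return 0;

	caps = &attr.nvmf_caps;		/* member location assumed */

	/* offload_type == 0 means the device has no NVMe-oF offload */
	if (!(caps->offload_type & IBV_NVMF_READ_WRITE_OFFLOAD))
		return 0;

	/* Size resources within the reported device limits */
	return staging_buf_size >= caps->min_staging_buf_size &&
	       nr_namespaces <= caps->max_srq_namespaces;
}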