Re: [PATCH v4 17/18] nvmet: New NVMe PCI endpoint target driver

On Thu, Dec 12, 2024 at 08:34:39PM +0900, Damien Le Moal wrote:
> Implement a PCI target driver using the PCI endpoint framework. This
> requires hardware with a PCI controller capable of executing in endpoint
> mode.
> 
> The PCI endpoint framework is used to set up a PCI endpoint device and
> its BAR compatible with an NVMe PCI controller. The framework is also
> used to map local memory to the PCI address space to execute MMIO
> accesses for retrieving NVMe commands from submission queues and posting
> completion entries to completion queues. If supported, DMA is used for
> command data transfers, based on the PCI address segments indicated by
> the command using either PRPs or SGLs.
> 
> The NVMe target driver relies on the NVMe target core code to execute
> all commands issued by the host. The PCI target driver is mainly
> responsible for the following:
>  - Initialization and teardown of the endpoint device and its backend
>    PCI target controller. The PCI target controller is created using a
>    subsystem and a port defined through configfs. The port used must be
>    initialized with the "pci" transport type. The target controller is
>    allocated and initialized when the PCI endpoint is started by binding
>    it to the endpoint PCI device (nvmet_pciep_epf_epc_init() function).
> 
>  - Manage the endpoint controller state according to the PCI link state
>    and the actions of the host (e.g. checking the CC.EN register) and
>    propagate these actions to the PCI target controller. Polling of the
>    controller enable/disable is done using a delayed work scheduled
>    every 5ms (nvmet_pciep_poll_cc() function). This work is started
>    whenever the PCI link comes up (nvmet_pciep_epf_link_up() notifier
>    function) and stopped when the PCI link comes down
>    (nvmet_pciep_epf_link_down() notifier function).
>    nvmet_pciep_poll_cc() enables and disables the PCI controller using
>    the functions nvmet_pciep_enable_ctrl() and
>    nvmet_pciep_disable_ctrl(). The controller admin queue is created
>    using nvmet_pciep_create_cq(), which calls nvmet_cq_create(), and
>    nvmet_pciep_create_sq() which uses nvmet_sq_create().
>    nvmet_pciep_disable_ctrl() always resets the PCI controller to its
>    initial state so that nvmet_pciep_enable_ctrl() can be called again.
>    This ensures correct operation if, for instance, the host reboots
>    causing the PCI link to be temporarily down.
> 
>  - Manage the controller admin and I/O submission queues using local
>    memory. Commands are obtained from submission queues using a work
>    item that constantly polls the doorbells of all submission queues
>    (nvmet_pciep_poll_sqs() function). This work is started whenever the
>    controller is enabled (nvmet_pciep_enable_ctrl() function) and
>    stopped when the controller is disabled (nvmet_pciep_disable_ctrl()
>    function). When new commands are submitted by the host, DMA transfers
>    are used to retrieve the commands.
> 
>  - Initiate the execution of all admin and I/O commands using the target
>    core code, by calling a request's execute() function. All commands are
>    individually handled using a per-command work item
>    (nvmet_pciep_iod_work() function). A command's overall execution
>    includes: initializing a struct nvmet_req request for the command,
>    using nvmet_req_transfer_len() to get the command data transfer length,
>    parsing the command PRPs or SGLs to get the PCI address segments of
>    the command data buffer, retrieving data from the host (if the command
>    is a write command), calling req->execute() to execute the command, and
>    transferring data to the host (for read commands).
> 
>  - Handle the completions of commands as notified by the
>    ->queue_response() operation of the PCI target controller
>    (nvmet_pciep_queue_response() function). Completed commands are added
>    to a list of completed commands for their CQ. Each CQ list of
>    completed commands is processed using a work item
>    (nvmet_pciep_cq_work() function) which posts entries for the
>    completed commands in the CQ memory and raises an IRQ to the host to
>    signal the completion. IRQ coalescing is supported as mandated by the
>    NVMe base specification for PCI controllers. Of note is that
>    completion entries are transmitted to the host using MMIO, after
>    mapping the completion queue memory to the host PCI address space.
>    Unlike for retrieving commands from SQs, DMA is not used as it
>    degrades performance due to the transfer serialization needed (which
>    delays the transmission of completion entries).
> 
> The configuration of an NVMe PCI endpoint controller is done using
> configfs. First, the NVMe PCI target controller configuration must be
> done to set up a subsystem and a port with the "pci" addr_trtype
> attribute. The subsystem can be set up using a file or block device
> backed namespace or using a passthrough NVMe device. After this, the
> PCI endpoint can be configured and bound to the PCI endpoint controller
> to start the NVMe endpoint controller.
> 
> In order to not overcomplicate this initial implementation of an
> endpoint PCI target controller driver, protection information is not
> supported for now. If the PCI controller port and namespace are
> configured with protection information support, an error will be
> returned when the controller is created and initialized, that is, when
> the endpoint function is started. Protection information support will be
> added in a follow-up patch series.
> 
> Using a Rock5B board (Rockchip RK3588 SoC, PCI Gen3x4 endpoint
> controller) with a target PCI controller set up with 4 I/O queues and a
> null_blk block device as a namespace, the maximum performance using fio
> was measured at 131 KIOPS for random 4K reads and up to 2.8 GB/s
> throughput. Some data points are:
> 
> Rnd read,   4KB,  QD=1, 1 job : IOPS=16.9k, BW=66.2MiB/s (69.4MB/s)
> Rnd read,   4KB, QD=32, 1 job : IOPS=78.5k, BW=307MiB/s (322MB/s)
> Rnd read,   4KB, QD=32, 4 jobs: IOPS=131k, BW=511MiB/s (536MB/s)
> Seq read, 512KB, QD=32, 1 job : IOPS=5381, BW=2691MiB/s (2821MB/s)
> 
> The NVMe PCI endpoint target driver is not intended for production use.
> It is a tool for learning NVMe, exploring existing features and testing
> implementations of new NVMe features.
> 
> Co-developed-by: Rick Wertenbroek <rick.wertenbroek@xxxxxxxxx>
> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
> Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> ---
>  drivers/nvme/target/Kconfig  |   10 +
>  drivers/nvme/target/Makefile |    2 +
>  drivers/nvme/target/pci-ep.c | 2626 ++++++++++++++++++++++++++++++++++
>  3 files changed, 2638 insertions(+)
>  create mode 100644 drivers/nvme/target/pci-ep.c
> 
> diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
> index 46be031f91b4..6a0818282427 100644
> --- a/drivers/nvme/target/Kconfig
> +++ b/drivers/nvme/target/Kconfig
> @@ -115,3 +115,13 @@ config NVME_TARGET_AUTH
>  	  target side.
>  
>  	  If unsure, say N.
> +
> +config NVME_TARGET_PCI_EP

Could you please use NVME_TARGET_PCI_EPF for consistency with other EPF drivers?

> +	tristate "NVMe PCI Endpoint Target support"
> +	depends on PCI_ENDPOINT && NVME_TARGET
> +	help
> +	  This enables the NVMe PCI endpoint target support which allows to
> +	  create an NVMe PCI controller using a PCI endpoint capable PCI
> +	  controller.

'This allows creating an NVMe PCI endpoint target using an endpoint-capable
PCI controller.'

> +
> +	  If unsure, say N.
> diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
> index f2b025bbe10c..8110faa1101f 100644
> --- a/drivers/nvme/target/Makefile
> +++ b/drivers/nvme/target/Makefile
> @@ -8,6 +8,7 @@ obj-$(CONFIG_NVME_TARGET_RDMA)		+= nvmet-rdma.o
>  obj-$(CONFIG_NVME_TARGET_FC)		+= nvmet-fc.o
>  obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
>  obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o
> +obj-$(CONFIG_NVME_TARGET_PCI_EP)	+= nvmet-pciep.o

Same as above: nvmet-pci-epf

>  
>  nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
>  			discovery.o io-cmd-file.o io-cmd-bdev.o pr.o
> @@ -20,4 +21,5 @@ nvmet-rdma-y	+= rdma.o
>  nvmet-fc-y	+= fc.o
>  nvme-fcloop-y	+= fcloop.o
>  nvmet-tcp-y	+= tcp.o
> +nvmet-pciep-y	+= pci-ep.o
>  nvmet-$(CONFIG_TRACING)	+= trace.o
> diff --git a/drivers/nvme/target/pci-ep.c b/drivers/nvme/target/pci-ep.c
> new file mode 100644
> index 000000000000..d30d35248e64
> --- /dev/null
> +++ b/drivers/nvme/target/pci-ep.c
> @@ -0,0 +1,2626 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NVMe PCI endpoint device.

'NVMe PCI Endpoint Function driver'

> + * Copyright (c) 2024, Western Digital Corporation or its affiliates.
> + * Copyright (c) 2024, Rick Wertenbroek <rick.wertenbroek@xxxxxxxxx>
> + *                     REDS Institute, HEIG-VD, HES-SO, Switzerland
> + */
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/delay.h>
> +#include <linux/dmaengine.h>
> +#include <linux/io.h>
> +#include <linux/mempool.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/nvme.h>
> +#include <linux/pci_ids.h>
> +#include <linux/pci-epc.h>
> +#include <linux/pci-epf.h>
> +#include <linux/pci_regs.h>
> +#include <linux/slab.h>
> +
> +#include "nvmet.h"
> +
> +static LIST_HEAD(nvmet_pciep_ports);

Please name all variables/functions as 'pci_epf' instead of 'pciep'.

> +static DEFINE_MUTEX(nvmet_pciep_ports_mutex);

[...]

> +/*
> + * PCI EPF driver private data.
> + */
> +struct nvmet_pciep_epf {

nvmet_pci_epf?

> +	struct pci_epf			*epf;
> +
> +	const struct pci_epc_features	*epc_features;
> +
> +	void				*reg_bar;
> +	size_t				msix_table_offset;
> +
> +	unsigned int			irq_type;
> +	unsigned int			nr_vectors;
> +
> +	struct nvmet_pciep_ctrl		ctrl;
> +
> +	struct dma_chan			*dma_tx_chan;
> +	struct mutex			dma_tx_lock;
> +	struct dma_chan			*dma_rx_chan;
> +	struct mutex			dma_rx_lock;
> +
> +	struct mutex			mmio_lock;
> +
> +	/* PCI endpoint function configfs attributes */
> +	struct config_group		group;
> +	bool				dma_enable;
> +	__le16				portid;
> +	char				subsysnqn[NVMF_NQN_SIZE];
> +	unsigned int			mdts_kb;
> +};
> +
> +static inline u32 nvmet_pciep_bar_read32(struct nvmet_pciep_ctrl *ctrl, u32 off)
> +{
> +	__le32 *bar_reg = ctrl->bar + off;
> +
> +	return le32_to_cpu(READ_ONCE(*bar_reg));

Looks like you can use readl/writel variants here. Any reason to not use them?
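
For example, a minimal sketch of what I mean, assuming ctrl->bar is an __iomem
mapping (readl()/writel() come from the already included <linux/io.h>):

static u32 nvmet_pciep_bar_read32(struct nvmet_pciep_ctrl *ctrl, u32 off)
{
	return readl(ctrl->bar + off);
}

static void nvmet_pciep_bar_write32(struct nvmet_pciep_ctrl *ctrl, u32 off,
				    u32 val)
{
	writel(val, ctrl->bar + off);
}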

> +}
> +

[...]

> +static bool nvmet_pciep_epf_init_dma(struct nvmet_pciep_epf *nvme_epf)
> +{
> +	struct pci_epf *epf = nvme_epf->epf;
> +	struct device *dev = &epf->dev;
> +	struct nvmet_pciep_epf_dma_filter filter;
> +	struct dma_chan *chan;
> +	dma_cap_mask_t mask;
> +
> +	mutex_init(&nvme_epf->dma_rx_lock);
> +	mutex_init(&nvme_epf->dma_tx_lock);
> +
> +	dma_cap_zero(mask);
> +	dma_cap_set(DMA_SLAVE, mask);
> +
> +	filter.dev = epf->epc->dev.parent;
> +	filter.dma_mask = BIT(DMA_DEV_TO_MEM);
> +
> +	chan = dma_request_channel(mask, nvmet_pciep_epf_dma_filter, &filter);
> +	if (!chan)
> +		return false;

You should also destroy the mutexes in the error path.

> +
> +	nvme_epf->dma_rx_chan = chan;
> +
> +	dev_dbg(dev, "Using DMA RX channel %s, maximum segment size %u B\n",
> +		dma_chan_name(chan),
> +		dma_get_max_seg_size(dmaengine_get_dma_device(chan)));
> +

You should print this message at the end of the function. Otherwise, if there is
a problem acquiring the TX channel, both this message and the 'DMA not
supported...' one will be printed.
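
Putting this and the mutex_destroy() point above together, the tail of the
function could look roughly like this (only a sketch):

	chan = dma_request_channel(mask, nvmet_pciep_epf_dma_filter, &filter);
	if (!chan)
		goto out_destroy_mutexes;
	nvme_epf->dma_rx_chan = chan;

	filter.dma_mask = BIT(DMA_MEM_TO_DEV);
	chan = dma_request_channel(mask, nvmet_pciep_epf_dma_filter, &filter);
	if (!chan)
		goto out_release_rx;
	nvme_epf->dma_tx_chan = chan;

	/* Only print once we know both channels are usable */
	dev_dbg(dev, "Using DMA RX channel %s\n",
		dma_chan_name(nvme_epf->dma_rx_chan));
	dev_dbg(dev, "Using DMA TX channel %s\n",
		dma_chan_name(nvme_epf->dma_tx_chan));

	return true;

out_release_rx:
	dma_release_channel(nvme_epf->dma_rx_chan);
	nvme_epf->dma_rx_chan = NULL;
out_destroy_mutexes:
	mutex_destroy(&nvme_epf->dma_rx_lock);
	mutex_destroy(&nvme_epf->dma_tx_lock);
	return false;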

> +	filter.dma_mask = BIT(DMA_MEM_TO_DEV);
> +	chan = dma_request_channel(mask, nvmet_pciep_epf_dma_filter, &filter);
> +	if (!chan) {
> +		dma_release_channel(nvme_epf->dma_rx_chan);
> +		nvme_epf->dma_rx_chan = NULL;
> +		return false;
> +	}
> +
> +	nvme_epf->dma_tx_chan = chan;
> +
> +	dev_dbg(dev, "Using DMA TX channel %s, maximum segment size %u B\n",
> +		dma_chan_name(chan),
> +		dma_get_max_seg_size(dmaengine_get_dma_device(chan)));
> +
> +	return true;
> +}
> +
> +static void nvmet_pciep_epf_deinit_dma(struct nvmet_pciep_epf *nvme_epf)
> +{
> +	if (nvme_epf->dma_tx_chan) {

If you destroy the mutexes in the error path as I suggested above, you could
just bail out here if !nvme_epf->dma_enable.
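
i.e. something like (sketch, assuming dma_enable is cleared whenever
nvmet_pciep_epf_init_dma() fails):

static void nvmet_pciep_epf_deinit_dma(struct nvmet_pciep_epf *nvme_epf)
{
	if (!nvme_epf->dma_enable)
		return;

	dma_release_channel(nvme_epf->dma_tx_chan);
	nvme_epf->dma_tx_chan = NULL;
	dma_release_channel(nvme_epf->dma_rx_chan);
	nvme_epf->dma_rx_chan = NULL;
	mutex_destroy(&nvme_epf->dma_rx_lock);
	mutex_destroy(&nvme_epf->dma_tx_lock);
}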

> +		dma_release_channel(nvme_epf->dma_tx_chan);
> +		nvme_epf->dma_tx_chan = NULL;
> +	}
> +
> +	if (nvme_epf->dma_rx_chan) {
> +		dma_release_channel(nvme_epf->dma_rx_chan);
> +		nvme_epf->dma_rx_chan = NULL;
> +	}
> +
> +	mutex_destroy(&nvme_epf->dma_rx_lock);
> +	mutex_destroy(&nvme_epf->dma_tx_lock);
> +}
> +
> +static int nvmet_pciep_epf_dma_transfer(struct nvmet_pciep_epf *nvme_epf,
> +		struct nvmet_pciep_segment *seg, enum dma_data_direction dir)
> +{
> +	struct pci_epf *epf = nvme_epf->epf;
> +	struct dma_async_tx_descriptor *desc;
> +	struct dma_slave_config sconf = {};
> +	struct device *dev = &epf->dev;
> +	struct device *dma_dev;
> +	struct dma_chan *chan;
> +	dma_cookie_t cookie;
> +	dma_addr_t dma_addr;
> +	struct mutex *lock;
> +	int ret;
> +
> +	switch (dir) {
> +	case DMA_FROM_DEVICE:
> +		lock = &nvme_epf->dma_rx_lock;
> +		chan = nvme_epf->dma_rx_chan;
> +		sconf.direction = DMA_DEV_TO_MEM;
> +		sconf.src_addr = seg->pci_addr;
> +		break;
> +	case DMA_TO_DEVICE:
> +		lock = &nvme_epf->dma_tx_lock;
> +		chan = nvme_epf->dma_tx_chan;
> +		sconf.direction = DMA_MEM_TO_DEV;
> +		sconf.dst_addr = seg->pci_addr;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	mutex_lock(lock);
> +
> +	dma_dev = dmaengine_get_dma_device(chan);
> +	dma_addr = dma_map_single(dma_dev, seg->buf, seg->length, dir);
> +	ret = dma_mapping_error(dma_dev, dma_addr);
> +	if (ret)
> +		goto unlock;
> +
> +	ret = dmaengine_slave_config(chan, &sconf);
> +	if (ret) {
> +		dev_err(dev, "Failed to configure DMA channel\n");
> +		goto unmap;
> +	}
> +
> +	desc = dmaengine_prep_slave_single(chan, dma_addr, seg->length,
> +					   sconf.direction, DMA_CTRL_ACK);
> +	if (!desc) {
> +		dev_err(dev, "Failed to prepare DMA\n");
> +		ret = -EIO;
> +		goto unmap;
> +	}
> +
> +	cookie = dmaengine_submit(desc);
> +	ret = dma_submit_error(cookie);
> +	if (ret) {
> +		dev_err(dev, "DMA submit failed %d\n", ret);
> +		goto unmap;
> +	}
> +
> +	if (dma_sync_wait(chan, cookie) != DMA_COMPLETE) {

Why do you need to do a sync transfer all the time? This defeats the purpose of
using DMA.
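
For reference, the usual asynchronous pattern would look roughly like this
(sketch only; nvmet_pciep_epf_dma_xfer_done() is a hypothetical completion
callback, not something in this patch):

	desc = dmaengine_prep_slave_single(chan, dma_addr, seg->length,
					   sconf.direction,
					   DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
	if (!desc) {
		ret = -EIO;
		goto unmap;
	}

	/* Complete asynchronously instead of spinning in dma_sync_wait() */
	desc->callback = nvmet_pciep_epf_dma_xfer_done;	/* hypothetical */
	desc->callback_param = seg;

	cookie = dmaengine_submit(desc);
	ret = dma_submit_error(cookie);
	if (ret)
		goto unmap;

	dma_async_issue_pending(chan);

	/*
	 * The callback would then do the dma_unmap_single() and resume
	 * command processing, so the unmap/unlock below would need to move
	 * there as well.
	 */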

> +		dev_err(dev, "DMA transfer failed\n");
> +		ret = -EIO;
> +	}
> +
> +	dmaengine_terminate_sync(chan);
> +
> +unmap:
> +	dma_unmap_single(dma_dev, dma_addr, seg->length, dir);
> +
> +unlock:
> +	mutex_unlock(lock);
> +
> +	return ret;
> +}
> +
> +static int nvmet_pciep_epf_mmio_transfer(struct nvmet_pciep_epf *nvme_epf,
> +		struct nvmet_pciep_segment *seg, enum dma_data_direction dir)
> +{
> +	u64 pci_addr = seg->pci_addr;
> +	u32 length = seg->length;
> +	void *buf = seg->buf;
> +	struct pci_epc_map map;
> +	int ret = -EINVAL;
> +
> +	/*
> +	 * Note: mmio transfers do not need serialization but this is a

MMIO

> +	 * simple way to avoid using too many mapping windows.
> +	 */
> +	mutex_lock(&nvme_epf->mmio_lock);
> +
> +	while (length) {
> +		ret = nvmet_pciep_epf_mem_map(nvme_epf, pci_addr, length, &map);
> +		if (ret)
> +			break;
> +
> +		switch (dir) {
> +		case DMA_FROM_DEVICE:
> +			memcpy_fromio(buf, map.virt_addr, map.pci_size);
> +			break;
> +		case DMA_TO_DEVICE:
> +			memcpy_toio(map.virt_addr, buf, map.pci_size);
> +			break;
> +		default:
> +			ret = -EINVAL;
> +			goto unlock;
> +		}
> +
> +		pci_addr += map.pci_size;
> +		buf += map.pci_size;
> +		length -= map.pci_size;
> +
> +		nvmet_pciep_epf_mem_unmap(nvme_epf, &map);
> +	}
> +
> +unlock:
> +	mutex_unlock(&nvme_epf->mmio_lock);
> +
> +	return ret;
> +}
> +
> +static inline int nvmet_pciep_epf_transfer(struct nvmet_pciep_epf *nvme_epf,

No need to add the 'inline' keyword in .c files.

> +		struct nvmet_pciep_segment *seg, enum dma_data_direction dir)
> +{
> +	if (nvme_epf->dma_enable)
> +		return nvmet_pciep_epf_dma_transfer(nvme_epf, seg, dir);
> +
> +	return nvmet_pciep_epf_mmio_transfer(nvme_epf, seg, dir);
> +}
> +

[...]

> +/*
> + * Transfer a prp list from the host and return the number of prps.

PRP (everywhere in comments)

> + */
> +static int nvmet_pciep_get_prp_list(struct nvmet_pciep_ctrl *ctrl, u64 prp,
> +				    size_t xfer_len, __le64 *prps)
> +{
> +	size_t nr_prps = (xfer_len + ctrl->mps_mask) >> ctrl->mps_shift;
> +	u32 length;
> +	int ret;
> +
> +	/*
> +	 * Compute the number of PRPs required for the number of bytes to
> +	 * transfer (xfer_len). If this number overflows the memory page size
> +	 * with the PRP list pointer specified, only return the space available
> +	 * in the memory page, the last PRP in there will be a PRP list pointer
> +	 * to the remaining PRPs.
> +	 */
> +	length = min(nvmet_pciep_prp_size(ctrl, prp), nr_prps << 3);
> +	ret = nvmet_pciep_transfer(ctrl, prps, prp, length, DMA_FROM_DEVICE);
> +	if (ret)
> +		return ret;
> +
> +	return length >> 3;
> +}
> +
> +static int nvmet_pciep_iod_parse_prp_list(struct nvmet_pciep_ctrl *ctrl,
> +					  struct nvmet_pciep_iod *iod)
> +{
> +	struct nvme_command *cmd = &iod->cmd;
> +	struct nvmet_pciep_segment *seg;
> +	size_t size = 0, ofst, prp_size, xfer_len;
> +	size_t transfer_len = iod->data_len;
> +	int nr_segs, nr_prps = 0;
> +	u64 pci_addr, prp;
> +	int i = 0, ret;
> +	__le64 *prps;
> +
> +	prps = kzalloc(ctrl->mps, GFP_KERNEL);
> +	if (!prps)
> +		goto internal;

Can you prefix 'err_' to the label?

> +
> +	/*
> +	 * Allocate PCI segments for the command: this considers the worst case
> +	 * scenario where all prps are discontiguous, so get as many segments
> +	 * as we can have prps. In practice, most of the time, we will have
> +	 * far less PCI segments than prps.
> +	 */
> +	prp = le64_to_cpu(cmd->common.dptr.prp1);
> +	if (!prp)
> +		goto invalid_field;
> +
> +	ofst = nvmet_pciep_prp_ofst(ctrl, prp);
> +	nr_segs = (transfer_len + ofst + ctrl->mps - 1) >> ctrl->mps_shift;
> +
> +	ret = nvmet_pciep_alloc_iod_data_segs(iod, nr_segs);
> +	if (ret)
> +		goto internal;
> +
> +	/* Set the first segment using prp1 */
> +	seg = &iod->data_segs[0];
> +	seg->pci_addr = prp;
> +	seg->length = nvmet_pciep_prp_size(ctrl, prp);
> +
> +	size = seg->length;
> +	pci_addr = prp + size;
> +	nr_segs = 1;
> +
> +	/*
> +	 * Now build the PCI address segments using the prp lists, starting
> +	 * from prp2.
> +	 */
> +	prp = le64_to_cpu(cmd->common.dptr.prp2);
> +	if (!prp)
> +		goto invalid_field;
> +
> +	while (size < transfer_len) {
> +		xfer_len = transfer_len - size;
> +
> +		if (!nr_prps) {
> +			/* Get the prp list */
> +			nr_prps = nvmet_pciep_get_prp_list(ctrl, prp,
> +							   xfer_len, prps);
> +			if (nr_prps < 0)
> +				goto internal;
> +
> +			i = 0;
> +			ofst = 0;
> +		}
> +
> +		/* Current entry */
> +		prp = le64_to_cpu(prps[i]);
> +		if (!prp)
> +			goto invalid_field;
> +
> +		/* Did we reach the last prp entry of the list ? */
> +		if (xfer_len > ctrl->mps && i == nr_prps - 1) {
> +			/* We need more PRPs: prp is a list pointer */
> +			nr_prps = 0;
> +			continue;
> +		}
> +
> +		/* Only the first prp is allowed to have an offset */
> +		if (nvmet_pciep_prp_ofst(ctrl, prp))
> +			goto invalid_offset;
> +
> +		if (prp != pci_addr) {
> +			/* Discontiguous prp: new segment */
> +			nr_segs++;
> +			if (WARN_ON_ONCE(nr_segs > iod->nr_data_segs))
> +				goto internal;
> +
> +			seg++;
> +			seg->pci_addr = prp;
> +			seg->length = 0;
> +			pci_addr = prp;
> +		}
> +
> +		prp_size = min_t(size_t, ctrl->mps, xfer_len);
> +		seg->length += prp_size;
> +		pci_addr += prp_size;
> +		size += prp_size;
> +
> +		i++;
> +	}
> +
> +	iod->nr_data_segs = nr_segs;
> +	ret = 0;
> +
> +	if (size != transfer_len) {
> +		dev_err(ctrl->dev, "PRPs transfer length mismatch %zu / %zu\n",
> +			size, transfer_len);
> +		goto internal;
> +	}
> +
> +	kfree(prps);
> +
> +	return 0;
> +
> +invalid_offset:
> +	dev_err(ctrl->dev, "PRPs list invalid offset\n");
> +	kfree(prps);
> +	iod->status = NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR;
> +	return -EINVAL;
> +
> +invalid_field:
> +	dev_err(ctrl->dev, "PRPs list invalid field\n");
> +	kfree(prps);
> +	iod->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
> +	return -EINVAL;
> +
> +internal:
> +	dev_err(ctrl->dev, "PRPs list internal error\n");
> +	kfree(prps);
> +	iod->status = NVME_SC_INTERNAL | NVME_STATUS_DNR;
> +	return -EINVAL;

Can't you organize the labels in such a way that there is only one return path?
The current code is difficult to read and would also confuse static checkers.
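
E.g. something along these lines for the end of the function, so that the
kfree() and the error return live in a single place (sketch only, reusing the
status codes from the current code):

	kfree(prps);

	return 0;

err_invalid_offset:
	dev_err(ctrl->dev, "PRPs list invalid offset\n");
	iod->status = NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR;
	goto err;

err_invalid_field:
	dev_err(ctrl->dev, "PRPs list invalid field\n");
	iod->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
	goto err;

err_internal:
	dev_err(ctrl->dev, "PRPs list internal error\n");
	iod->status = NVME_SC_INTERNAL | NVME_STATUS_DNR;
err:
	kfree(prps);
	return -EINVAL;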

> +}
> +

[...]

> +static void nvmet_pciep_init_bar(struct nvmet_pciep_ctrl *ctrl)
> +{
> +	struct nvmet_ctrl *tctrl = ctrl->tctrl;
> +
> +	ctrl->bar = ctrl->nvme_epf->reg_bar;
> +
> +	/* Copy the target controller capabilities as a base */
> +	ctrl->cap = tctrl->cap;
> +
> +	/* Contiguous Queues Required (CQR) */
> +	ctrl->cap |= 0x1ULL << 16;
> +
> +	/* Set Doorbell stride to 4B (DSTRB) */
> +	ctrl->cap &= ~GENMASK(35, 32);
> +
> +	/* Clear NVM Subsystem Reset Supported (NSSRS) */
> +	ctrl->cap &= ~(0x1ULL << 36);
> +
> +	/* Clear Boot Partition Support (BPS) */
> +	ctrl->cap &= ~(0x1ULL << 45);
> +
> +	/* Clear Persistent Memory Region Supported (PMRS) */
> +	ctrl->cap &= ~(0x1ULL << 56);
> +
> +	/* Clear Controller Memory Buffer Supported (CMBS) */
> +	ctrl->cap &= ~(0x1ULL << 57);

Can you use macros for these?
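
E.g. something like the below (the macro names are just a suggestion, they are
not existing linux/nvme.h definitions):

#define NVMET_PCIEP_CAP_CQR		BIT_ULL(16)
#define NVMET_PCIEP_CAP_DSTRD_MASK	GENMASK_ULL(35, 32)
#define NVMET_PCIEP_CAP_NSSRS		BIT_ULL(36)
#define NVMET_PCIEP_CAP_BPS		BIT_ULL(45)
#define NVMET_PCIEP_CAP_PMRS		BIT_ULL(56)
#define NVMET_PCIEP_CAP_CMBS		BIT_ULL(57)

	/* Contiguous Queues Required (CQR) */
	ctrl->cap |= NVMET_PCIEP_CAP_CQR;

	/* Set Doorbell stride to 4B (DSTRD) */
	ctrl->cap &= ~NVMET_PCIEP_CAP_DSTRD_MASK;

	/* Clear NSSRS, BPS, PMRS and CMBS */
	ctrl->cap &= ~(NVMET_PCIEP_CAP_NSSRS | NVMET_PCIEP_CAP_BPS |
		       NVMET_PCIEP_CAP_PMRS | NVMET_PCIEP_CAP_CMBS);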

> +
> +	/* Controller configuration */
> +	ctrl->cc = tctrl->cc & (~NVME_CC_ENABLE);
> +
> +	/* Controller status */
> +	ctrl->csts = ctrl->tctrl->csts;
> +
> +	nvmet_pciep_bar_write64(ctrl, NVME_REG_CAP, ctrl->cap);
> +	nvmet_pciep_bar_write32(ctrl, NVME_REG_VS, tctrl->subsys->ver);
> +	nvmet_pciep_bar_write32(ctrl, NVME_REG_CSTS, ctrl->csts);
> +	nvmet_pciep_bar_write32(ctrl, NVME_REG_CC, ctrl->cc);
> +}
> +

[...]

> +static int nvmet_pciep_epf_configure_bar(struct nvmet_pciep_epf *nvme_epf)
> +{
> +	struct pci_epf *epf = nvme_epf->epf;
> +	const struct pci_epc_features *epc_features = nvme_epf->epc_features;
> +	size_t reg_size, reg_bar_size;
> +	size_t msix_table_size = 0;
> +
> +	/*
> +	 * The first free BAR will be our register BAR and per NVMe
> +	 * specifications, it must be BAR 0.
> +	 */
> +	if (pci_epc_get_first_free_bar(epc_features) != BAR_0) {
> +		dev_err(&epf->dev, "BAR 0 is not free\n");
> +		return -EINVAL;

-ENOMEM or -ENODEV?

> +	}
> +
> +	/* Initialize BAR flags */
> +	if (epc_features->bar[BAR_0].only_64bit)
> +		epf->bar[BAR_0].flags |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> +
> +	/*
> +	 * Calculate the size of the register bar: NVMe registers first with
> +	 * enough space for the doorbells, followed by the MSI-X table
> +	 * if supported.
> +	 */
> +	reg_size = NVME_REG_DBS + (NVMET_NR_QUEUES * 2 * sizeof(u32));
> +	reg_size = ALIGN(reg_size, 8);
> +
> +	if (epc_features->msix_capable) {
> +		size_t pba_size;
> +
> +		msix_table_size = PCI_MSIX_ENTRY_SIZE * epf->msix_interrupts;
> +		nvme_epf->msix_table_offset = reg_size;
> +		pba_size = ALIGN(DIV_ROUND_UP(epf->msix_interrupts, 8), 8);
> +
> +		reg_size += msix_table_size + pba_size;
> +	}
> +
> +	reg_bar_size = ALIGN(reg_size, max(epc_features->align, 4096));

