On Thu, Nov 09 2017 at 12:44pm -0500,
Christoph Hellwig <hch@xxxxxx> wrote:

> This patch adds native multipath support to the nvme driver.  For each
> namespace we create only a single block device node, which can be used
> to access that namespace through any of the controllers that refer to it.
> The gendisk for each controller's path to the namespace still exists
> inside the kernel, but is hidden from userspace.  The character device
> nodes are still available on a per-controller basis.  A new link from
> the sysfs directory for the subsystem allows finding all controllers
> for a given subsystem.
>
> Currently we will always send I/O to the first available path; this will
> be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace.  Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is no plan to implement round-robin
> or I/O service-time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
>
> The multipath device will go away once all paths to it disappear;
> any delay to keep it alive needs to be implemented at the controller
> level.
>
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>

Your 0th header (the cover letter) speaks to the NVMe multipath IO path
leveraging NVMe's lack of partial completion, but I think it'd be useful to
have this header (the one that actually gets committed) speak to it as well.

> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> new file mode 100644
> index 000000000000..062754ebebfd
> --- /dev/null
> +++ b/drivers/nvme/host/multipath.c

...
> +void nvme_failover_req(struct request *req)
> +{
> +	struct nvme_ns *ns = req->q->queuedata;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&ns->head->requeue_lock, flags);
> +	blk_steal_bios(&ns->head->requeue_list, req);
> +	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> +	blk_mq_end_request(req, 0);
> +
> +	nvme_reset_ctrl(ns->ctrl);
> +	kblockd_schedule_work(&ns->head->requeue_work);
> +}

Also, the block core patch to introduce blk_steal_bios() already went in, but
should there be a QUEUE_FLAG that gets set by drivers like NVMe that don't
support partial completion?  This would make it easier for other future
drivers to know whether they can use a more optimized IO path.

Mike