Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands

Kanchan Joshi <joshi.k@xxxxxxxxxxx> · Wed, 13 Jul 2022 11:07:57 +0530

The way I would do this is that in nvme_ioucmd_failover_req (or in the
retry driven from command retriable failure) I would do the above,
requeue it and kick the requeue work, to go over the requeue_list and
just execute them again. Not sure why you even need an explicit retry
code.
During retry we need the passthrough command. But the passthrough command is
not stable (i.e. it is valid only during the first submission). We can make it
stable either by:
(a) allocating in nvme, or
(b) returning -EAGAIN to io_uring, which will then do allocate + deferral.
Both add a cost. And since any command can potentially fail, that
means taking that cost for every IO that we issue on the mpath node, even if
no failure (initial or subsequent, after IO) occurred.

As mentioned, I think that if a driver consumes a command as queued,
it needs a stable copy for a later reformation of the request for
failover purposes.

So what do you propose to make that stable?
As I mentioned earlier, stable copy requires allocating/copying in fast
path. And for a condition (failover) that may not even occur.
I really think the current solution is much better as it does not try to make
it stable. Rather it assembles pieces of passthrough command if retry
(which is rare) happens.

Well, I can understand that io_uring_cmd is space constrained, otherwise
we wouldn't be having this discussion. 

Indeed. If we had space for keeping passthrough command stable for
retry, that would really have simplified the plumbing. Retry logic would
be same as first submission.

However io_kiocb is less
constrained, and could be used as a context to hold such a space.

Even if it is undesired to have io_kiocb be passed to uring_cmd(), it
can still hold a driver specific space paired with a helper to obtain it
(i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the
space is pre-allocated it is only a small memory copy for a stable copy
that would allow a saner failover design.
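
A minimal user-space sketch of that idea, assuming the per-request context
can carry driver-private bytes. Every name here (model_kiocb,
model_uring_cmd, model_cmd_to_driver_ctx, model_make_cmd_stable) is
hypothetical and only models the proposal; none of this is existing
io_uring API. The 72-byte size is the figure quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 72 bytes is the passthrough command size quoted in this thread. */
#define MODEL_DRIVER_CTX_SIZE 72

/* Stand-in for io_uring_cmd: cmd points into the SQE and is not stable. */
struct model_uring_cmd {
    const void *cmd;
};

/* Stand-in for io_kiocb, extended with hypothetical driver-private space. */
struct model_kiocb {
    struct model_uring_cmd uring_cmd;
    unsigned char driver_ctx[MODEL_DRIVER_CTX_SIZE];
};

/* Hypothetical helper: map the command back to its pre-allocated space. */
static void *model_cmd_to_driver_ctx(struct model_uring_cmd *ioucmd)
{
    struct model_kiocb *kiocb = (struct model_kiocb *)
        ((char *)ioucmd - offsetof(struct model_kiocb, uring_cmd));

    return kiocb->driver_ctx;
}

/* One small copy at submission; afterwards cmd no longer points at the SQE. */
static void model_make_cmd_stable(struct model_uring_cmd *ioucmd, size_t len)
{
    void *ctx = model_cmd_to_driver_ctx(ioucmd);

    assert(len <= MODEL_DRIVER_CTX_SIZE);
    memcpy(ctx, ioucmd->cmd, len);
    ioucmd->cmd = ctx;
}
```

With the copy in place, a retry could read ioucmd->cmd exactly as the first
submission did, at the price of one memcpy per command and the extra bytes
in every request context.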

I am thinking along the same lines, but it's not about a few bytes of
space; rather, we need 80 (72 to be precise). Will think more, but
these 72 bytes really stand tall in front of my optimism.

Do you see anything possible on the nvme side?
Even now, the passthrough command (although in a modified form) gets copied
into preallocated space, i.e. nvme_req(req)->cmd. This part -

void nvme_init_request(struct request *req, struct nvme_command *cmd) 
{
	...
	memcpy(nvme_req(req)->cmd, cmd, sizeof(*cmd)); 
}

So to avoid common-path cost, we do nothing (no allocation,
no deferral) at the outset and choose to recreate the passthrough
command if a failure occurs. Hope this explains the purpose of
nvme_uring_cmd_io_retry?
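
The trade-off can be modelled in user space roughly as follows. This is
only a sketch under assumptions: model_cmd, model_req, model_init_request
and model_retry are hypothetical stand-ins, not the patch's code, and the
real nvme_uring_cmd_io_retry also has to deal with the old bio and the pdu.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for struct nvme_command. */
struct model_cmd {
    unsigned char opcode;
    unsigned int nsid;
    unsigned long long addr;
    unsigned int data_len;
};

/* Stand-in for a request with preallocated command space. */
struct model_req {
    int path_id;            /* which path the request was issued on */
    struct model_cmd cmd;   /* plays the role of nvme_req(req)->cmd */
};

/* First submission: the only copy made is into the request itself. */
static void model_init_request(struct model_req *req, int path_id,
                               const struct model_cmd *cmd)
{
    req->path_id = path_id;
    memcpy(&req->cmd, cmd, sizeof(*cmd));
}

/*
 * Retry: the caller's original command may be gone by now, so the new
 * request is rebuilt from the copy held by the old request.
 */
static void model_retry(struct model_req *nreq, const struct model_req *oreq,
                        int new_path_id)
{
    assert(nreq != oreq);
    model_init_request(nreq, new_path_id, &oreq->cmd);
}
```

Nothing extra is allocated on the first submission; the rebuild work is
done only when a retry actually happens.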

I think it does, but I'm still having a hard time with it...

Maybe I am reiterating, but here are a few main differences that should help -

- io_uring_cmd is at the forefront, and bio/request are secondary
objects. The former is a persistent object across requeue attempts and the
only thing available when we discover the path, while the other two are
created every time we retry.

- Receiving bio from upper layer is a luxury that we do not have for
 passthrough. When we receive bio, pages are already mapped and we
 do not have to deal with user-specific fields, so there is more
 freedom in using arbitrary context (workers etc.). But passthrough
 command continues to point to user-space fields/buffers, so we need
 that task context.

+    req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
+            d.data_len, nvme_to_user_ptr(d.metadata),
+            d.metadata_len, 0, &meta, d.timeout_ms ?
+            msecs_to_jiffies(d.timeout_ms) : 0,
+            ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0);
+    if (IS_ERR(req))
+        return PTR_ERR(req);
+
+    req->end_io = nvme_uring_cmd_end_io;
+    req->end_io_data = ioucmd;
+    pdu->bio = req->bio;
+    pdu->meta = meta;
+    req->cmd_flags |= REQ_NVME_MPATH;
+    blk_execute_rq_nowait(req, false);
+    return -EIOCBQUEUED;
+}
+
+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd)
+{
+    struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
+    struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head,
+            cdev);
+    int srcu_idx = srcu_read_lock(&head->srcu);
+    struct nvme_ns *ns = nvme_find_path(head);
+    unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 |
+        IO_URING_F_MPATH;
+    struct device *dev = &head->cdev_device;
+
+    if (likely(ns)) {
+        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+        struct request *oreq = pdu->req;
+        int ret;
+
+        if (oreq == NULL) {
+            /*
+             * this was not submitted (to device) earlier. For this
+             * ioucmd->cmd points to persistent memory. Free that
+             * up post submission
+             */
+            const void *cmd = ioucmd->cmd;
+
+            ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+            kfree(cmd);
+        } else {
+            /*
+             * this was submitted (to device) earlier. Use old
+             * request, bio (if it exists) and nvme-pdu to recreate
+             * the command for the discovered path
+             */
+            ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu);

Why is this needed? Why is reuse important here? Why not always call
nvme_ns_uring_cmd?

Please see the previous explanation.
The if branch is for the case when we made the passthrough command stable
by allocating beforehand.
The else branch is for the case when we avoided taking that cost.

The current design of the multipath failover code is clean:
1. extract bio(s) from request
2. link in requeue_list
3. schedule requeue_work that,
3.1 takes bios 1-by-1, and submits them again (exactly the same way)
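
The steps above can be sketched in user space as below. The types and
functions (model_bio, model_request, model_head, model_failover_req,
model_requeue_work) are hypothetical stand-ins, not the driver's code, and
locking and scheduling of the work item are left out.

```c
#include <assert.h>
#include <stddef.h>

struct model_bio {
    int id;
    struct model_bio *next;
};

struct model_request {
    struct model_bio *bio;          /* chain of bios carried by the request */
};

struct model_head {
    struct model_bio *requeue_list; /* bios waiting to be resubmitted */
    struct model_bio **requeue_tail;
    int submitted;                  /* counts calls to model_submit_bio */
};

static void model_head_init(struct model_head *h)
{
    h->requeue_list = NULL;
    h->requeue_tail = &h->requeue_list;
    h->submitted = 0;
}

/* The single submission path, used for first submission and for retry. */
static void model_submit_bio(struct model_head *h, struct model_bio *bio)
{
    (void)bio;
    h->submitted++;
}

/* Steps 1 and 2: steal the bios from the request, append to requeue_list. */
static void model_failover_req(struct model_head *h, struct model_request *req)
{
    struct model_bio *bio = req->bio;

    req->bio = NULL;
    if (!bio)
        return;
    *h->requeue_tail = bio;
    while (bio->next)
        bio = bio->next;
    h->requeue_tail = &bio->next;
}

/* Steps 3 and 3.1: drain the list and submit each bio the same way again. */
static void model_requeue_work(struct model_head *h)
{
    struct model_bio *bio = h->requeue_list;

    h->requeue_list = NULL;
    h->requeue_tail = &h->requeue_list;
    while (bio) {
        struct model_bio *next = bio->next;

        bio->next = NULL;
        model_submit_bio(h, bio);
        bio = next;
    }
}
```

The point of the model is that retry goes through the same
model_submit_bio as the first submission, which works because a bio
already carries everything needed to resubmit it.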

It is really clean, and fits really well with bio based entry interface.
But as I said earlier, it does not go well with the uring-cmd based entry
interface, and a bunch of other things need to be done
differently for a generic passthrough command.

I'd like us to try to follow the same design where retry is
literally "do the exact same thing, again". That would eliminate
two submission paths that do the same thing, but slightly different.

Exact same thing is possible if we make the common path slow i.e.
allocate/copy passthrough command and keep it alive until completion.
But that is really not the way to go I suppose.

I'm not sure. With Christoph's response, I'm not sure it is
universally desired to support failover (in my opinion it should). But
if we do in fact choose to support it, I think we need a better
solution. If fast-path allocation is your prime concern, then let's try
to address that with space pre-allocation.

Sure. I understand the benefit that space pre-allocation will give.

And overall, these are the top three things to iron out: 
- to do (failover) or not to do
- better way to keep the passthrough-cmd stable
- better way to link io_uring_cmd