Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.

On Tue, Mar 15, 2022 at 09:54:10AM +0100, Christoph Hellwig wrote:
>On Mon, Mar 14, 2022 at 09:53:56PM +0530, Kanchan Joshi wrote:
>>>>+struct nvme_uring_cmd_pdu {
>>>>+	u32 meta_len;
>>>>+	union {
>>>>+		struct bio *bio;
>>>>+		struct request *req;
>>>>+	};
>>>>+	void *meta; /* kernel-resident buffer */
>>>>+	void __user *meta_buffer;
>>>>+} __packed;

>>>Why is this marked __packed?
>>Did not like doing it, but had to.
>>If not packed, this takes 32 bytes of space, while the driver pdu in
>>struct io_uring_cmd can take at most 30 bytes. Packing the nvme pdu
>>brought it down to 28 bytes, which fits and gives 2 bytes back.

>What if you move meta_len to the end?  Even if we need the __packed
>that will avoid all the unaligned access to pointers, which on some
>architectures will crash the kernel.
ah, right. Will move that to the end.
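Something along these lines, I suppose (a sketch only; offsets assume a
64-bit build, with pdu[] itself starting pointer-aligned inside struct
io_uring_cmd):

struct nvme_uring_cmd_pdu {
	union {
		struct bio *bio;
		struct request *req;
	};				/* offset 0 */
	void *meta;			/* offset 8, kernel-resident buffer */
	void __user *meta_buffer;	/* offset 16 */
	u32 meta_len;			/* offset 24, no pointer behind it */
} __packed;				/* 3*8 + 4 = 28 bytes, still fits */

With the u32 last, all three pointers stay naturally aligned even with
__packed in place.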

>>And on moving meta elements outside the driver, my worry is that it
>>reduces the scope of the uring-cmd infra and makes it nvme-passthru
>>specific. At this point uring-cmd is still a generic async ioctl/fsctl
>>facility which may find other users (than nvme-passthru) down the line.
>>The organization of fields within "struct io_uring_cmd" follows the rule
>>that a field is kept out (of the 28-byte pdu) only if it is accessed by
>>both io_uring and the driver.

>We have plenty of other interfaces of that kind.  Sockets are one case
>already, and regular read/write with protection information will be
>another one.  So having some core infrastructure for "secondary data"
>seems very useful.
So what is the picture that you have in mind for struct io_uring_cmd?
Moving the meta fields out makes it look like this:
@@ -28,7 +28,10 @@ struct io_uring_cmd {
       u32             cmd_op;
       u16             cmd_len;
       u16             unused;
-       u8              pdu[28]; /* available inline for free use */
+       void __user     *meta_buffer; /* nvme pt specific */
+       u32             meta_len; /* nvme pt specific */
+       u8              pdu[16]; /* available inline for free use */
+
};
And the corresponding 16-byte nvme pdu:

struct nvme_uring_cmd_pdu {
-       u32 meta_len;
        union {
                struct bio *bio;
                struct request *req;
        };
        void *meta; /* kernel-resident buffer */
-       void __user *meta_buffer;
} __packed;

I do not understand how this helps. Only the generic space (28 bytes)
got reduced to 16 bytes.
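For reference, the driver gets at its pdu by overlaying the generic
space, roughly like this:

static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
		struct io_uring_cmd *ioucmd)
{
	/* the driver pdu must fit within the generic inline area */
	BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
	return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
}

So carving fields out of pdu[] only turns generic bytes into
nvme-specific ones; the overall footprint of struct io_uring_cmd does
not change.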

>>I see, so there is room for adding some efficiency.
>>Hope it will be ok if I carry this out as a separate effort.
>>Since this is about touching blk_mq_complete_request at its heart, and
>>improving sync-direct-io, it does not seem the best fit here and would
>>slow this series down.

>I'd really rather do this properly.  Especially as the current effort
>adds new exported interfaces.

Seems you are referring to io_uring_cmd_complete_in_task().

We would still need to use/export that even if we somehow manage to move
the task-work trigger from the nvme function into blk_mq_complete_request.
io_uring's task-work infra is more baked than the raw task-work infra,
and it would not be good to repeat all that code elsewhere. I tried the
raw one in the first attempt, and Jens suggested moving to the baked one.
Here is the link that gave birth to this interface:
https://lore.kernel.org/linux-nvme/6d847f4a-65a5-bc62-1d36-52e222e3d142@xxxxxxxxx/
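
For reference, the usage in this series boils down to something like
this (simplified):

static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err)
{
	struct io_uring_cmd *ioucmd = req->end_io_data;

	/* we are in irq context here; punt the actual completion work
	 * (copies, cqe posting) to the submitting task's context */
	io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
}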


>>Deferring by ipi or softirq never occurred, neither for block nor for
>>char. The softirq part is expected, since I was not running against scsi
>>(or nvme with a single queue). I could not spot whether this is really
>>an overhead, at least for nvme.

>This tends to kick in if you have less queues than cpu cores.  Quite
>common with either a high core count or a not very high end nvme
>controller.
I will check that.
But switching (irq to task-work) is more generic and not about this
series. Triggering task-work anyway happens for regular read/write
completion too (in io_uring), in the same return path involving
blk_mq_complete_request. For passthru, we are just triggering it
somewhat earlier in the completion path.
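
As a reminder, the tail of that path is just this (condensed from
block/blk-mq.c):

void blk_mq_complete_request(struct request *rq)
{
	/* returns false when no IPI/softirq deferral is needed and the
	 * request can be completed right here in the irq return path */
	if (!blk_mq_complete_request_remote(rq))
		rq->q->mq_ops->complete(rq);
}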
