On Sat, Apr 29, 2023 at 10:55 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> On 4/29/23 3:39 AM, Kanchan Joshi wrote:
> > This series shows one way to do what the title says.
> > This puts up a more direct/lean path that enables
> >  - submission from io_uring SQE to NVMe SQE
> >  - completion from NVMe CQE to io_uring CQE
> > Essentially cutting the hoops (involving request/bio) for the nvme
> > io path.
> >
> > Also, the io_uring ring is not to be shared among application
> > threads. The application is responsible for building the sharing
> > (if it feels the need). This means a ring-associated exclusive
> > queue can do away with some synchronization costs that occur for a
> > shared queue.
> >
> > The primary objective is to further amp up the efficiency of the
> > kernel io path (towards PCIe gen N, N+1 hardware).
> > And we are seeing some asks too [1].
> >
> > Building-blocks
> > ===============
> > At a high level, the series can be divided into the following parts:
> >
> > 1. The nvme driver starts exposing some queue-pairs (SQ+CQ) that
> > can be attached to another in-kernel user (not just to the
> > block-layer, which is the case at the moment) on demand.
> >
> > Example:
> > insmod nvme.ko poll_queues=1 raw_queues=2
> >
> > nvme0: 24/0/1/2 default/read/poll queues/raw queues
> >
> > While the driver registers the other queues with the block-layer,
> > raw-queues are reserved for exclusive attachment to other in-kernel
> > users. At this point, each raw-queue is interrupt-disabled (similar
> > to poll_queues). Maybe we need a better name for these (e.g.
> > app/user queues).
> > [Refer: patch 2]
> >
> > 2. register/unregister queue interface
> > (a) one for an io_uring application to ask for a device-queue and
> > register it with the ring. [Refer: patch 4]
> > (b) another at the nvme level, so that other in-kernel users
> > (io_uring for now) can ask for a raw-queue. [Refer: patch 3, 5, 6]
> >
> > The latter returns a qid, which io_uring stores internally (not
> > exposed to user-space) in the ring ctx. At most one queue per ring
> > is enabled. The ring has no other special properties except the
> > fact that it stores a qid that it can use exclusively. So the
> > application can very well use the ring to do things other than
> > nvme io.
> >
> > 3. user-interface to send commands down this way
> > (a) uring-cmd is extended to support a new flag
> > "IORING_URING_CMD_DIRECT" that the application passes in the SQE.
> > That is all.
> > (b) the flag goes down to the provider of ->uring_cmd, which may
> > choose to do things differently based on it (or ignore it).
> > [Refer: patch 7]
> >
> > 4. nvme uring-cmd understands the above flag. It submits the
> > command into the known pre-registered queue, and completes
> > (polled-completion) from it. The transformation from "struct
> > io_uring_cmd" to "nvme command" is done directly, without building
> > other intermediate constructs.
> > [Refer: patch 8, 10, 12]
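> >
> > For illustration, once a queue has been registered with the ring
> > (patch 4), the submission path from userspace looks roughly like
> > this (a sketch in liburing style, not verbatim from the patches;
> > error handling omitted, IORING_URING_CMD_DIRECT is the new flag
> > from patch 7):
> >
> >     struct io_uring ring;
> >     struct io_uring_sqe *sqe;
> >     struct nvme_uring_cmd *cmd;
> >     void *buf;
> >     int fd;
> >
> >     /* ring needs big SQEs/CQEs for nvme passthrough commands */
> >     io_uring_queue_init(64, &ring,
> >                         IORING_SETUP_SQE128 | IORING_SETUP_CQE32);
> >     fd = open("/dev/ng0n1", O_RDONLY);   /* nvme char device */
> >     posix_memalign(&buf, 4096, 4096);
> >
> >     /* ...queue registration with the ring goes here (patch 4)... */
> >
> >     sqe = io_uring_get_sqe(&ring);
> >     sqe->opcode = IORING_OP_URING_CMD;
> >     sqe->fd = fd;
> >     sqe->cmd_op = NVME_URING_CMD_IO;
> >     /* the new flag: submit via the ring's pre-registered queue */
> >     sqe->uring_cmd_flags = IORING_URING_CMD_DIRECT;
> >
> >     cmd = (struct nvme_uring_cmd *)sqe->cmd;
> >     cmd->opcode = 0x02;                  /* nvme read */
> >     cmd->nsid = 1;
> >     cmd->addr = (__u64)(uintptr_t)buf;
> >     cmd->data_len = 4096;
> >     cmd->cdw10 = 0;                      /* slba, lower 32 bits */
> >     cmd->cdw11 = 0;                      /* slba, upper 32 bits */
> >     cmd->cdw12 = 7;                      /* 0-based count: 8 x 512b */
> >
> >     io_uring_submit(&ring);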
> >
> > Testing and Performance
> > =======================
> > fio and t/io_uring are modified to exercise this path.
> > - fio: new "registerqueues" option
> > - t/io_uring: new "k" option
> >
> > Good part:
> > 2.96M -> 5.02M
> >
> > nvme io (without this):
> > # t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k0 /dev/ng0n1
> > submitter=0, tid=2922, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=0 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=2.89M, BW=1412MiB/s, IOS/call=2/1
> > IOPS=2.92M, BW=1426MiB/s, IOS/call=2/2
> > IOPS=2.96M, BW=1444MiB/s, IOS/call=2/1
> > Exiting on timeout
> > Maximum IOPS=2.96M
> >
> > nvme io (with this):
> > # t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=2927, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=4.99M, BW=2.43GiB/s, IOS/call=2/1
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
> > Exiting on timeout
> > Maximum IOPS=5.02M
> >
> > Not so good part:
> > While single IO is fast this way, we do not have batching abilities
> > for the multi-io scenario. Plugging, submission batching and
> > completion batching are tied to block-layer constructs. Things
> > should look better if we could do something about that.
> > In particular, something seems off with the completion-batching.
> >
> > With -s32 and -c32, the numbers decline:
> >
> > # t/io_uring -b512 -d64 -c32 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=3674, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=3.70M, BW=1806MiB/s, IOS/call=32/31
> > IOPS=3.71M, BW=1812MiB/s, IOS/call=32/31
> > IOPS=3.71M, BW=1812MiB/s, IOS/call=32/32
> > Exiting on timeout
> > Maximum IOPS=3.71M
> >
> > And perf gets restored if we go back to -c2:
> >
> > # t/io_uring -b512 -d64 -c2 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=3677, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=4.99M, BW=2.44GiB/s, IOS/call=5/5
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
> > Exiting on timeout
> > Maximum IOPS=5.02M
> >
> > Source
> > ======
> > Kernel: https://github.com/OpenMPDK/linux/tree/feat/directq-v1
> > fio: https://github.com/OpenMPDK/fio/commits/feat/rawq-v2
> >
> > Please take a look.
>
> This looks like a great starting point! Unfortunately I won't be at
> LSFMM this year to discuss it in person, but I'll be taking a closer
> look at this.

That will help, thanks.

> Some quick initial reactions:
>
> - I'd call them "user" queues rather than raw or whatever, I think that
>   more accurately describes what they are for.

Right, that is better.

> - I guess there's no way around needing to pre-allocate these user
>   queues, just like we do for poll_queues right now?

Right, we would need to allocate the nvme sq/cq at the outset.
Changing the count at run-time is a bit murky. I will have another
look though.

> In terms of user API, it'd be nicer if you could just do
> IORING_REGISTER_QUEUE (insert right name here...) and it'd allocate
> and return you an ID.

But this is the API implemented in the patchset at the moment (a new
register code in io_uring). So it seems I am missing your point?
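To make that concrete, registration in the current patches is along
these lines from userspace (a sketch; the opcode and struct names
below are illustrative placeholders, the exact ones are in patch 4):

    /* hypothetical layout of the registration argument */
    struct io_uring_queue_reg {
        __u32 fd;       /* nvme char device, e.g. /dev/ng0n1 */
        __u32 flags;
        __u64 resv[2];
    };

    struct io_uring_queue_reg reg = { .fd = ng_fd };

    /* new register opcode added by this series (name illustrative) */
    ret = syscall(__NR_io_uring_register, ring.ring_fd,
                  IORING_REGISTER_QUEUE, &reg, 1);

On success, io_uring asks nvme for a raw/user queue and stores the
returned qid in the ring ctx; the qid itself is never handed back to
userspace.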
> - Need to take a look at the uring_cmd stuff again, but would be nice if
>   we did not have to add more stuff to fops for this. Maybe we can set
>   aside a range of "ioctl" type commands through uring_cmd for this
>   instead, and go that way for registering/unregistering queues.

Yes, I see your point about not having to add new fops. But a new
uring_cmd opcode lives only at the nvme level. It is a good way to
allocate/deallocate an nvme queue, but it cannot attach that queue to
the io_uring ring. Or do you have a different view? This seems
connected to the previous point.

> We do have some users that are CPU constrained, and while my testing
> easily maxes out a gen2 optane (actually 2 or 3) with the generic IO
> path, that's also with all the fat that adds overhead removed. Most
> people don't have this luxury, necessarily, or actually need some of
> this fat for their monitoring, for example. This would provide a nice
> way to have pretty consistent and efficient performance across distro
> type configs, which would be great, while still retaining the fattier
> bits for "normal" IO.

Makes total sense.
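Coming back to the uring_cmd route for registration: here is a sketch
of why a passthrough opcode alone does not cover the ring attachment
(the opcode name below is invented purely for illustration):

    struct io_uring_cqe *cqe;

    /* hypothetical nvme passthrough opcode for queue allocation */
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = ng_fd;
    sqe->cmd_op = NVME_URING_CMD_ALLOC_QUEUE;   /* invented name */

    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);

    /* the allocated qid would arrive in cqe->res, i.e. in userspace;
     * io_uring itself never learns about it, so a separate
     * register-style hook would still be needed to bind the queue to
     * this ring for the IORING_URING_CMD_DIRECT submission path */
    int qid = cqe->res;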