On Sat, Apr 29, 2023 at 10:55 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> On 4/29/23 3:39 AM, Kanchan Joshi wrote:
> > This series shows one way to do what the title says.
> > This puts up a more direct/lean path that enables
> >  - submission from io_uring SQE to NVMe SQE
> >  - completion from NVMe CQE to io_uring CQE
> > Essentially cutting the hoops (involving request/bio) for the nvme
> > io path.
> >
> > Also, the io_uring ring is not to be shared among application
> > threads. The application is responsible for building the sharing
> > (if it feels the need). This means a ring-associated exclusive
> > queue can do away with some synchronization costs that occur for a
> > shared queue.
> >
> > The primary objective is to further amp up the efficiency of the
> > kernel io path (towards PCIe gen N, N+1 hardware).
> > And we are seeing some asks too [1].
> >
> > Building-blocks
> > ===============
> > At a high level, the series can be divided into the following parts:
> >
> > 1. The nvme driver starts exposing some queue-pairs (SQ+CQ) that
> > can be attached to another in-kernel user (not just to the
> > block-layer, which is the case at the moment) on demand.
> >
> > Example:
> > insmod nvme.ko poll_queues=1 raw_queues=2
> >
> > nvme0: 24/0/1/2 default/read/poll queues/raw queues
> >
> > While the driver registers the other queues with the block-layer,
> > raw-queues are reserved for exclusive attachment to other in-kernel
> > users. At this point, each raw-queue is interrupt-disabled (similar
> > to poll_queues). Maybe we need a better name for these (e.g.
> > app/user queues).
> > [Refer: patch 2]
> >
> > 2. register/unregister queue interface
> > (a) one for an io_uring application to ask for a device-queue and
> > register it with the ring. [Refer: patch 4]
> > (b) another at the nvme level, so that other in-kernel users
> > (io_uring for now) can ask for a raw-queue. [Refer: patch 3, 5, 6]
> >
> > The latter returns a qid, which io_uring stores internally (not
> > exposed to user-space) in the ring ctx. At most one queue per ring
> > is enabled. The ring has no other special properties except the
> > fact that it stores a qid that it can use exclusively. So the
> > application can very well use the ring to do things other than
> > nvme io.
> >
> > 3. user-interface to send commands down this way
> > (a) uring-cmd is extended to support a new flag
> > "IORING_URING_CMD_DIRECT" that the application passes in the SQE.
> > That is all.
> > (b) the flag goes down to the provider of ->uring_cmd, which may
> > choose to do things differently based on it (or ignore it).
> > [Refer: patch 7]
> >
> > 4. nvme uring-cmd understands the above flag. It submits the
> > command into the known pre-registered queue, and completes
> > (polled-completion) from it. The transformation from "struct
> > io_uring_cmd" to "nvme command" is done directly, without building
> > other intermediate constructs.
> > [Refer: patch 8, 10, 12]
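> >
> > For illustration, once a queue has been registered with the ring
> > (patch 4), the submission path from userspace looks roughly like
> > this (a sketch in liburing style, not verbatim from the patches;
> > error handling omitted, IORING_URING_CMD_DIRECT is the new flag
> > from patch 7):
> >
> >     struct io_uring ring;
> >     struct io_uring_sqe *sqe;
> >     struct nvme_uring_cmd *cmd;
> >     void *buf;
> >     int fd;
> >
> >     /* ring needs big SQEs/CQEs for nvme passthrough commands */
> >     io_uring_queue_init(64, &ring,
> >                         IORING_SETUP_SQE128 | IORING_SETUP_CQE32);
> >     fd = open("/dev/ng0n1", O_RDONLY);   /* nvme char device */
> >     posix_memalign(&buf, 4096, 4096);
> >
> >     /* ...queue registration with the ring goes here (patch 4)... */
> >
> >     sqe = io_uring_get_sqe(&ring);
> >     sqe->opcode = IORING_OP_URING_CMD;
> >     sqe->fd = fd;
> >     sqe->cmd_op = NVME_URING_CMD_IO;
> >     /* the new flag: submit via the ring's pre-registered queue */
> >     sqe->uring_cmd_flags = IORING_URING_CMD_DIRECT;
> >
> >     cmd = (struct nvme_uring_cmd *)sqe->cmd;
> >     cmd->opcode = 0x02;                  /* nvme read */
> >     cmd->nsid = 1;
> >     cmd->addr = (__u64)(uintptr_t)buf;
> >     cmd->data_len = 4096;
> >     cmd->cdw10 = 0;                      /* slba, lower 32 bits */
> >     cmd->cdw11 = 0;                      /* slba, upper 32 bits */
> >     cmd->cdw12 = 7;                      /* 0-based count: 8 x 512b */
> >
> >     io_uring_submit(&ring);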
> >
> > Testing and Performance
> > =======================
> > fio and t/io_uring are modified to exercise this path.
> > - fio: new "registerqueues" option
> > - t/io_uring: new "k" option
> >
> > Good part:
> > 2.96M -> 5.02M
> >
> > nvme io (without this):
> > # t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k0 /dev/ng0n1
> > submitter=0, tid=2922, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=0 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=2.89M, BW=1412MiB/s, IOS/call=2/1
> > IOPS=2.92M, BW=1426MiB/s, IOS/call=2/2
> > IOPS=2.96M, BW=1444MiB/s, IOS/call=2/1
> > Exiting on timeout
> > Maximum IOPS=2.96M
> >
> > nvme io (with this):
> > # t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=2927, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=4.99M, BW=2.43GiB/s, IOS/call=2/1
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
> > Exiting on timeout
> > Maximum IOPS=5.02M
> >
> > Not so good part:
> > While single IO is fast this way, we do not have batching abilities
> > for the multi-io scenario. Plugging, submission batching and
> > completion batching are tied to block-layer constructs. Things
> > should look better if we could do something about that.
> > In particular, something seems off with the completion-batching.
> >
> > With -s32 and -c32, the numbers decline:
> >
> > # t/io_uring -b512 -d64 -c32 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=3674, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=3.70M, BW=1806MiB/s, IOS/call=32/31
> > IOPS=3.71M, BW=1812MiB/s, IOS/call=32/31
> > IOPS=3.71M, BW=1812MiB/s, IOS/call=32/32
> > Exiting on timeout
> > Maximum IOPS=3.71M
> >
> > And perf gets restored if we go back to -c2:
> >
> > # t/io_uring -b512 -d64 -c2 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=3677, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=4.99M, BW=2.44GiB/s, IOS/call=5/5
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
> > Exiting on timeout
> > Maximum IOPS=5.02M
> >
> > Source
> > ======
> > Kernel: https://github.com/OpenMPDK/linux/tree/feat/directq-v1
> > fio: https://github.com/OpenMPDK/fio/commits/feat/rawq-v2
> >
> > Please take a look.
>
> This looks like a great starting point! Unfortunately I won't be at
> LSFMM this year to discuss it in person, but I'll be taking a closer
> look at this.

That will help, thanks.

> Some quick initial reactions:
>
> - I'd call them "user" queues rather than raw or whatever, I think that
>   more accurately describes what they are for.

Right, that is better.

> - I guess there's no way around needing to pre-allocate these user
>   queues, just like we do for poll_queues right now?

Right, we would need to allocate the nvme sq/cq at the outset.
Changing the count at run-time is a bit murky. I will have another
look though.

> In terms of user API, it'd be nicer if you could just do
> IORING_REGISTER_QUEUE (insert right name here...) and it'd allocate
> and return you an ID.

But this is the API implemented in the patchset at the moment (a new
register code in io_uring). So it seems I am missing your point?
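To make that concrete, registration in the current patches is along
these lines from userspace (a sketch; the opcode and struct names
below are illustrative placeholders, the exact ones are in patch 4):

    /* hypothetical layout of the registration argument */
    struct io_uring_queue_reg {
        __u32 fd;       /* nvme char device, e.g. /dev/ng0n1 */
        __u32 flags;
        __u64 resv[2];
    };

    struct io_uring_queue_reg reg = { .fd = ng_fd };

    /* new register opcode added by this series (name illustrative) */
    ret = syscall(__NR_io_uring_register, ring.ring_fd,
                  IORING_REGISTER_QUEUE, &reg, 1);

On success, io_uring asks nvme for a raw/user queue and stores the
returned qid in the ring ctx; the qid itself is never handed back to
userspace.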
> - Need to take a look at the uring_cmd stuff again, but would be nice if
>   we did not have to add more stuff to fops for this. Maybe we can set
>   aside a range of "ioctl" type commands through uring_cmd for this
>   instead, and go that way for registering/unregistering queues.

Yes, I see your point about not having to add new fops. But a new
uring_cmd opcode lives only at the nvme level. It is a good way to
allocate/deallocate an nvme queue, but it cannot attach that queue to
the io_uring ring. Or do you have a different view? This seems
connected to the previous point.

> We do have some users that are CPU constrained, and while my testing
> easily maxes out a gen2 optane (actually 2 or 3) with the generic IO
> path, that's also with all the fat that adds overhead removed. Most
> people don't have this luxury, necessarily, or actually need some of
> this fat for their monitoring, for example. This would provide a nice
> way to have pretty consistent and efficient performance across distro
> type configs, which would be great, while still retaining the fattier
> bits for "normal" IO.

Makes total sense.
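Coming back to the uring_cmd route for registration: here is a sketch
of why a passthrough opcode alone does not cover the ring attachment
(the opcode name below is invented purely for illustration):

    struct io_uring_cqe *cqe;

    /* hypothetical nvme passthrough opcode for queue allocation */
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = ng_fd;
    sqe->cmd_op = NVME_URING_CMD_ALLOC_QUEUE;   /* invented name */

    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);

    /* the allocated qid would arrive in cqe->res, i.e. in userspace;
     * io_uring itself never learns about it, so a separate
     * register-style hook would still be needed to bind the queue to
     * this ring for the IORING_URING_CMD_DIRECT submission path */
    int qid = cqe->res;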