Background & Objective:
-----------------------
New storage interfaces/features, especially in NVMe, are emerging fast. NVMe now has three command sets (NVM, ZNS and KV), and the number is only going to grow (e.g. computational storage). Many of these new commands do not fit well into the existing block abstraction and/or syscalls. Whether it is a somewhat specialized operation, or a new way of doing a classical read/write (e.g. zone-append, the copy command) - it takes a good deal of consensus/time for a new device interface to climb the ladder of kernel abstractions and become available for user-space consumption. This presents challenges for early adopters, and at times leads to kernel-bypass.

The passthrough interface cuts through the abstractions and allows applications to use any arbitrary nvme command readily, similar to kernel-bypass solutions. But passthrough does not scale, as it travels via the sync ioctl interface, which is particularly painful for fast/parallel NVMe storage.

The objective is to revamp the existing passthru interface and turn it into something that applications can readily use to play with new/emerging features of NVMe.

Current state of work:
----------------------
1. The block interface is subject to compatibility, of course. But nvme now also exposes a generic char interface (/dev/ng) which is not subject to those conditions [1]. When passthru is combined with this generic char interface, applications get a sure-fire way to operate an nvme device for any current/future command set. This settles the availability problem.

2. For the scalability problem, we are discussing the new "uring-cmd" facility that Jens proposed in io_uring [2]. This enables using io_uring for any arbitrary command (ioctl, fsctl etc.) exposed by the underlying component (driver, FS etc.).

3. I have posted patches combining nvme-passthru with uring-cmd [3]. This new uring-passthru path enables a bunch of capabilities - async transport, fixed buffers, async polling, bio-cache etc. This scales well.
512b randread KIOPS, comparing uring-passthru-over-char (/dev/ng0n1) with uring-over-block (/dev/nvme0n1):

QD     uring    pt    uring-poll   pt-poll
8        538    589       831        902
64       967   1131      1351       1378
256     1043   1230      1376       1429

Discussion points:
------------------
I'd like to propose a session to go over:

- What are the issues in getting the above work (uring-cmd and the new nvme passthru) merged?

- What other useful things should be added to nvme-passthru? For example, lack of vectored IO for passthru was one such missing piece; that is covered from nvme 5.18 onwards [4]. But are there other things that user-space would need before it starts treating this path as a good alternative to kernel-bypass?

- Despite the numbers above, nvme passthru has more room for efficiency. For example, unlike regular IO, we do copy_from_user to fetch the command and put_user to return the result. Eliminating some of this may require a new ioctl. There may be other opinions on what else needs an overhaul in this path.

- What would be a good way to upstream the tests? Nvme-cli may not be very useful. Should it be similar to fio's sg ioengine? But unlike sg, here we are combining ng with io_uring, and one would want to retain all the io_uring tunables (registered/fixed buffers, sqpoll etc.).

- All the above is for 2.0 passthru, which essentially forms a direct path between io_uring and nvme - and the io_uring and nvme programming models share many similarities. For 3.0 passthru, would it be crazy to think of trimming the path further by eliminating the block layer and doing things without "struct request"? There is some interest in developing user-space block devices [5] and filesystems anyway.
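On the test-upstreaming point, one shape this could take is an fio ioengine alongside sg. The job file below is a hypothetical sketch of what exercising uring-passthru might look like: the engine name "io_uring_cmd" and its option set are assumptions modeled on fio's existing io_uring engine, not a committed interface, and /dev/ng0n1 is system-specific.

```ini
; Hypothetical fio job for uring-passthru over the char node.
; Engine name and options are assumptions, not an existing fio interface.
[global]
filename=/dev/ng0n1
rw=randread
bs=512
iodepth=64
direct=1

[uring-passthru]
ioengine=io_uring_cmd
; retain the usual io_uring tunables:
fixedbufs=1
registerfiles=1
sqthread_poll=1
```

The point of keeping this in fio rather than nvme-cli is exactly the last bullet above: all of io_uring's knobs (fixed/registered buffers, sqpoll, polling) stay available per job.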
[1] https://lore.kernel.org/linux-nvme/20210421074504.57750-1-minwoo.im.dev@xxxxxxxxx/
[2] https://lore.kernel.org/linux-nvme/20210317221027.366780-1-axboe@xxxxxxxxx/
[3] https://lore.kernel.org/linux-nvme/20211220141734.12206-1-joshi.k@xxxxxxxxxxx/
[4] https://lore.kernel.org/linux-nvme/20220216080208.GD10554@xxxxxx/
[5] https://lore.kernel.org/linux-block/87tucsf0sr.fsf@xxxxxxxxxxxxx/