hi,

> On Thu, Apr 27, 2023 at 08:17:30PM +0800, Xiaoguang Wang wrote:
>>> On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
>>>> hi all,
>>>>
>>>> Recently we started to test the nvme passthrough feature, which is based on
>>>> io_uring. Originally we thought its performance would be much better than a
>>>> normal polled nvme test, but the test results show that it's not:
>>>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1 -u1 /dev/ng1n1
>>>> IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
>>>> IOPS=891.07K, BW=435MiB/s, IOS/call=31/31
>>>>
>>>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
>>>> IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
>>>> IOPS=808.13K, BW=394MiB/s, IOS/call=32/32
>>>>
>>>> That is about a 10% iops improvement. I'm not saying it's not good, I just
>>>> thought it would perform much better.
>>> What did you think it should be? What is the maximum 512b read IOPs your device
>>> is capable of producing?
>> From the naming of this feature, I thought it would bypass the block layer
>> entirely and hence gain much higher performance. For myself, if this feature
>> improved performance by 25% or more, it would be much more attractive and users
>> would like to try it. Again, I'm not saying this feature is not good, I just
>> thought it would perform much better for small IOs.
> It does bypass the block layer. The driver just uses library functions provided
> by the block layer for things it doesn't want to duplicate. Reimplementing that
> functionality in the driver isn't going to improve anything.
>
>>>> In our kernel config there are no active q->stats->callbacks, but we still
>>>> see this overhead.
>>>>
>>>> 2.   0.97%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg_from_css
>>>>      0.85%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg
>>>>      0.74%  io_uring  [kernel.vmlinux]  [k] blkg_lookup_create
>>>> The nvme passthrough feature dispatches nvme commands to the nvme controller
>>>> directly, so it should be able to get rid of these overheads.
>>>>
>>>> 3.   3.19%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
>>>>      2.65%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
>>>> Frequent rcu_read_lock/unlock overheads; not sure whether we can improve this
>>>> a bit.
>>>>
>>>> 4.   7.90%  io_uring  [nvme]            [k] nvme_poll
>>>>      3.59%  io_uring  [nvme_core]       [k] nvme_ns_chr_uring_cmd_iopoll
>>>>      2.63%  io_uring  [kernel.vmlinux]  [k] blk_mq_poll_classic
>>>>      1.88%  io_uring  [nvme]            [k] nvme_poll_cq
>>>>      1.74%  io_uring  [kernel.vmlinux]  [k] bio_poll
>>>>      1.89%  io_uring  [kernel.vmlinux]  [k] xas_load
>>>>      0.86%  io_uring  [kernel.vmlinux]  [k] xas_start
>>>>      0.80%  io_uring  [kernel.vmlinux]  [k] xas_start
>>>> It seems that the block poll operation call chain is somewhat deep, also
>>> It's not really that deep, though the xarray lookups are unfortunate.
>>>
>>> And if you were to remove the block layer, it looks like you'd end up just shifting
>>> the CPU utilization to a different polling function without increasing IOPs.
>>> Your hardware doesn't look fast enough for this software overhead to be a
>>> concern.
>> No, I'm afraid I don't agree with you here, sorry. Real products (unlike the
>> t/io_uring tool, which just polls the block layer once IOs are issued) have
>> plenty of other work to run, such as network processing. If we can cut the nvme
>> passthrough overhead further, the saved cpu can be used to do other useful work.
> You initiated this thread with supposed underwhelming IOPs improvements from
> the io engine, but now you've shifted your criteria.
Sorry, but how did you come to the conclusion that I have shifted my criteria...
I'm not a native English speaker and may not have expressed my thoughts clearly.
I also forgot to mention that in real products we may indeed manage more than one
nvme ssd with one cpu (the software is tasksetted to the corresponding cpu), so I
think the software overhead is a concern. No offense at all, I initiated this
thread just to discuss whether we can improve nvme passthrough performance
further. For myself, I also need to understand the nvme code better.

>
> You can always turn off the kernel's stats and cgroups if you don't find them
> useful.
Regarding cgroups, do you mean disabling CONFIG_BLK_CGROUP? I'm not sure that
would work for us: a physical machine may have many disk drives, and the other
drives may still need blkcg.

Regards,
Xiaoguang Wang
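
P.S. To make the blkcg question above a bit more concrete, here is a rough sketch
of how I would try to confirm where bio_associate_blkg_from_css() is called from
on the passthrough path. This is only an assumption on my side (it needs bpftrace
and relies on that function not being inlined in the build); the function name is
simply taken from the perf output quoted above:

$ sudo bpftrace -e 'kprobe:bio_associate_blkg_from_css { @callers[kstack(5)] = count(); }'

If the collected stacks point at the /dev/ng1n1 uring_cmd submission path, that
would at least confirm this association cost is paid for every passthrough IO.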