Re: another nvme passthrough design based on nvme hardware queue file abstraction

On Thu, Apr 27, 2023 at 08:17:30PM +0800, Xiaoguang Wang wrote:
> > On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
> >> hi all,
> >>
> >> Recently we started testing the nvme passthrough feature, which is based on io_uring. Originally we
> >> thought its performance would be much better than a normal polled nvme test, but the test results
> >> show that it's not:
> >> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1  -u1 /dev/ng1n1
> >> IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
> >> IOPS=891.07K, BW=435MiB/s, IOS/call=31/31
> >>
> >> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
> >> IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
> >> IOPS=808.13K, BW=394MiB/s, IOS/call=32/32
> >>
> >> about a 10% iops improvement. I'm not saying it's not good, I just had thought it would
> >> perform much better.
> > What did you think it should be? What is the maximum 512b read IOPs your device
> > is capable of producing?
> From the naming of this feature, I thought it would bypass the block layer entirely and hence
> would gain much higher performance. For me, if this feature could improve iops by 25% or
> more, that would be much more attractive, and users would be keen to try it. Again, I'm
> not saying this feature is not good, just that I thought it would perform much better for small io.

It does bypass the block layer. The driver just uses library functions provided
by the block layer for things it doesn't want to duplicate. Reimplementing that
functionality in the driver isn't going to improve anything.
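
For reference, this is roughly what a single passthrough read looks like from
userspace (more or less what t/io_uring -u1 drives): the NVMe command is built
in the extended SQE and handed straight to the driver via IORING_OP_URING_CMD,
with the block layer only supplying shared plumbing around submission and
polling. Untested sketch only; the 4k read at LBA 0, the 512b LBA assumption
and the NVME_OPC_READ define are illustrative, not a reference implementation.

/*
 * Untested sketch: one 4k passthrough read at LBA 0, assuming a 512b LBA
 * format as in the runs above.  Needs kernel/liburing support for
 * IORING_OP_URING_CMD (5.19+) and appropriate privileges.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

#define NVME_OPC_READ	0x02	/* NVM command set: Read */

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct nvme_uring_cmd *cmd;
	void *buf;
	int fd, nsid, ret;

	fd = open("/dev/ng1n1", O_RDONLY);	/* char device, not /dev/nvme1n1 */
	if (fd < 0) {
		perror("open");
		return 1;
	}
	nsid = ioctl(fd, NVME_IOCTL_ID);	/* namespace id for the command */
	if (nsid < 0) {
		perror("NVME_IOCTL_ID");
		return 1;
	}

	/* passthrough needs the big SQE/CQE layouts; add IORING_SETUP_IOPOLL
	   to match the polled (-p1) runs above */
	ret = io_uring_queue_init(8, &ring, IORING_SETUP_SQE128 | IORING_SETUP_CQE32);
	if (ret) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}
	if (posix_memalign(&buf, 4096, 4096))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_rw(IORING_OP_URING_CMD, sqe, fd, NULL, 0, 0);
	sqe->cmd_op = NVME_URING_CMD_IO;

	/* the NVMe SQE itself is built here and lands in the 80-byte cmd area */
	cmd = (struct nvme_uring_cmd *)sqe->cmd;
	memset(cmd, 0, sizeof(*cmd));
	cmd->opcode   = NVME_OPC_READ;
	cmd->nsid     = nsid;
	cmd->addr     = (uint64_t)(uintptr_t)buf;
	cmd->data_len = 4096;
	cmd->cdw10    = 0;		/* starting LBA, low 32 bits */
	cmd->cdw11    = 0;		/* starting LBA, high 32 bits */
	cmd->cdw12    = 8 - 1;		/* number of 512b LBAs, 0-based */

	io_uring_submit(&ring);
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret) {
		fprintf(stderr, "wait_cqe: %d\n", ret);
		return 1;
	}
	printf("passthrough read: cqe->res=%d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	free(buf);
	close(fd);
	return 0;
}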

> >> In our kernel config there are no active q->stats->callbacks, but we still see this overhead.
> >>
> >> 2. 0.97%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg_from_css
> >>     0.85%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg
> >>     0.74%  io_uring  [kernel.vmlinux]  [k] blkg_lookup_create
> >> Since the nvme passthrough feature dispatches nvme commands to the nvme
> >> controller directly, it should be able to get rid of these overheads.
> >>
> >> 3. 3.19%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
> >>     2.65%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
> >> Frequent rcu_read_lock/unlock overheads; not sure whether we can improve this a bit.
> >>
> >> 4. 7.90%  io_uring  [nvme]            [k] nvme_poll
> >>     3.59%  io_uring  [nvme_core]       [k] nvme_ns_chr_uring_cmd_iopoll
> >>     2.63%  io_uring  [kernel.vmlinux]  [k] blk_mq_poll_classic
> >>     1.88%  io_uring  [nvme]            [k] nvme_poll_cq
> >>     1.74%  io_uring  [kernel.vmlinux]  [k] bio_poll
> >>     1.89%  io_uring  [kernel.vmlinux]  [k] xas_load
> >>     0.86%  io_uring  [kernel.vmlinux]  [k] xas_start
> >>     0.80%  io_uring  [kernel.vmlinux]  [k] xas_start
> >> Seems that the block poll operation call chain is somewhat deep, also
> > It's not really that deep, though the xarray lookups are unfortunate.
> >
> > And if you were to remove block layer, it looks like you'd end up just shifting
> > the CPU utilization to a different polling function without increasing IOPs.
> > Your hardware doesn't look fast enough for this software overhead to be a
> > concern.
> No, I'm afraid I don't agree with you here, sorry. Real products (unlike the t/io_uring tool,
> which just polls the block layer while ios are issued) have plenty of other work
> to run, such as network processing. If we can cut the nvme passthrough overhead further,
> the saved cpu can be used to do other useful work.

You initiated this thread with supposedly underwhelming IOPs improvements from
the io engine, but now you've shifted your criteria.

You can always turn off the kernel's stats and cgroups if you don't find them
useful.
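
For instance (untested sketch, assuming the nvme1n1 device from the runs above
and root privileges), clearing the queue's iostats attribute turns off the
per-queue accounting, and the bio_associate_blkg/blkg_lookup_create samples
disappear entirely on a kernel built without CONFIG_BLK_CGROUP:

/*
 * Untested sketch: the C equivalent of
 * "echo 0 > /sys/block/nvme1n1/queue/iostats", which clears
 * QUEUE_FLAG_IO_STAT and stops per-queue I/O accounting.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *knob = "/sys/block/nvme1n1/queue/iostats";
	int fd = open(knob, O_WRONLY);

	if (fd < 0 || write(fd, "0", 1) != 1) {
		perror(knob);
		return 1;
	}
	close(fd);
	return 0;
}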


