hi,

> On Thu, Apr 27, 2023 at 08:17:30PM +0800, Xiaoguang Wang wrote:
>>> On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
>>>> hi all,
>>>>
>>>> Recently we started to test the nvme passthrough feature, which is based on
>>>> io_uring. Originally we thought its performance would be much better than a
>>>> normal polled nvme test, but the test results show that it's not:
>>>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1 -u1 /dev/ng1n1
>>>> IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
>>>> IOPS=891.07K, BW=435MiB/s, IOS/call=31/31
>>>>
>>>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
>>>> IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
>>>> IOPS=808.13K, BW=394MiB/s, IOS/call=32/32
>>>>
>>>> That is about a 10% iops improvement. I'm not saying it's not good, I just
>>>> thought it would perform much better.
>>> What did you think it should be? What is the maximum 512b read IOPs your device
>>> is capable of producing?
>> From the naming of this feature, I thought it would bypass the block layer
>> entirely and hence gain much higher performance. For myself, if this feature
>> improved performance by 25% or more, it would be much more attractive and users
>> would like to try it. Again, I'm not saying this feature is not good, I just
>> thought it would perform much better for small IOs.
> It does bypass the block layer. The driver just uses library functions provided
> by the block layer for things it doesn't want to duplicate. Reimplementing that
> functionality in the driver isn't going to improve anything.
>
>>>> In our kernel config there are no active q->stats->callbacks, but we still
>>>> see this overhead.
>>>>
>>>> 2.   0.97%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg_from_css
>>>>      0.85%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg
>>>>      0.74%  io_uring  [kernel.vmlinux]  [k] blkg_lookup_create
>>>> The nvme passthrough feature dispatches nvme commands to the nvme controller
>>>> directly, so it should be able to get rid of these overheads.
>>>>
>>>> 3.   3.19%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
>>>>      2.65%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
>>>> Frequent rcu_read_lock/unlock overheads; not sure whether we can improve this
>>>> a bit.
>>>>
>>>> 4.   7.90%  io_uring  [nvme]            [k] nvme_poll
>>>>      3.59%  io_uring  [nvme_core]       [k] nvme_ns_chr_uring_cmd_iopoll
>>>>      2.63%  io_uring  [kernel.vmlinux]  [k] blk_mq_poll_classic
>>>>      1.88%  io_uring  [nvme]            [k] nvme_poll_cq
>>>>      1.74%  io_uring  [kernel.vmlinux]  [k] bio_poll
>>>>      1.89%  io_uring  [kernel.vmlinux]  [k] xas_load
>>>>      0.86%  io_uring  [kernel.vmlinux]  [k] xas_start
>>>>      0.80%  io_uring  [kernel.vmlinux]  [k] xas_start
>>>> It seems that the block poll operation call chain is somewhat deep, also
>>> It's not really that deep, though the xarray lookups are unfortunate.
>>>
>>> And if you were to remove the block layer, it looks like you'd end up just shifting
>>> the CPU utilization to a different polling function without increasing IOPs.
>>> Your hardware doesn't look fast enough for this software overhead to be a
>>> concern.
>> No, I'm afraid I don't agree with you here, sorry. Real products (unlike the
>> t/io_uring tool, which just polls the block layer once IOs are issued) have
>> plenty of other work to run, such as network processing. If we can cut the nvme
>> passthrough overhead further, the saved cpu can be used to do other useful work.
> You initiated this thread with supposed underwhelming IOPs improvements from
> the io engine, but now you've shifted your criteria.
Sorry, but how did you come to the conclusion that I have shifted my criteria...
I'm not a native English speaker and may not have expressed my thoughts clearly.
I also forgot to mention that in real products we may indeed manage more than one
nvme ssd with one cpu (the software is tasksetted to the corresponding cpu), so I
think the software overhead is a concern. No offense at all, I initiated this
thread just to discuss whether we can improve nvme passthrough performance
further. For myself, I also need to understand the nvme code better.

>
> You can always turn off the kernel's stats and cgroups if you don't find them
> useful.
Regarding cgroups, do you mean disabling CONFIG_BLK_CGROUP? I'm not sure that
would work for us: a physical machine may have many disk drives, and the other
drives may still need blkcg.

Regards,
Xiaoguang Wang
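
P.S. To make the blkcg question above a bit more concrete, here is a rough sketch
of how I would try to confirm where bio_associate_blkg_from_css() is called from
on the passthrough path. This is only an assumption on my side (it needs bpftrace
and relies on that function not being inlined in the build); the function name is
simply taken from the perf output quoted above:

$ sudo bpftrace -e 'kprobe:bio_associate_blkg_from_css { @callers[kstack(5)] = count(); }'

If the collected stacks point at the /dev/ng1n1 uring_cmd submission path, that
would at least confirm this association cost is paid for every passthrough IO.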