On 8/5/22 11:04 AM, Jens Axboe wrote:
> On 8/5/22 9:42 AM, Kanchan Joshi wrote:
>> Hi,
>>
>> Series enables async polling on io_uring command, and nvme passthrough
>> (for io-commands) is wired up to leverage that.
>>
>> 512b randread performance (KIOP) below:
>>
>> QD_batch    block    passthru    passthru-poll    block-poll
>> 1_1            80          81              158           157
>> 8_2           406         470              680           700
>> 16_4          620         656              931           920
>> 128_32        879        1056             1120          1132
>
> Curious on why passthru is slower than block-poll? Are we missing
> something here?

I took a quick peek, running it here. List of items making it slower:

- No fixedbufs support for passthru, so each request goes through
  get_user_pages() at submission and put_pages() on completion. That
  alone is about a 10% difference for me.

- nvme_uring_cmd_io() -> nvme_alloc_user_request() -> blk_rq_map_user()
  -> blk_rq_map_user_iov() -> memset() costs another ~4% for me.

- The kmalloc+kfree per command adds roughly 9% of extra slowdown.

There are other little things, but the above are the main ones. Even if
I disable fixedbufs for non-passthru, passthru is roughly 24% slower
here using a single device and a single core, mostly due to the items
listed above.

This isn't specific to the iopoll support; polling is obviously faster
than IRQ-driven completion for this test case. This is just comparing
passthru with the regular block path for doing random 512b reads.

--
Jens Axboe