On 6/19/24 1:18 AM, hexue wrote:
> io_uring's polling mode can improve IO performance, but it spends 100%
> of the CPU's resources on polling.
>
> This adds a setup flag, "IORING_SETUP_HY_POLL", to give applications an
> interface for enabling a new hybrid polling mode at the io_uring level.
>
> The new hybrid poll is implemented in the io_uring layer. Once an IO is
> issued, it does not poll immediately; the task blocks first and is
> woken shortly before the IO completes, then polls to reap it. This
> keeps the high performance of polling while freeing up some CPU
> resources.
>
> We considered complex situations such as multi-concurrency, different
> processing speeds across multiple disks, etc.
>
> Test results:
> 8 poll queues, fio-3.35, Gen5 SSD, 8 CPU VM
>
> Per-CPU utilization:
>   read(128k, QD64, 1Job)      53%    write(128k, QD64, 1Job)      45%
>   randread(4k, QD64, 16Job)   70%    randwrite(4k, QD64, 16Job)   16%
> Performance reduction:
>   read 0.92%    write 0.92%    randread 1.61%    randwrite 0%

Haven't tried this on slower storage yet, but my usual 122M IOPS polled
test case (24 drives, each using a single thread to load up a drive)
yields the following with hybrid polling enabled:

IOPS=57.08M, BW=27.87GiB/s, IOS/call=32/31
IOPS=56.91M, BW=27.79GiB/s, IOS/call=32/32
IOPS=57.93M, BW=28.29GiB/s, IOS/call=31/31
IOPS=57.82M, BW=28.23GiB/s, IOS/call=32/32

which is even slower than IRQ driven. It does use less CPU, about 1900%
compared to the 2400% it uses with regular polling. And obviously this
is not the best case for this scenario, as these devices have low
latencies. Like I predicted in earlier replies, most of the added
overhead here is TSC reading, outside of the obvious one of now having
wakeups and context switches, about 1M/sec of the latter for this test.

If we move to regular flash, here's another box I have with 32 flash
drives in it. For a similar test, we get:

IOPS=104.01M, BW=50.78GiB/s, IOS/call=31/31
IOPS=103.92M, BW=50.74GiB/s, IOS/call=31/31
IOPS=103.99M, BW=50.78GiB/s, IOS/call=31/31
IOPS=103.97M, BW=50.77GiB/s, IOS/call=31/31
IOPS=104.01M, BW=50.79GiB/s, IOS/call=31/31
IOPS=104.02M, BW=50.79GiB/s, IOS/call=31/31
IOPS=103.62M, BW=50.59GiB/s, IOS/call=31/31

using 3200% CPU (32 drives, 32 threads polling) with regular polling,
and with hybrid polling enabled:

IOPS=53.62M, BW=26.18GiB/s, IOS/call=32/32
IOPS=53.37M, BW=26.06GiB/s, IOS/call=31/31
IOPS=53.45M, BW=26.10GiB/s, IOS/call=32/31
IOPS=53.43M, BW=26.09GiB/s, IOS/call=32/32
IOPS=53.11M, BW=25.93GiB/s, IOS/call=32/32

and again a lot of TSC overhead (> 10%), plus overhead from your extra
allocations (8%). If we just run a single flash drive, it does 3.25M
IOPS at 100% CPU with normal polling, and 2.0M IOPS at 50% CPU usage
with hybrid polling.

While I do suspect there are cases where hybrid polling will be more
efficient, I'm not sure there are many of them. And you're most likely
better off just doing IRQ driven IO at that point? Particularly with
the fairly substantial overhead of maintaining the data you need, and
of querying the time.

--
Jens Axboe
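
For reference, below is a minimal sketch of how an application might opt
into the mode the cover letter describes, assuming kernel/liburing headers
from the patch set that define IORING_SETUP_HY_POLL (the #error guard is
only there to make that assumption explicit). Everything else is stock
liburing, and polled IO still requires O_DIRECT against a device whose
driver has poll queues configured.

/*
 * Sketch only: single polled read using the proposed hybrid-poll flag.
 * Assumes IORING_SETUP_HY_POLL comes from the patched uapi headers.
 * Build (with patched headers installed): gcc hy_poll.c -o hy_poll -luring
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <liburing.h>

#ifndef IORING_SETUP_HY_POLL
#error "IORING_SETUP_HY_POLL requires headers from the hybrid poll patch set"
#endif

int main(int argc, char *argv[])
{
	struct io_uring_params p = { 0 };
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
		return 1;
	}

	/* Polled IO needs O_DIRECT and a device with poll queues set up */
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * IOPOLL selects the polled completion path; HY_POLL (from the
	 * patch) asks for the block-first, poll-later hybrid behavior.
	 */
	p.flags = IORING_SETUP_IOPOLL | IORING_SETUP_HY_POLL;
	ret = io_uring_queue_init_params(64, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	if (posix_memalign(&buf, 4096, 4096))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, 4096, 0);
	io_uring_submit(&ring);

	/*
	 * With IOPOLL, waiting for the CQE drives the poll loop; with the
	 * hybrid flag the task is expected to sleep for part of that window
	 * before it starts polling for the completion.
	 */
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (!ret) {
		printf("read returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}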