On Thu, Mar 13, 2025 at 05:36:53PM +0100, Mikulas Patocka wrote:
> On Wed, 12 Mar 2025, Ming Lei wrote:
> 
> > > > It isn't perfect, sometime it may be slower than running on io-wq
> > > > directly.
> > > > 
> > > > But is there any better way for covering everything?
> > > 
> > > Yes - fix the loop queue workers.
> > 
> > What you suggested is threaded aio by submitting IO concurrently from
> > different task context, this way is not the most efficient one, otherwise
> > modern language won't invent async/.await.
> > 
> > In my test VM, by running Mikulas's fio script on loop/nvme by the attached
> > threaded_aio patch:
> > 
> > NOWAIT with MQ 4 : 70K iops(read), 70K iops(write), cpu util: 40%
> > threaded_aio with MQ 4 : 64k iops(read), 64K iops(write), cpu util: 52%
> > in tree loop(SQ) : 58K iops(read), 58K iops(write)
> > 
> > Mikulas, please feel free to run your tests with threaded_aio:
> > 
> > modprobe loop nr_hw_queues=4 threaded_aio=1
> > 
> > by applying the attached the patch over the loop patchset.
> > 
> > The performance gap could be more obvious in fast hardware.
> 
> With "threaded_aio=1":
> 
> Sync io
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3001MiB (3147MB), run=10001-10001msec
> WRITE: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3004MiB (3149MB), run=10001-10001msec
> 
> Async io
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=869MiB/s (911MB/s), 869MiB/s-869MiB/s (911MB/s-911MB/s), io=8694MiB (9116MB), run=10002-10002msec
> WRITE: bw=870MiB/s (913MB/s), 870MiB/s-870MiB/s (913MB/s-913MB/s), io=8706MiB (9129MB), run=10002-10002msec

The original numbers for the xfs/loop/xfs performance were 220MiB/s (sync)
and 276MiB/s (async), so this is actually a very big step forward compared
to the existing code.

Yes, it's not quite as fast as the NOWAIT case for pure overwrites -
348MB/s (sync) and 1186MB/s (async) - but we predicted (and expected)
that this would be the case.

However, this is still testing only the static file, pure overwrite case,
so there is never any IO that blocks during submission. When IO does
block (because there are allocating writes in progress), performance in
the NOWAIT case will trend back towards the original levels, because
blocking submission through the single loop queue will still be the
limiting factor for all the IO that needs to block.

IOWs, these results show that to get decent, consistent performance out
of the loop device we need threaded blocking submission, so that users do
not have to care about optimising individual loop device instances for
the layout of their image files. Yes, NOWAIT may then add an incremental
performance improvement on top of that for optimal layout cases, but I'm
still not convinced that it is a generally applicable loop device
optimisation that everyone will always want enabled, given the potential
for 100% NOWAIT submission failure on any given loop device.....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
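
For reference, below is a minimal userspace sketch of the submission
pattern being discussed: try the IO with NOWAIT semantics first, and punt
to a context that is allowed to block only when the fast path reports it
would block. It uses pwritev2() with RWF_NOWAIT (the userspace analogue of
the kernel's IOCB_NOWAIT, needs a reasonably recent kernel and glibc); the
submit()/blocking_submit() helpers are hypothetical names for illustration
only, not loop driver code.

    /*
     * Illustrative userspace sketch only - not the loop driver code.
     * Pattern: NOWAIT first, fall back to a path that may block.
     */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Stand-in for a blocking worker: just issue the write synchronously. */
    static ssize_t blocking_submit(int fd, const struct iovec *iov, off_t off)
    {
            return pwritev2(fd, iov, 1, off, 0);
    }

    static ssize_t submit(int fd, const struct iovec *iov, off_t off)
    {
            ssize_t ret;

            /* Fast path: ask the kernel to fail rather than block. */
            ret = pwritev2(fd, iov, 1, off, RWF_NOWAIT);
            if (ret >= 0 || (errno != EAGAIN && errno != EOPNOTSUPP))
                    return ret;

            /*
             * The write would have blocked (e.g. block allocation needed),
             * or NOWAIT is not supported here: hand it to the blocking path.
             */
            return blocking_submit(fd, iov, off);
    }

    int main(int argc, char **argv)
    {
            static char buf[4096] __attribute__((aligned(4096)));
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
            int fd;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <file>\n", argv[0]);
                    return 1;
            }
            memset(buf, 0xa5, sizeof(buf));

            fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (submit(fd, &iov, 0) < 0)
                    perror("submit");
            close(fd);
            return 0;
    }

The only place this pattern can sleep is the fallback path, so how that
fallback is parallelised (one blocking queue vs per-device workers) is
what determines behaviour once allocating writes are in the mix, which is
the point being argued above.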