On Wed, Mar 12, 2025 at 04:27:12PM +0800, Ming Lei wrote:
> On Wed, Mar 12, 2025 at 01:34:02PM +1100, Dave Chinner wrote:
...
> 
> The block layer/storage stack has many optimizations for batch handling.
> If IOs are submitted from many contexts:
> 
> - this batch handling optimization is gone
> 
> - IO is re-ordered from the underlying hardware's viewpoint
> 
> - there is more contention on the FS write lock, because loop has a
>   single backing file.
> 
> That is why a single task context has been used from the beginning of
> loop aio, and it performs pretty well for sequential IO workloads, as I
> showed in the zloop example.
> 
> > > It isn't perfect; sometimes it may be slower than running on io-wq
> > > directly.
> > > 
> > > But is there any better way to cover everything?
> > 
> > Yes - fix the loop queue workers.
> 
> What you suggested is threaded aio, submitting IO concurrently from
> different task contexts. That is not the most efficient way, otherwise
> modern languages wouldn't have invented async/.await.
> 
> In my test VM, running Mikulas's fio script on loop over nvme with the
> attached threaded_aio patch:
> 
> NOWAIT with MQ 4       : 70K iops(read), 70K iops(write), cpu util: 40%
> threaded_aio with MQ 4 : 64K iops(read), 64K iops(write), cpu util: 52%
> in tree loop(SQ)       : 58K iops(read), 58K iops(write)
> 
> Mikulas, please feel free to run your tests with threaded_aio:
> 
> modprobe loop nr_hw_queues=4 threaded_aio=1
> 
> by applying the attached patch on top of the loop patchset.
> 
> The performance gap could be more obvious on fast hardware.

For the normal single-job sequential WRITE workload, on the same test VM,
still with loop over /dev/nvme0n1, and running fio against the loop device
directly:

fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \
    --iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=write

threaded_aio(SQ) : 81K iops(write), cpu util: 20%
in tree loop(SQ) : 100K iops(write), cpu util: 7%

Thanks,
Ming
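
(Editorial illustration, not part of the patch: the batching argument quoted
above can be seen from userspace with the same libaio interface the fio job
uses. The sketch below prepares 16 sequential 4k writes in one task context
and hands the whole batch to the kernel with a single io_submit() call, so
the block layer sees one ordered batch; doing one io_submit() per worker
thread instead is the "many contexts" case being argued against. The
/dev/loop0 target and the 4k/QD16 parameters are simply taken from the fio
command above; running this will overwrite the start of that device. Build
with: gcc -O2 -o batch_submit batch_submit.c -laio)

#define _GNU_SOURCE             /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QD 16                   /* matches --iodepth=16 in the fio run above */
#define BS 4096                 /* matches --bs=4k */

int main(void)
{
	io_context_t ctx = 0;
	struct iocb iocbs[QD], *iocbps[QD];
	struct io_event events[QD];
	void *buf;
	int fd, i, ret;

	/* example target only; writing to it clobbers its first 64KB */
	fd = open("/dev/loop0", O_RDWR | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, 4096, BS))
		return 1;
	memset(buf, 0, BS);

	ret = io_setup(QD, &ctx);
	if (ret < 0) {
		fprintf(stderr, "io_setup: %s\n", strerror(-ret));
		return 1;
	}

	/* prepare a sequential batch in one task context ... */
	for (i = 0; i < QD; i++) {
		io_prep_pwrite(&iocbs[i], fd, buf, BS, (long long)i * BS);
		iocbps[i] = &iocbs[i];
	}

	/*
	 * ... and submit it with a single syscall: one submission context,
	 * offsets still in order when they reach the device.  Spreading
	 * these iocbs across worker threads, one io_submit() each, is the
	 * "many contexts" case that loses the batching and re-orders IO.
	 */
	ret = io_submit(ctx, QD, iocbps);
	if (ret != QD) {
		fprintf(stderr, "io_submit: %d\n", ret);
		return 1;
	}
	ret = io_getevents(ctx, QD, QD, events, NULL);
	if (ret != QD) {
		fprintf(stderr, "io_getevents: %d\n", ret);
		return 1;
	}

	io_destroy(ctx);
	close(fd);
	free(buf);
	return 0;
}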