On Thu, Mar 13, 2025 at 05:36:53PM +0100, Mikulas Patocka wrote:
> On Wed, 12 Mar 2025, Ming Lei wrote:
> 
> > > > It isn't perfect, sometime it may be slower than running on io-wq
> > > > directly.
> > > > 
> > > > But is there any better way for covering everything?
> > > 
> > > Yes - fix the loop queue workers.
> > 
> > What you suggested is threaded aio by submitting IO concurrently from
> > different task context, this way is not the most efficient one, otherwise
> > modern language won't invent async/.await.
> > 
> > In my test VM, by running Mikulas's fio script on loop/nvme by the attached
> > threaded_aio patch:
> > 
> > NOWAIT with MQ 4 : 70K iops(read), 70K iops(write), cpu util: 40%
> > threaded_aio with MQ 4 : 64k iops(read), 64K iops(write), cpu util: 52%
> > in tree loop(SQ) : 58K iops(read), 58K iops(write)
> > 
> > Mikulas, please feel free to run your tests with threaded_aio:
> > 
> > modprobe loop nr_hw_queues=4 threaded_aio=1
> > 
> > by applying the attached the patch over the loop patchset.
> > 
> > The performance gap could be more obvious in fast hardware.
> 
> With "threaded_aio=1":
> 
> Sync io
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3001MiB (3147MB), run=10001-10001msec
> WRITE: bw=300MiB/s (315MB/s), 300MiB/s-300MiB/s (315MB/s-315MB/s), io=3004MiB (3149MB), run=10001-10001msec
> 
> Async io
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=869MiB/s (911MB/s), 869MiB/s-869MiB/s (911MB/s-911MB/s), io=8694MiB (9116MB), run=10002-10002msec
> WRITE: bw=870MiB/s (913MB/s), 870MiB/s-870MiB/s (913MB/s-913MB/s), io=8706MiB (9129MB), run=10002-10002msec

The original numbers for the xfs/loop/xfs performance were 220MiB/s (sync)
and 276MiB/s (async), so this is actually a very big step forward compared
to the existing code.

Yes, it's not quite as fast as the NOWAIT case for pure overwrites -
348MB/s (sync) and 1186MB/s (async) - but we predicted (and expected)
that this would be the case.

However, this is still testing only the static file, pure overwrite case,
so there is never any IO that blocks during submission. When IO does
block (because there are allocating writes in progress), performance in
the NOWAIT case will trend back towards the original levels, because
blocking submission through the single loop queue will still be the
limiting factor for all the IO that needs to block.

IOWs, these results show that to get decent, consistent performance out
of the loop device we need threaded blocking submission, so that users do
not have to care about optimising individual loop device instances for
the layout of their image files. Yes, NOWAIT may then add an incremental
performance improvement on top of that for optimal layout cases, but I'm
still not convinced that it is a generally applicable loop device
optimisation that everyone will always want enabled, given the potential
for 100% NOWAIT submission failure on any given loop device.....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
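
For reference, below is a minimal userspace sketch of the submission
pattern being discussed: try the IO with NOWAIT semantics first, and punt
to a context that is allowed to block only when the fast path reports it
would block. It uses pwritev2() with RWF_NOWAIT (the userspace analogue of
the kernel's IOCB_NOWAIT, needs a reasonably recent kernel and glibc); the
submit()/blocking_submit() helpers are hypothetical names for illustration
only, not loop driver code.

    /*
     * Illustrative userspace sketch only - not the loop driver code.
     * Pattern: NOWAIT first, fall back to a path that may block.
     */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Stand-in for a blocking worker: just issue the write synchronously. */
    static ssize_t blocking_submit(int fd, const struct iovec *iov, off_t off)
    {
            return pwritev2(fd, iov, 1, off, 0);
    }

    static ssize_t submit(int fd, const struct iovec *iov, off_t off)
    {
            ssize_t ret;

            /* Fast path: ask the kernel to fail rather than block. */
            ret = pwritev2(fd, iov, 1, off, RWF_NOWAIT);
            if (ret >= 0 || (errno != EAGAIN && errno != EOPNOTSUPP))
                    return ret;

            /*
             * The write would have blocked (e.g. block allocation needed),
             * or NOWAIT is not supported here: hand it to the blocking path.
             */
            return blocking_submit(fd, iov, off);
    }

    int main(int argc, char **argv)
    {
            static char buf[4096] __attribute__((aligned(4096)));
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
            int fd;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <file>\n", argv[0]);
                    return 1;
            }
            memset(buf, 0xa5, sizeof(buf));

            fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (submit(fd, &iov, 0) < 0)
                    perror("submit");
            close(fd);
            return 0;
    }

The only place this pattern can sleep is the fallback path, so how that
fallback is parallelised (one blocking queue vs per-device workers) is
what determines behaviour once allocating writes are in the mix, which is
the point being argued above.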