On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote:
> 
> 
> On Sun, 9 Mar 2025, Ming Lei wrote:
> 
> > On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > > > I didn't say you were. I said the concept that dm-loop is based on
> > > > is fundamentally flawed and that your benchmark setup does not
> > > > reflect real world usage of loop devices.
> > > 
> > > > Where are the bug reports about the loop device being slow and the
> > > > analysis that indicates that it is unfixable?
> > > 
> > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> > > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> > > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
> > > 
> > > synchronous I/O:
> > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > > raw block device:
> > > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> > > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> > > ext4/loop/ext4:
> > > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> > > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> > > xfs/loop/xfs:
> > > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> > > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> > > ext4/dm-loop/ext4:
> > > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> > > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> > > xfs/dm-loop/xfs:
> > > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> > > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
> > > 
> > > asynchronous I/O:
> > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > > raw block device:
> > > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > > ext4/loop/ext4:
> > > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> > > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> > > xfs/loop/xfs:
> > > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> > > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> > > ext4/dm-loop/ext4:
> > > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > > xfs/dm-loop/xfs:
> > > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > 
> > Hi Mikulas,
> > 
> > Please try the following patchset:
> > 
> > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@xxxxxxxxxx/
> > 
> > which tries to handle IO in current context directly via NOWAIT, and
> > supports MQ for loop since 12 io jobs are applied in your test. With this
> > change, I can observe similar perf data on raw block device and loop/xfs
> > over mq-virtio-scsi & nvme in my test VM.

I'm not sure RWF_NOWAIT is a workable solution. Why? IO submission is queued
to a different thread context to avoid a potential deadlock. That is, we are
operating here in the writeback context of another filesystem, and so we
cannot do things like depend on memory allocation making forwards progress
for IO submission. RWF_NOWAIT is not a guarantee that memory allocation will
not occur in the IO submission path - it is implemented as best-effort
non-blocking behaviour.

Further, if we have stacked loop devices (e.g. xfs-loop-ext4-loop-btrfs-loop-xfs)
we will be stacking RWF_NOWAIT IO submission contexts through multiple
filesystems. This is not a filesystem IO path I want to support - being in
the middle of such a stack creates all sorts of subtle constraints on
behaviour that otherwise wouldn't exist. We actually do this sort of multi-fs
stacking in fstests, so it's not a made-up scenario.

I'm also concerned that RWF_NOWAIT submission is not an optimisation at all
for the common sparse/COW image file case, because in that case RWF_NOWAIT
failing with EAGAIN is just as common as (if not more common than) it
succeeding. i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very
frequently, so all that time spent doing IO submission is wasted time.

Further, because allocation on write requires an exclusive lock that is held
for some time, this will affect read performance from the backing device as
well. i.e. block mapping during a read while a write is doing allocation will
also return EAGAIN for RWF_NOWAIT. This will push the read off to the
background worker thread to be serviced, and so it will go much slower than a
RWF_NOWAIT read that hits the backing file between writes doing allocation.
i.e. read latency is going to become much, much more variable.

Hence I suspect RWF_NOWAIT is simply hiding the underlying issue by providing
this benchmark with a "pure overwrite" fast path that avoids the overhead of
queueing work through loop_queue_work()....

Can you run these same loop dev tests using a sparse image file and a sparse
fio test file, so that the fio benchmark measures the impact of loop device
block allocation on the test? I suspect the results with the RWF_NOWAIT patch
will be somewhat different to the fully allocated case...
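As an illustration of the sparse-file point above, here is a minimal
userspace sketch (not loop driver code; the filename and sizes are made up)
that can be run on the filesystem that will host the backing file. On XFS, an
RWF_NOWAIT direct-IO write into a hole of a sparse file is expected to fail
with EAGAIN because block allocation would have to block, while a NOWAIT
overwrite of an already-allocated block typically succeeds:

/* nowait-sparse.c: illustrative only, behaviour depends on the filesystem */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("sparse-image.img", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* sparse file: size is set but no blocks are allocated yet */
	if (ftruncate(fd, (off_t)1 << 30) < 0) {
		perror("ftruncate");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0xa5, 4096);

	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	/* NOWAIT write into a hole: needs allocation, so expect -1/EAGAIN */
	ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_NOWAIT);
	printf("NOWAIT write into hole: %zd (%s)\n", ret,
	       ret < 0 ? strerror(errno) : "ok");

	/* allocate the block with a blocking write, then retry NOWAIT */
	if (pwritev2(fd, &iov, 1, 0, 0) < 0)
		perror("plain write");
	ret = pwritev2(fd, &iov, 1, 0, RWF_NOWAIT);
	printf("NOWAIT overwrite of allocated block: %zd (%s)\n", ret,
	       ret < 0 ? strerror(errno) : "ok");

	free(buf);
	close(fd);
	return 0;
}

Building this with gcc and running it against a file on the candidate backing
filesystem makes the EAGAIN behaviour easy to observe directly, independent of
the loop driver.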
> 
> Yes - with these patches, it is much better.
> 
> > 1) try single queue first by `modprobe loop`
> 
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=302MiB/s (317MB/s), 302MiB/s-302MiB/s (317MB/s-317MB/s), io=3024MiB (3170MB), run=10001-10001msec
> WRITE: bw=303MiB/s (317MB/s), 303MiB/s-303MiB/s (317MB/s-317MB/s), io=3026MiB (3173MB), run=10001-10001msec
> 
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=1055MiB/s (1106MB/s), 1055MiB/s-1055MiB/s (1106MB/s-1106MB/s), io=10.3GiB (11.1GB), run=10001-10001msec
> WRITE: bw=1056MiB/s (1107MB/s), 1056MiB/s-1056MiB/s (1107MB/s-1107MB/s), io=10.3GiB (11.1GB), run=10001-10001msec

Yup, this sort of difference in performance simply from bypassing
loop_queue_work() implies that the single-threaded loop device queue
implementation is the problem, and that is what needs to be fixed.

loop_queue_work()
{
	....
	spin_lock_irq(&lo->lo_work_lock);
	....
	} else {
		work = &lo->rootcg_work;
		cmd_list = &lo->rootcg_cmd_list;
	}
	list_add_tail(&cmd->list_entry, cmd_list);
	queue_work(lo->workqueue, work);
	spin_unlock_irq(&lo->lo_work_lock);
}

Not only does every queued IO take this queue lock, there is also only one
work instance for the loop device. Therefore there is only one kworker
process per control group doing IO submission for the loop device, and that
kworker thread takes the work lock to dequeue commands as well. That
serialised queue with a single IO dispatcher thread looks to be the
performance bottleneck to me.

We could get rid of the lock by using a llist for this
multi-producer/single-consumer cmd list pattern, though I suspect we can get
rid of the list entirely... i.e. we have a work queue that can run a thousand
concurrent works for this loop device, but the request queue is depth-limited
to 128 requests. Hence we can have a full set of requests in flight and not
run out of submission worker concurrency. There's no need to isolate IO from
different cgroups in this situation - they will not get blocked behind IO
submission from a different cgroup that is throttled...

i.e. the cmd->list_entry list_head could be replaced with a struct
work_struct, and that whole cmd list management and cgroup scheduling thing
could be replaced with a single call to queue_work(lo->workqueue,
&cmd->io_work). i.e. the single point that all IO submission is directed
through goes away completely.

-Dave
-- 
Dave Chinner
david@xxxxxxxxxxxxx
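For illustration only, a rough sketch of the per-command work item idea
described above. This is not actual drivers/block/loop.c code and not a real
patch: loop_handle_cmd(), lo->workqueue and the lock/list fields mentioned in
the comments exist in the current driver, while the io_work field and
loop_cmd_workfn() are hypothetical names.

#include <linux/workqueue.h>

struct loop_cmd {
	struct work_struct io_work;	/* would replace cmd->list_entry */
	/* ... rest of the existing struct loop_cmd ... */
};

static void loop_cmd_workfn(struct work_struct *work)
{
	struct loop_cmd *cmd = container_of(work, struct loop_cmd, io_work);

	/* submit the backing file IO for this one command */
	loop_handle_cmd(cmd);
}

static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
{
	/*
	 * No lo->lo_work_lock, no shared cmd list, no per-cgroup work
	 * instances: the request queue is depth-limited to 128 requests,
	 * well below the workqueue's max_active, so each in-flight command
	 * can get its own submission context.  (INIT_WORK() could equally
	 * be done once when the command is set up rather than per IO.)
	 */
	INIT_WORK(&cmd->io_work, loop_cmd_workfn);
	queue_work(lo->workqueue, &cmd->io_work);
}

With this shape, submission concurrency is bounded by the block layer's queue
depth rather than by a single locked dispatcher thread, which is the point of
the suggestion above.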