Re: [PATCH] the dm-loop target

On Tue, Mar 11, 2025 at 12:27:23PM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote:
> > 
> > 
> > On Sun, 9 Mar 2025, Ming Lei wrote:
> > 
> > > On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > > > > I didn't say you were. I said the concept that dm-loop is based on
> > > > > is fundamentally flawed and that your benchmark setup does not
> > > > > reflect real world usage of loop devices.
> > > > 
> > > > > Where are the bug reports about the loop device being slow and the
> > > > > analysis that indicates that it is unfixable?
> > > > 
> > > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG 
> > > > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup 
> > > > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
> > > > 
> > > > synchronous I/O:
> > > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > > > raw block device:
> > > >    READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> > > >   WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> > > > ext4/loop/ext4:
> > > >    READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> > > >   WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> > > > xfs/loop/xfs:
> > > >    READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> > > >   WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> > > > ext4/dm-loop/ext4:
> > > >    READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> > > >   WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> > > > xfs/dm-loop/xfs:
> > > >    READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> > > >   WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
> > > > 
> > > > asynchronous I/O:
> > > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > > > raw block device:
> > > >    READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > > >   WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > > > ext4/loop/ext4:
> > > >    READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> > > >   WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> > > > xfs/loop/xfs:
> > > >    READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> > > >   WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> > > > ext4/dm-loop/ext4:
> > > >    READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > > >   WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > > > xfs/dm-loop/xfs:
> > > >    READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > > >   WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > > 
> > > Hi Mikulas,
> > > 
> > > Please try the following patchset:
> > > 
> > > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@xxxxxxxxxx/
> > > 
> > > which tries to handle IO directly in the current context via NOWAIT, and
> > > adds MQ support for loop since your test uses 12 IO jobs. With this
> > > change, I observe similar perf data for the raw block device and loop/xfs
> > > over mq-virtio-scsi & nvme in my test VM.
> 
> I'm not sure RWF_NOWAIT is a workable solution.
> 
> Why?

It is just a sane implementation of Mikulas's static-mapping approach:
there is no need to bounce the IO to a workqueue if the mapping is
immutable (or effectively so).

It also matches io_uring's FS read/write implementation:

- try to submit IO with NOWAIT first

- then fallback to io-wq in case of -EAGAIN

It isn't perfect; sometimes it may be slower than submitting via io-wq
directly.

But is there any better way to cover every case?

I guess not, because the FS can't tell us in advance whether an IO can be
submitted successfully via NOWAIT, and we can't know whether it will
succeed without trying. That is basically how the interface is designed.
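
As a minimal sketch of that pattern (hypothetical helper names and a
hypothetical work field in struct loop_cmd, not the actual patch): try
the backing-file IO inline with IOCB_NOWAIT, and only bounce the command
to the workqueue when the FS cannot make progress without blocking:

#include <linux/blk-mq.h>
#include <linux/workqueue.h>

static blk_status_t lo_submit_nowait_first(struct loop_device *lo,
					   struct loop_cmd *cmd)
{
	/* hypothetical: issue the kiocb with IOCB_NOWAIT set,
	 * returning 0 if queued or a negative errno on failure */
	int ret = lo_rw_aio_nowait(lo, cmd);

	if (ret == 0)
		return BLK_STS_OK;
	if (ret != -EAGAIN && ret != -EOPNOTSUPP)
		return errno_to_blk_status(ret);

	/* the FS would have to block: fall back to the workqueue,
	 * as the existing path does for every IO today */
	INIT_WORK(&cmd->work, loop_fallback_workfn);	/* hypothetical */
	queue_work(lo->workqueue, &cmd->work);
	return BLK_STS_OK;
}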

> 
> IO submission is queued to a different thread context to avoid a
> potential deadlock. That is, we are operating here in the
> writeback context of another filesystem, and so we cannot do things
> like depend on memory allocation making forwards progress for IO
> submission.  RWF_NOWAIT is not a guarantee that memory allocation
> will not occur in the IO submission path - it is implemented as best
> effort non-blocking behaviour.

Yes, that is why BLK_MQ_F_BLOCKING is added.
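
For reference, BLK_MQ_F_BLOCKING is the tag-set flag that tells blk-mq
the driver's ->queue_rq() may sleep, so requests are only dispatched
from a context that is allowed to block. An illustrative tag-set setup
(not the actual loop patch):

	static int loop_init_tag_set_sketch(struct loop_device *lo)
	{
		lo->tag_set.ops = &loop_mq_ops;
		lo->tag_set.nr_hw_queues = 1;
		lo->tag_set.queue_depth = 128;
		lo->tag_set.numa_node = NUMA_NO_NODE;
		lo->tag_set.cmd_size = sizeof(struct loop_cmd);
		/* ->queue_rq() may block, e.g. on FS locks */
		lo->tag_set.flags = BLK_MQ_F_BLOCKING;

		return blk_mq_alloc_tag_set(&lo->tag_set);
	}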

> 
> Further, if we have stacked loop devices (e.g.
> xfs-loop-ext4-loop-btrfs-loop-xfs) we will be stacking
> RWF_NOWAIT IO submission contexts through multiple filesystems. This
> is not a filesystem IO path I want to support - being in the middle
> of such a stack creates all sorts of subtle constraints on behaviour
> that otherwise wouldn't exist. We actually do this sort of multi-fs
> stacking in fstests, so it's not a made up scenario.
> 
> I'm also concerned that RWF_NOWAIT submission is not an optimisation
> at all for the common sparse/COW image file case, because in this
> case RWF_NOWAIT failing with EAGAIN is at least as common as it
> succeeding.
> 
> i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very
> frequently, so all that time spent doing IO submission is wasted
> time.

Right.

I saw that when I wrote ublk/zoned, where every write needs to
allocate space. It is worked around by preallocating space for one fixed
range or for the whole zone.

> 
> Further, because allocation on write requires an exclusive lock and
> it is held for some time, this will affect read performance from the
> backing device as well. i.e. block mapping during a read while a
> write is doing allocation will also return EAGAIN for RWF_NOWAIT.

But that can't be avoided even without NOWAIT: the read is blocked
anyway while a blocking write is in progress.

> This will push the read off to the background worker thread to be
> serviced and so that will go much slower than a RWF_NOWAIT read that
> hits the backing file between writes doing allocation. i.e. read
> latency is going to become much, much more variable.
> 
> Hence I suspect RWF_NOWAIT is simply hiding the underlying issue
> by providing this benchmark with a "pure overwrite" fast path that
> avoids the overhead of queueing work through loop_queue_work()....
> 
> Can you run these same loop dev tests using a sparse image file and
> a sparse fio test file so that the fio benchmark measures the impact
> of loop device block allocation on the test? I suspect the results
> with the RWF_NOWAIT patch will be somewhat different to the fully
> allocated case...

Yes, it will be slower. An io_uring FS IO application is in
the same situation, and an application usually has no way of knowing
whether RWF_NOWAIT will succeed.

However, metadata IO is usually much less frequent than normal IO, so most
of the time it is a win to try NOWAIT first.

> 
> > 
> > Yes - with these patches, it is much better.
> > 
> > > 1) try single queue first by `modprobe loop`
> > 
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > xfs/loop/xfs
> >    READ: bw=302MiB/s (317MB/s), 302MiB/s-302MiB/s (317MB/s-317MB/s), io=3024MiB (3170MB), run=10001-10001msec
> >   WRITE: bw=303MiB/s (317MB/s), 303MiB/s-303MiB/s (317MB/s-317MB/s), io=3026MiB (3173MB), run=10001-10001msec
> > 
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > xfs/loop/xfs
> >    READ: bw=1055MiB/s (1106MB/s), 1055MiB/s-1055MiB/s (1106MB/s-1106MB/s), io=10.3GiB (11.1GB), run=10001-10001msec
> >   WRITE: bw=1056MiB/s (1107MB/s), 1056MiB/s-1056MiB/s (1107MB/s-1107MB/s), io=10.3GiB (11.1GB), run=10001-10001msec
> 
> Yup, this sort of difference in performance simply from bypassing
> loop_queue_work() implies that the single-threaded loop device queue
> implementation is the problem and needs to be fixed.
> 
> loop_queue_work()
> {
> 	....
> 	spin_lock_irq(&lo->lo_work_lock);
> 	....
> 
>         } else {
>                 work = &lo->rootcg_work;
>                 cmd_list = &lo->rootcg_cmd_list;
>         }
> 	list_add_tail(&cmd->list_entry, cmd_list);
> 	queue_work(lo->workqueue, work);
> 	spin_unlock_irq(&lo->lo_work_lock);
> }
> 
> Not only does every IO that is queued take this queue lock, there
> is only one work instance for the loop device. Therefore there is
> only one kworker process per control group that does IO submission
> for the loop device. And that kworker thread also takes the work
> lock to do dequeue as well.
> 
> That serialised queue with a single IO dispatcher thread looks to be
> the performance bottleneck to me. We could get rid of the lock by
> using a llist for this multi-producer/single consumer cmd list
> pattern, though I suspect we can get rid of the list entirely...
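
For reference, that lockless MPSC pattern looks roughly like the
following sketch, assuming a hypothetical llist_node in struct loop_cmd,
a hypothetical cmd_llist in struct loop_device, and a hypothetical
do_loop_cmd() submit helper:

#include <linux/llist.h>

/* producers (->queue_rq, any context): lockless push, no spinlock */
static void loop_cmd_push_sketch(struct loop_device *lo,
				 struct loop_cmd *cmd)
{
	llist_add(&cmd->llnode, &lo->cmd_llist);
}

/* single consumer (the dispatcher work item): grab the whole batch,
 * restore FIFO order, then submit each command */
static void loop_dispatch_sketch(struct loop_device *lo)
{
	struct llist_node *node = llist_del_all(&lo->cmd_llist);
	struct loop_cmd *cmd, *tmp;

	node = llist_reverse_order(node);	/* llist pops in LIFO order */
	llist_for_each_entry_safe(cmd, tmp, node, llnode)
		do_loop_cmd(cmd);
}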
> 
> i.e. we have a work queue that can run a
> thousand concurrent works for this loop device, but the request
> queue is depth limited to 128 requests. Hence we can have a full
> set of requests in flight and not run out of submission worker
> concurrency. There's no need to isolate IO from different cgroups in
> this situation - they will not get blocked behind IO submission
> from a different cgroup that is throttled...
> 
> i.e. the cmd->list_entry list_head could be replaced with a struct
> work_struct and that whole cmd list management and cgroup scheduling
> thing could be replaced with a single call to
> queue_work(cmd->io_work). i.e. the single point that all IO

Then there will be a lot of FS write contention, :-)

> submission is directed through goes away completely.
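
A sketch of that per-command direct queueing, assuming struct loop_cmd
grows a work_struct (hypothetical field and helper names):

#include <linux/workqueue.h>

static void loop_cmd_workfn(struct work_struct *work)
{
	struct loop_cmd *cmd = container_of(work, struct loop_cmd, io_work);

	do_loop_cmd(cmd);	/* hypothetical: issue the backing-file IO */
}

static blk_status_t loop_queue_rq_direct(struct loop_device *lo,
					 struct loop_cmd *cmd)
{
	/* no shared cmd list, no lock, no per-cgroup dispatcher:
	 * each command is its own work item */
	INIT_WORK(&cmd->io_work, loop_cmd_workfn);
	queue_work(lo->workqueue, &cmd->io_work);
	return BLK_STS_OK;
}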

It has been shown many times that AIO submitted from a single context (or
from far fewer contexts) is much more efficient than running IO
concurrently from many contexts, especially for sequential IO.

Please see the recent example of zloop vs. ublk/zoned:

https://lore.kernel.org/linux-block/d5e59531-c19b-4332-8f47-b380ab9678be@xxxxxxxxxx/

When zloop uses a single dispatcher, just like the in-tree loop driver,
sequential WRITE performs much better than the initial version of
ublk/zoned, which handled every IO in its own io-wq context (one NOWAIT
attempt returning -EAGAIN, then one fallback to io-wq). I also tried
submitting the FS WRITE via io-wq directly, which is essentially what you
suggested; the improvement was small, and still much worse than zloop's
single dispatcher.

Later, when I switched to preallocating space for each zone (or for one
fixed range), each write could be submitted successfully with NOWAIT, and
the sequential write performance improved a lot:

https://lore.kernel.org/linux-block/Z6QrceGGAJl_X_BM@fedora/
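
In userspace terms the trick is simply: preallocate the range up front so
that later RWF_NOWAIT writes never need block allocation. An illustrative
sketch (not the actual ublk/zoned code, which submits via io_uring):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>

/* allocate the whole zone/range once; mode 0 maps the blocks */
static int prealloc_range(int fd, off_t off, off_t len)
{
	return fallocate(fd, 0, off, len);
}

static ssize_t write_nowait(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	/* issued inline when blocks are already mapped; fails with
	 * EAGAIN only if the FS would have to block */
	return pwritev2(fd, &iov, 1, off, RWF_NOWAIT);
}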



Thanks,
Ming




