On Mon, Mar 10, 2025 at 12:18:51PM +0100, Mikulas Patocka wrote:
> 
> 
> On Sun, 9 Mar 2025, Ming Lei wrote:
> 
> > On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > > > I didn't say you were. I said the concept that dm-loop is based on
> > > > is fundamentally flawed and that your benchmark setup does not
> > > > reflect real world usage of loop devices.
> > > 
> > > > Where are the bug reports about the loop device being slow and the
> > > > analysis that indicates that it is unfixable?
> > > 
> > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> > > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> > > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
> > > 
> > > synchronous I/O:
> > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > > raw block device:
> > > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> > > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> > > ext4/loop/ext4:
> > > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> > > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> > > xfs/loop/xfs:
> > > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> > > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> > > ext4/dm-loop/ext4:
> > > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> > > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> > > xfs/dm-loop/xfs:
> > > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> > > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
> > > 
> > > asynchronous I/O:
> > > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > > raw block device:
> > > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > > ext4/loop/ext4:
> > > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> > > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> > > xfs/loop/xfs:
> > > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> > > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> > > ext4/dm-loop/ext4:
> > > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > > xfs/dm-loop/xfs:
> > > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > 
> > Hi Mikulas,
> > 
> > Please try the following patchset:
> > 
> > https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@xxxxxxxxxx/
> > 
> > which tries to handle IO in current context directly via NOWAIT, and
> > supports MQ for loop since 12 io jobs are applied in your test. With this
> > change, I can observe similar perf data on raw block device and loop/xfs
> > over mq-virtio-scsi & nvme in my test VM.

I'm not sure RWF_NOWAIT is a workable solution. Why? IO submission is queued
to a different thread context to avoid a potential deadlock. That is, we are
operating here in the writeback context of another filesystem, and so we
cannot do things like depend on memory allocation making forwards progress
for IO submission. RWF_NOWAIT is not a guarantee that memory allocation will
not occur in the IO submission path - it is implemented as best-effort
non-blocking behaviour.

Further, if we have stacked loop devices (e.g. xfs-loop-ext4-loop-btrfs-loop-xfs)
we will be stacking RWF_NOWAIT IO submission contexts through multiple
filesystems. This is not a filesystem IO path I want to support - being in
the middle of such a stack creates all sorts of subtle constraints on
behaviour that otherwise wouldn't exist. We actually do this sort of multi-fs
stacking in fstests, so it's not a made-up scenario.

I'm also concerned that RWF_NOWAIT submission is not an optimisation at all
for the common sparse/COW image file case, because in that case RWF_NOWAIT
failing with EAGAIN is just as common as (if not more common than) it
succeeding. i.e. in this case, RWF_NOWAIT writes will fail with -EAGAIN very
frequently, so all that time spent doing IO submission is wasted time.

Further, because allocation on write requires an exclusive lock that is held
for some time, this will affect read performance from the backing device as
well. i.e. block mapping during a read while a write is doing allocation will
also return EAGAIN for RWF_NOWAIT. This will push the read off to the
background worker thread to be serviced, and so it will go much slower than a
RWF_NOWAIT read that hits the backing file between writes doing allocation.
i.e. read latency is going to become much, much more variable.

Hence I suspect RWF_NOWAIT is simply hiding the underlying issue by providing
this benchmark with a "pure overwrite" fast path that avoids the overhead of
queueing work through loop_queue_work()....

Can you run these same loop dev tests using a sparse image file and a sparse
fio test file, so that the fio benchmark measures the impact of loop device
block allocation on the test? I suspect the results with the RWF_NOWAIT patch
will be somewhat different to the fully allocated case...
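As an illustration of the sparse-file point above, here is a minimal
userspace sketch (not loop driver code; the filename and sizes are made up)
that can be run on the filesystem that will host the backing file. On XFS, an
RWF_NOWAIT direct-IO write into a hole of a sparse file is expected to fail
with EAGAIN because block allocation would have to block, while a NOWAIT
overwrite of an already-allocated block typically succeeds:

/* nowait-sparse.c: illustrative only, behaviour depends on the filesystem */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("sparse-image.img", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* sparse file: size is set but no blocks are allocated yet */
	if (ftruncate(fd, (off_t)1 << 30) < 0) {
		perror("ftruncate");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0xa5, 4096);

	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	/* NOWAIT write into a hole: needs allocation, so expect -1/EAGAIN */
	ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_NOWAIT);
	printf("NOWAIT write into hole: %zd (%s)\n", ret,
	       ret < 0 ? strerror(errno) : "ok");

	/* allocate the block with a blocking write, then retry NOWAIT */
	if (pwritev2(fd, &iov, 1, 0, 0) < 0)
		perror("plain write");
	ret = pwritev2(fd, &iov, 1, 0, RWF_NOWAIT);
	printf("NOWAIT overwrite of allocated block: %zd (%s)\n", ret,
	       ret < 0 ? strerror(errno) : "ok");

	free(buf);
	close(fd);
	return 0;
}

Building this with gcc and running it against a file on the candidate backing
filesystem makes the EAGAIN behaviour easy to observe directly, independent of
the loop driver.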
> 
> Yes - with these patches, it is much better.
> 
> > 1) try single queue first by `modprobe loop`
> 
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=302MiB/s (317MB/s), 302MiB/s-302MiB/s (317MB/s-317MB/s), io=3024MiB (3170MB), run=10001-10001msec
> WRITE: bw=303MiB/s (317MB/s), 303MiB/s-303MiB/s (317MB/s-317MB/s), io=3026MiB (3173MB), run=10001-10001msec
> 
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> xfs/loop/xfs
> READ: bw=1055MiB/s (1106MB/s), 1055MiB/s-1055MiB/s (1106MB/s-1106MB/s), io=10.3GiB (11.1GB), run=10001-10001msec
> WRITE: bw=1056MiB/s (1107MB/s), 1056MiB/s-1056MiB/s (1107MB/s-1107MB/s), io=10.3GiB (11.1GB), run=10001-10001msec

Yup, this sort of difference in performance simply from bypassing
loop_queue_work() implies that the single-threaded loop device queue
implementation is the problem, and that is what needs to be fixed.

loop_queue_work()
{
	....
	spin_lock_irq(&lo->lo_work_lock);
	....
	} else {
		work = &lo->rootcg_work;
		cmd_list = &lo->rootcg_cmd_list;
	}
	list_add_tail(&cmd->list_entry, cmd_list);
	queue_work(lo->workqueue, work);
	spin_unlock_irq(&lo->lo_work_lock);
}

Not only does every queued IO take this queue lock, there is also only one
work instance for the loop device. Therefore there is only one kworker
process per control group doing IO submission for the loop device, and that
kworker thread takes the work lock to dequeue commands as well. That
serialised queue with a single IO dispatcher thread looks to be the
performance bottleneck to me.

We could get rid of the lock by using a llist for this
multi-producer/single-consumer cmd list pattern, though I suspect we can get
rid of the list entirely... i.e. we have a work queue that can run a thousand
concurrent works for this loop device, but the request queue is depth-limited
to 128 requests. Hence we can have a full set of requests in flight and not
run out of submission worker concurrency. There's no need to isolate IO from
different cgroups in this situation - they will not get blocked behind IO
submission from a different cgroup that is throttled...

i.e. the cmd->list_entry list_head could be replaced with a struct
work_struct, and that whole cmd list management and cgroup scheduling thing
could be replaced with a single call to queue_work(lo->workqueue,
&cmd->io_work). i.e. the single point that all IO submission is directed
through goes away completely.

-Dave
-- 
Dave Chinner
david@xxxxxxxxxxxxx
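For illustration only, a rough sketch of the per-command work item idea
described above. This is not actual drivers/block/loop.c code and not a real
patch: loop_handle_cmd(), lo->workqueue and the lock/list fields mentioned in
the comments exist in the current driver, while the io_work field and
loop_cmd_workfn() are hypothetical names.

#include <linux/workqueue.h>

struct loop_cmd {
	struct work_struct io_work;	/* would replace cmd->list_entry */
	/* ... rest of the existing struct loop_cmd ... */
};

static void loop_cmd_workfn(struct work_struct *work)
{
	struct loop_cmd *cmd = container_of(work, struct loop_cmd, io_work);

	/* submit the backing file IO for this one command */
	loop_handle_cmd(cmd);
}

static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
{
	/*
	 * No lo->lo_work_lock, no shared cmd list, no per-cgroup work
	 * instances: the request queue is depth-limited to 128 requests,
	 * well below the workqueue's max_active, so each in-flight command
	 * can get its own submission context.  (INIT_WORK() could equally
	 * be done once when the command is set up rather than per IO.)
	 */
	INIT_WORK(&cmd->io_work, loop_cmd_workfn);
	queue_work(lo->workqueue, &cmd->io_work);
}

With this shape, submission concurrency is bounded by the block layer's queue
depth rather than by a single locked dispatcher thread, which is the point of
the suggestion above.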