Re: [PATCH 0/2] New zoned loop block device driver

Ming Lei <ming.lei@xxxxxxxxxx> · Thu, 6 Feb 2025 11:24:33 +0800

On Wed, Feb 05, 2025 at 03:07:51PM +0900, Damien Le Moal wrote:
> On 2/5/25 12:43 PM, Ming Lei wrote:
> >>> Can you share how you create rublk/zoned and zloop and the underlying
> >>> device info? In particular, queue depth and nr_queues (for both
> >>> rublk/zloop and the underlying disk) play a big role.
> >>
> >> rublk:
> >>
> >> cargo run -r -- add zoned --size 524288 --zone-size 256 --conv-zones 0 \
> >> 		--logical-block-size 4096 --queue ${nrq} --depth 128 \
> >> 		--path /mnt/zloop/0
> >>
> >> zloop:
> >>
> >> echo "add conv_zones=0,capacity_mb=524288,zone_size_mb=256,\
> >> base_dir=/mnt/zloop,nr_queues=${nrq},queue_depth=128" > /dev/zloop-control
> > 
> > Zones are actually stateful, so it may be better to use separate backing
> > directories/files.
> 
> I do not understand what you are saying... I reformat the backing FS and
> recreate the same /mnt/zloop/0 directory for every test, to be sure I am not
> seeing an artifact from the FS.

I meant that the same backing files are shared by the two devices.

But I guess it may not be a big deal.

> 
> >> The backing storage is using XFS on a PCIe Gen4 4TB M.2 SSD (my Xeon machine is
> >> PCIe Gen3 though). This drive has a large enough max_qid to provide one IO queue
> >> pair per CPU for up to 32 CPUs (16-cores / 32-threads).
> > 
> > I just set up XFS over NVMe on real hardware and still can't reproduce the
> > big gap in your test results. The kernel is v6.13 with zloop patch v2.
> > 
> > `8 queues` should only make a difference for the "QD=32, 4K rnd wr, 8 jobs" test.
> > For the other single-job tests, a single queue is supposed to perform the same as 8 queues.
> > 
> > The big gap is mainly in the 'QD=32, 128K seq wr, 1 job' test; maybe your local
> > change improves zloop's merging? In my test:
> > 
> > 	- ublk/zoned : 912 MiB/s
> > 	- zloop(v2) : 960 MiB/s.
> > 
> > BTW, my test is over btrfs, and uses the following test script:
> > 
> >  fio --size=32G --time_based --bsrange=128K-128K --runtime=40 --numjobs=1 \
> >  	--ioengine=libaio --iodepth=32 --directory=./ublk --group_reporting=1 --direct=1 \
> > 	--fsync=0 --name=f1 --stonewall --rw=write
> 
> If you add an FS on top of the emulated zoned device, you are testing the FS
> perf as much as the backing dev. I focused on the backing dev, so I ran fio
> directly on top of the emulated drive. E.g.:
> 
> fio --name=test --filename=${dev} --rw=randwrite \
>                 --ioengine=libaio --iodepth=32 --direct=1 --bs=4096 \
>                 --zonemode=zbd --numjobs=8 --group_reporting --norandommap \
>                 --cpus_allowed=0-7 --cpus_allowed_policy=split \
>                 --runtime=${runtime} --ramp_time=5 --time_based
> 
> (you must use libaio here)

Thanks for sharing the '--zonemode=zbd' option.

I can reproduce the perf issue with the above script. The cause is related
to io_uring emulation and zone space pre-allocation.

When an FS write needs to allocate space, .write_iter() returns -EAGAIN
for each io_uring write, so the write always falls back to io-wq, which causes
very bad sequential write perf.

It can be fixed[1] simply by pre-allocating space before writing to the
beginning of each seq-zone.
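
To illustrate the idea only (this is not the code from [1]), here is a
minimal Python sketch. It assumes one backing file per sequential zone;
the names preallocate_zone() and zone_write() are hypothetical. Blocks for
the whole zone are reserved when the write lands at the zone start, so
later writes to that zone do not need to allocate space:

```python
import os

ZONE_SIZE = 256 * 1024 * 1024  # 256 MiB, matching the zone size used above


def preallocate_zone(fd: int, zone_size: int = ZONE_SIZE) -> None:
    """Reserve blocks for the whole zone file (hypothetical helper)."""
    # posix_fallocate() guarantees the range [0, zone_size) is allocated.
    os.posix_fallocate(fd, 0, zone_size)


def zone_write(fd: int, offset_in_zone: int, data: bytes,
               zone_size: int = ZONE_SIZE) -> int:
    """Write into a zone file, pre-allocating on the first write."""
    if offset_in_zone == 0:
        # A write at the zone start means the zone is empty:
        # allocate its space up front, before issuing the write.
        preallocate_zone(fd, zone_size)
    return os.pwrite(fd, data, offset_in_zone)
```

Note that posix_fallocate() also extends the file size to zone_size. An
implementation that derives the write pointer from the file size would
need fallocate(2) with FALLOC_FL_KEEP_SIZE instead.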

Here are the results of my test[2] over real NVMe/XFS:

+ ./zfio /dev/zloop0 write 1 40
    write /dev/zloop0: jobs   1 io_depth   32 time   40sec
	BS   4k: IOPS   171383 BW   685535KiB/s fio_cpu_util(25% 38%)
	BS 128k: IOPS     7669 BW   981846KiB/s fio_cpu_util( 5% 11%)
+ ./zfio /dev/ublkb0 write 1 40
    write /dev/ublkb0: jobs   1 io_depth   32 time   40sec
	BS   4k: IOPS   179861 BW   719448KiB/s fio_cpu_util(29% 42%)
	BS 128k: IOPS     7239 BW   926786KiB/s fio_cpu_util( 6%  9%)

+ ./zfio /dev/zloop0 randwrite 1 40
randwrite /dev/zloop0: jobs   1 io_depth   32 time   40sec
	BS   4k: IOPS     8909 BW    35642KiB/s fio_cpu_util( 2%  5%)
	BS 128k: IOPS      210 BW    27035KiB/s fio_cpu_util( 0%  0%)
+ ./zfio /dev/ublkb0 randwrite 1 40
randwrite /dev/ublkb0: jobs   1 io_depth   32 time   40sec
	BS   4k: IOPS    20500 BW    82001KiB/s fio_cpu_util( 5% 12%)
	BS 128k: IOPS     5622 BW   719792KiB/s fio_cpu_util( 6%  8%)

[1] https://github.com/ublk-org/rublk/commit/fd01a87abb2f9b8e94c8da24e73683e4bb12659b

[2] `zfio` (zone fio test script) https://github.com/ublk-org/rublk/blob/main/scripts/zfio

Thanks,
Ming