On Fri, Jan 24, 2025 at 06:30:19PM +0900, Damien Le Moal wrote:
> On 1/10/25 21:34, Ming Lei wrote:
> >> It is easy to extend rublk/zoned in this way with io_uring io emulation, :-)
> >
> > Here it is:
> >
> > https://github.com/ublk-org/rublk/commits/file-backed-zoned/
> >
> > The top two commits implement the feature via the command-line option `--path $zdir`:
> >
> > [rublk]# git diff --stat=80 HEAD^^...
> >  src/zoned.rs   | 397 +++++++++++++++++++++++++++++++++++++++++++++++----------
> >  tests/basic.rs |  49 ++++---
> >  2 files changed, 363 insertions(+), 83 deletions(-)
> >
> > It takes 280 new LoC to:
> >
> > - support both ram-backed and file-backed modes
> > - provide completely async io_uring IO emulation for zoned read/write IO
> > - include selftest code running mkfs.btrfs/mount/read & write IO/umount
>
> Hi Ming,
>
> My apologies for the late reply. Conference travel kept me busy.
> Thank you for doing this. I gave it a try and measured the performance for
> some write workloads (using the current Linus tree, which includes the block
> PR for 6.14). The zloop results shown here are with a slightly tweaked
> version (not posted) that uses a work item per command instead of a single
> work item for all commands.
>
> 1 queue:
> ========
>                              +-------------------+-------------------+
>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> +----------------------------+-------------------+-------------------+
> | QD=1, 4K rnd wr, 1 job     | 11.7k / 47.8 MB/s | 15.8k / 53.0 MB/s |
> | QD=32, 4K rnd wr, 8 jobs   | 63.4k / 260 MB/s  | 101k / 413 MB/s   |

I can't reproduce the above two results: in my test VM I don't observe any
obvious difference between rublk/zoned and zloop. Perhaps your rublk ran in
debug mode, which usually cuts performance roughly in half; you need to add
the device via 'cargo run -r -- add zoned' to get a release build.

There is just a single io_uring_enter() running in each ublk queue pthread,
so performance should be close to kernel-side IO handling. The main extra
overhead is the one kernel/user context switch per syscall plus the IO data
copy, and the copy cost is usually negligible for small IO sizes (< 64KB).
(See the sketch at the end of this mail.)

> | QD=32, 128K rnd wr, 1 job  | 5008 / 656 MB/s   | 5993 / 786 MB/s   |
> | QD=32, 128K seq wr, 1 job  | 2636 / 346 MB/s   | 5393 / 707 MB/s   |

ublk with a 128K block size may be a little slower since there is one extra
data copy.

> +----------------------------+-------------------+-------------------+
>
> 8 queues:
> =========
>                              +-------------------+-------------------+
>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> +----------------------------+-------------------+-------------------+
> | QD=1, 4K rnd wr, 1 job     | 9699 / 39.7 MB/s  | 16.7k / 68.6 MB/s |
> | QD=32, 4K rnd wr, 8 jobs   | 58.2k / 238 MB/s  | 108k / 444 MB/s   |
> | QD=32, 128K rnd wr, 1 job  | 4160 / 545 MB/s   | 5715 / 749 MB/s   |
> | QD=32, 128K seq wr, 1 job  | 3274 / 429 MB/s   | 5934 / 778 MB/s   |
> +----------------------------+-------------------+-------------------+
>
> As you can see, zloop is generally much faster. These are the best results
> from several runs, as run-to-run performance variation can be significant
> (for both ublk and zloop).
>
> But as mentioned before, since this is intended to be a test tool for file
> systems, performance is not the primary goal here (though the higher the
> better, as that shortens test times). Simplicity is. And as Ted also stated,
> introducing a ublk and Rust dependency in xfstests is far from ideal.

Simplicity needs to be judged along several dimensions; 300 vs. 1500 LoC
already says something, IMO.

Thanks,
Ming
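
P.S. To make the single-syscall point concrete, here is a minimal sketch of
such a per-queue loop. This is not rublk's actual code (rublk's emulation is
completely async rather than batch-synchronous like this); it assumes the
Rust 'io-uring' crate, and queue_loop(), backing_fd and pending are
illustrative names only:

use io_uring::{opcode, types, IoUring};
use std::os::unix::io::RawFd;

// One loop iteration: submit a write SQE per pending emulated command and
// reap all completions, paying a single io_uring_enter() syscall.
fn queue_loop(backing_fd: RawFd, pending: &[(u64, Vec<u8>)]) -> std::io::Result<()> {
    let mut ring = IoUring::new(256)?;

    for (tag, (off, buf)) in pending.iter().enumerate() {
        // Tag each SQE so its completion can be matched back to the
        // emulated ublk command.
        let sqe = opcode::Write::new(types::Fd(backing_fd), buf.as_ptr(), buf.len() as _)
            .offset(*off)
            .build()
            .user_data(tag as u64);
        unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    }

    // The only kernel/user context switch in the loop body: submit the
    // whole batch and wait for all completions in one io_uring_enter().
    ring.submit_and_wait(pending.len())?;

    for cqe in ring.completion() {
        let tag = cqe.user_data(); // which emulated command completed
        let res = cqe.result();    // bytes written, or -errno
        // ...complete the ublk command for `tag` with `res`...
        let _ = (tag, res);
    }
    Ok(())
}

rublk drives each IO as its own async task on top of the same ring instead
of batching like this, but the io_uring_enter() pattern per queue pthread is
the same, which is why the overhead should stay close to kernel-side IO
handling.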