On Tue, Feb 04, 2025 at 12:22:53PM +0900, Damien Le Moal wrote:
> On 1/31/25 12:54, Ming Lei wrote:
> > On Wed, Jan 29, 2025 at 05:10:32PM +0900, Damien Le Moal wrote:
> >> On 1/24/25 21:30, Ming Lei wrote:
> >>>> 1 queue:
> >>>> ========
> >>>>                              +-------------------+-------------------+
> >>>>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> >>>> +----------------------------+-------------------+-------------------+
> >>>> | QD=1, 4K rnd wr, 1 job     | 11.7k / 47.8 MB/s | 15.8k / 53.0 MB/s |
> >>>> | QD=32, 4K rnd wr, 8 jobs   | 63.4k / 260 MB/s  | 101k / 413 MB/s   |
> >>>
> >>> I can't reproduce the above two; actually I don't observe an obvious
> >>> difference between rublk/zoned and zloop in my test VM.
> >>
> >> I am using bare-metal machines for these tests as I do not want any
> >> noise from a VM/hypervisor in the numbers. And I did say that this is with a
> >> tweaked version of zloop that I have not posted yet (I was waiting for rc1 to
> >> repost as a rebase is needed to correct a compilation failure due to the
> >> nomerge tag set flag being removed). I am attaching the patch I used here
> >> (it applies on top of the current Linus tree).
> >>
> >>> Maybe rublk works in debug mode, which usually reduces perf by half.
> >>> And you need to add the device via 'cargo run -r -- add zoned' to use
> >>> release mode.
> >>
> >> Well, that is not an obvious thing for someone who does not know rust well. The
> >> README file of rublk also does not mention that. So no, I did not run it like
> >> this. I followed the README and called rublk directly. It would be great to
> >> document that.
> >
> > OK, that is fine, and now you can install rublk/zoned with 'cargo
> > install rublk' directly, which always builds & installs the release
> > binary.
> >
> >>
> >>> Actually there is just a single io_uring_enter() running in each ublk queue
> >>> pthread, so perf should be similar to kernel IO handling, and the main extra
> >>> load is from the single syscall kernel/user context switch and IO data copy;
> >>> the data copy effect can usually be neglected for small IO sizes (< 64KB).
> >>>
> >>>> | QD=32, 128K rnd wr, 1 job  | 5008 / 656 MB/s   | 5993 / 786 MB/s   |
> >>>> | QD=32, 128K seq wr, 1 job  | 2636 / 346 MB/s   | 5393 / 707 MB/s   |
> >>>
> >>> ublk with 128K BS may be a little slower since there is one extra copy.
> >>
> >> Here are newer numbers running rublk as you suggested (using cargo run -r).
> >> The backend storage is on an XFS file system using a PCIe Gen4 4TB M.2 SSD that
> >> is empty (the FS is empty on start). The emulated zoned disk has a capacity of
> >> 512GB with sequential-only zones of 256 MB (that is, there are 2048
> >> zones/files). Each data point is from a 1-minute run of fio.
> >
> > Can you share how you create rublk/zoned and zloop, and the underlying
> > device info? Especially queue depth and nr_queues (for both rublk/zloop &
> > the underlying disk) play a big role.
>
> rublk:
>
> cargo run -r -- add zoned --size 524288 --zone-size 256 --conv-zones 0 \
>         --logical-block-size 4096 --queue ${nrq} --depth 128 \
>         --path /mnt/zloop/0
>
> zloop:
>
> echo "add conv_zones=0,capacity_mb=524288,zone_size_mb=256,\
> base_dir=/mnt/zloop,nr_queues=${nrq},queue_depth=128" > /dev/zloop-control

Zones are actually stateful, so maybe it is better to use standalone backing
directories/files for each device (a rough sketch follows the quoted text
below).

>
> The backing storage is using XFS on a PCIe Gen4 4TB M.2 SSD (my Xeon machine is
> PCIe Gen3 though). This drive has a large enough max_qid to provide one IO queue
> pair per CPU for up to 32 CPUs (16 cores / 32 threads).
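For example (just a rough, untested sketch, not what either of us ran; the
directory names are made up and ${nrq} is the queue count as in your
commands), each emulated device could get its own freshly created backing
directory so that zone state left over from a previous run cannot affect the
next one:

# separate, fresh backing directories for the two emulated devices
# (assuming zloop keeps the zone files of device 0 under base_dir/0)
mkdir -p /mnt/zloop/rublk-0 /mnt/zloop/zloop-0/0

# rublk/zoned backed by its own directory
cargo run -r -- add zoned --size 524288 --zone-size 256 --conv-zones 0 \
        --logical-block-size 4096 --queue ${nrq} --depth 128 \
        --path /mnt/zloop/rublk-0

# zloop backed by a different base_dir
echo "add conv_zones=0,capacity_mb=524288,zone_size_mb=256,\
base_dir=/mnt/zloop/zloop-0,nr_queues=${nrq},queue_depth=128" > /dev/zloop-control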
I just set up XFS over NVMe on real hardware and still can't reproduce the
big gap in your test results. The kernel is v6.13 with the zloop v2 patch.

`8 queues` should only make a difference for the "QD=32, 4K rnd wr, 8 jobs"
test. For the other single-job tests, a single queue should behave the same
as 8 queues.

The big gap is mainly in the 'QD=32, 128K seq wr, 1 job' test; maybe your
local change improves zloop's merging?

In my test:

- ublk/zoned : 912 MiB/s
- zloop (v2) : 960 MiB/s

BTW, my test is over btrfs, and this is the test script:

fio --size=32G --time_based --bsrange=128K-128K --runtime=40 --numjobs=1 \
        --ioengine=libaio --iodepth=32 --directory=./ublk \
        --group_reporting=1 --direct=1 \
        --fsync=0 --name=f1 --stonewall --rw=write

>
> > I will take your setting on real hardware and re-run the test after I
> > return from the Spring Festival holiday.
> >
> >>
> >> On an 8-core Intel Xeon test box, which has PCIe Gen3 only, I get:
> >>
> >> Single queue:
> >> =============
> >>                              +-------------------+-------------------+
> >>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> >> +----------------------------+-------------------+-------------------+
> >> | QD=1, 4K rnd wr, 1 job     | 2859 / 11.7 MB/s  | 5535 / 22.7 MB/s  |
> >> | QD=32, 4K rnd wr, 8 jobs   | 24.5k / 100 MB/s  | 24.6k / 101 MB/s  |
> >> | QD=32, 128K rnd wr, 1 job  | 14.9k / 1954 MB/s | 19.6k / 2571 MB/s |
> >> | QD=32, 128K seq wr, 1 job  | 1516 / 199 MB/s   | 10.6k / 1385 MB/s |
> >> +----------------------------+-------------------+-------------------+
> >>
> >> 8 queues:
> >> =========
> >>                              +-------------------+-------------------+
> >>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> >> +----------------------------+-------------------+-------------------+
> >> | QD=1, 4K rnd wr, 1 job     | 5387 / 22.1 MB/s  | 5436 / 22.3 MB/s  |
> >> | QD=32, 4K rnd wr, 8 jobs   | 16.4k / 67.0 MB/s | 26.3k / 108 MB/s  |
> >> | QD=32, 128K rnd wr, 1 job  | 6101 / 800 MB/s   | 19.8k / 2591 MB/s |
> >> | QD=32, 128K seq wr, 1 job  | 3987 / 523 MB/s   | 10.6k / 1391 MB/s |
> >> +----------------------------+-------------------+-------------------+
> >>
> >> I have no idea why ublk is generally slower when set up with 8 I/O queues. The
> >> qd=32 4K random write with 8 jobs is generally faster with ublk than zloop, but
> >> that varies. I tracked that down to CPU utilization, which is generally much
> >> better (all CPUs used) with ublk compared to zloop, as zloop is at the mercy of
> >> the workqueue code and how it schedules unbound work items.
> >
> > Maybe it is related to queue depth? The default ublk queue depth is
> > 128, and 8 jobs actually cause 256 in-flight IOs, and the default ublk
> > nr_queues is 1.
>
> See above: both rublk and zloop are set up with the exact same number of queues
> and max qd.
>
> > Another thing I mentioned is that ublk has one extra IO data copy, which
> > usually slows IO down, especially when the IO size is > 64K.
>
> Yes. I do keep this in mind when looking at the results.
>
> [...]
>
> >>> Simplicity needs to be observed from multiple dimensions; 300 vs. 1500 LoC has
> >>> shown something already, IMO.
> >>
> >> Sure. But given the very complicated syntax of rust, a lower LoC for rust
> >> compared to C is very subjective in my opinion.
> >>
> >> I said "simplicity" in the context of driver use. And rublk is not as
> >> simple to use as zloop, as it needs rust/cargo installed, which is not an
> >> acceptable dependency for xfstests.
> >> Furthermore, it is very annoying to have to
> >
> > xfstests just needs the user to pass the zoned block device, so the same test
> > can cover any zoned device.
>
> Sure. But the environment that allows that still needs to have the rust
> dependency to pull in and build rublk before using it to run the tests. That is
> more dependencies for a CI system or minimal VMs that are not necessarily based
> on a full distro but are used to run xfstests.

OK, it isn't too hard to solve (a rough helper sketch is appended below as a
P.S.):

- install `cargo` from the distribution if it doesn't exist
- run 'cargo install rublk' if rublk isn't installed

Thanks,
Ming
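P.S. An untested sketch of such a check (the package manager commands are
distro specific and only examples; adjust as needed):

#!/bin/sh
# cargo installs binaries into ~/.cargo/bin by default, so put it in PATH first.
export PATH="$HOME/.cargo/bin:$PATH"
# Install cargo via the distro package manager if it is missing.
if ! command -v cargo >/dev/null 2>&1; then
        sudo dnf install -y cargo || sudo apt-get install -y cargo
fi
# Build and install the release binary of rublk if it is not there yet.
if ! command -v rublk >/dev/null 2>&1; then
        cargo install rublk
fi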