Hi,

> -----Original Message-----
> From: Ming Lei <ming.lei@xxxxxxxxxx>
> Sent: Friday, February 14, 2025 10:41 AM
> To: lizetao <lizetao1@xxxxxxxxxx>
> Cc: Keith Busch <kbusch@xxxxxxxx>; io-uring@xxxxxxxxxxxxxxx;
> axboe@xxxxxxxxx; Ming Lei <ming.lei@xxxxxxxxxx>
> Subject: Re: [PATCHv2 0/6] ublk zero-copy support
>
> On Thu, Feb 13, 2025 at 11:24 PM lizetao <lizetao1@xxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Keith Busch <kbusch@xxxxxxxx>
> > > Sent: Tuesday, February 11, 2025 8:57 AM
> > > To: ming.lei@xxxxxxxxxx; asml.silence@xxxxxxxxx; axboe@xxxxxxxxx;
> > > linux-block@xxxxxxxxxxxxxxx; io-uring@xxxxxxxxxxxxxxx
> > > Cc: bernd@xxxxxxxxxxx; Keith Busch <kbusch@xxxxxxxxxx>
> > > Subject: [PATCHv2 0/6] ublk zero-copy support
> > >
> > > From: Keith Busch <kbusch@xxxxxxxxxx>
> > >
> > > Previous version was discussed here:
> > >
> > > https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@xxxxxxxx/
> > >
> > > The same ublksrv reference code in that link was used to test the
> > > kernel side changes.
> > >
> > > Before listing what has changed, I want to mention what is the same:
> > > the reliance on the ring ctx lock to serialize the register ahead of
> > > any use. I'm not ignoring the feedback; I just don't have a solid
> > > answer right now, and want to progress on the other fronts in the
> > > meantime.
> > >
> > > Here's what's different from the previous:
> > >
> > >  - Introduced an optional 'release' callback when the resource node is
> > >    no longer referenced. The callback addresses any buggy applications
> > >    that may complete their request and unregister their index while IO
> > >    is in flight. This obviates any need to take extra page references
> > >    since it prevents the request from completing.
> > >
> > >  - Removed peeking into the io_cache element size and instead use a
> > >    more intuitive bvec segment count limit to decide if we're caching
> > >    the imu (suggested by Pavel).
> > >
> > >  - Dropped the const request changes; it's not needed.
> >
> > I tested this patch set. When I use null as the device, the test results
> > are like your v1: when the bs is 4k, there is a slight improvement; when
> > the bs is 64k, there is a significant improvement.
>
> Yes, the improvement is usually more obvious with a big IO size (>= 64K).
>
> > However, when I used loop as the device, I found that there was no
> > improvement, whether using 4k or 64k. As follows:
> >
> > ublk add -t loop -f ./ublk-loop.img
> > ublk add -t loop -f ./ublk-loop-zerocopy.img
> >
> > fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> > read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
> >
> > fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> > read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
> >
> > So, is this patch set optimized only for null-type devices? Or if I've
> > missed any key information, please let me know.
>
> Latency may have decreased a bit.
>
> System resources can't be saturated at single queue depth; please run the
> same test with a high queue depth, per Keith's suggestion:
>
> --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16

I tested it with these settings, but the result is similar to iodepth=1:

fio -filename=/dev/ublkb0 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
read: IOPS=2182, BW=136MiB/s (143MB/s)(1440MiB/10558msec)

fio -filename=/dev/ublkb1 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
read: IOPS=2174, BW=136MiB/s (143MB/s)(1438MiB/10580msec)

So I believe this case is limited by the performance of the file system where
./ublk-loop.img is located.

> Also, if you set up the backing file as a ramfs image, the improvement
> should be pretty obvious; I observed IOPS doubled this way.

This is true. I tested it in /tmp/ and saw a large improvement. The results
are as follows:

fio -filename=/dev/ublkb0 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
read: IOPS=95.8k, BW=5985MiB/s (6276MB/s)(58.5GiB/10014msec)

fio -filename=/dev/ublkb1 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
read: IOPS=170k, BW=10.4GiB/s (11.1GB/s)(80.0GiB/7721msec)

So this test result is in line with expectations.

---
Li Zetao
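
P.S. For anyone reproducing the /tmp numbers above, a minimal sketch of one
possible setup (not necessarily the exact commands used here): it assumes /tmp
is mounted as tmpfs on the test machine, the 8G image size is only chosen to
match the fio -size above, and whatever option enables zero copy for
/dev/ublkb1 in the ublksrv reference code is omitted, as it was in the earlier
commands:

truncate -s 8G /tmp/ublk-loop.img
truncate -s 8G /tmp/ublk-loop-zerocopy.img
ublk add -t loop -f /tmp/ublk-loop.img
ublk add -t loop -f /tmp/ublk-loop-zerocopy.img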