On 3/14/23 6:57 AM, Ming Lei wrote:
> Basically userspace can specify any sub-buffer of the ublk block request
> buffer from the fused command just by setting 'offset/len'
> in the slave SQE for running slave OP. This way is flexible to implement
> io mapping: mirror, stripped, ...
>
> The 3th & 4th patches enable fused slave support for the following OPs:
>
>	OP_READ/OP_WRITE
>	OP_SEND/OP_RECV/OP_SEND_ZC
>
> The other ublk patches cleans ublk driver and implement fused command
> for supporting zero copy.
>
> Follows userspace code:
>
> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-v2

Ran some quick testing here with qcow2. This is just done on my laptop
in kvm, so take the numbers with a grain of salt; results may be better
elsewhere.

Baseline:

64k reads	98-100K IOPS	6-6.1GB/sec	(ublk 100%, io_uring 9%)
4k reads	670-680K IOPS	2.6GB/sec	(ublk 65%, io_uring 44%)

and with zerocopy enabled:

64k reads	184K IOPS	11.5GB/sec	(ublk 91%, io_uring 12%)
4k reads	730K IOPS	2.8GB/sec	(ublk 73%, io_uring 48%)

and with zerocopy and using SINGLE_ISSUER|COOP_TASKRUN for the ring:

64k reads	205K IOPS	12.8GB/sec	(ublk 91%, io_uring 12%)
4k reads	730K IOPS	2.8GB/sec	(ublk 66%, io_uring 42%)

Don't put too much into the CPU utilization numbers, they are just
indicative and not super accurate. But overall a nice win for larger
block sizes with zero copy. We seem to be IOPS limited on this
particular setup, which is most likely why 4k isn't showing any major
wins here. E.g. running 8k with zero copy, I get the same IOPS limit,
just obviously doubling the bandwidth of the 4k run:

IOPS=732.26K, BW=5.72GiB/s, IOS/call=32/32
IOPS=733.38K, BW=5.73GiB/s, IOS/call=32/32

I also tried using DEFER_TASKRUN, but it stalls on setup. Most likely
something trivial, didn't poke any further at that.

--
Jens Axboe
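
A quick sanity check on the figures above, as a minimal sketch: the
reported bandwidth values look consistent with IOPS (in thousands) times
block size (in KiB) divided by 1024. That conversion is inferred from the
numbers in this mail, not taken from the benchmark tool, and the helper
names `bandwidth` and `speedup` are made up for illustration. Where the
mail gives a range, the upper end is used.

```python
def bandwidth(kiops: float, block_kib: int) -> float:
    """Bandwidth estimate from IOPS (thousands) and block size (KiB).

    Assumption: the divide-by-1024 conversion is inferred from the
    figures in the mail, not from the benchmark tool itself.
    """
    return kiops * block_kib / 1024


def speedup(new_kiops: float, base_kiops: float) -> float:
    """Ratio of two IOPS figures."""
    return new_kiops / base_kiops


# (label, IOPS in thousands, block size in KiB), copied from the mail.
runs = [
    ("baseline 64k", 100.0, 64),
    ("baseline 4k", 680.0, 4),
    ("zerocopy 64k", 184.0, 64),
    ("zerocopy 4k", 730.0, 4),
    ("zerocopy + ring flags 64k", 205.0, 64),
    ("zerocopy 8k (first sample)", 732.26, 8),
]

for label, kiops, block_kib in runs:
    print(f"{label:<28} {kiops:>7.2f}K IOPS  ~{bandwidth(kiops, block_kib):.2f}")

print(f"64k zerocopy vs baseline:   {speedup(184.0, 100.0):.2f}x")
print(f"64k with ring flags:        {speedup(205.0, 100.0):.2f}x")
print(f"4k zerocopy vs baseline:    {speedup(730.0, 680.0):.2f}x")
```

With these inputs the 64k zero-copy run comes out at about 1.84x the
baseline IOPS (2.05x with SINGLE_ISSUER|COOP_TASKRUN), while 4k gains
only about 7%, which fits the IOPS-limited explanation.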