Re: io_uring performance with block sizes > 128k

On 3/2/2020 9:01 PM, Jens Axboe wrote:
On 3/2/20 4:57 PM, Jens Axboe wrote:
On 3/2/20 4:55 PM, Bijan Mottahedeh wrote:
I'm seeing a sizeable drop in perf with polled fio tests for block sizes > 128k:

filename=/dev/nvme0n1
rw=randread
direct=1
time_based=1
randrepeat=1
gtod_reduce=1

fio --readonly --ioengine=io_uring --iodepth 1024 --fixedbufs --hipri --numjobs=16
fio --readonly --ioengine=pvsync2 --iodepth 1024 --hipri --numjobs=16


Compared with the pvsync2 engine, the only major difference I could see
was in the direct I/O path: io_uring goes through __blkdev_direct_IO(),
while pvsync2 goes through __blkdev_direct_IO_simple(), because of the
is_sync_kiocb() check in blkdev_direct_IO():


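static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
	...
	/* Synchronous kiocb that fits in one bio: single-bio fast path. */
	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

	/* Async kiocbs (e.g. from io_uring) always take the general path. */
	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}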

Just as an experiment, I hacked the io_uring code to force it through the
_simple() path and got better numbers for bs > 128k. The variance is
fairly high, but the drop at bs > 128k in the baseline seems consistent:


# baseline
READ: bw=3167MiB/s (3321MB/s), 186MiB/s-208MiB/s (196MB/s-219MB/s)   #128k
READ: bw=898MiB/s (941MB/s), 51.2MiB/s-66.1MiB/s (53.7MB/s-69.3MB/s) #144k
READ: bw=1576MiB/s (1652MB/s), 81.8MiB/s-109MiB/s (85.8MB/s-114MB/s) #256k

# hack
READ: bw=2705MiB/s (2836MB/s), 157MiB/s-174MiB/s (165MB/s-183MB/s) #128k
READ: bw=2901MiB/s (3042MB/s), 174MiB/s-194MiB/s (183MB/s-204MB/s) #144k
READ: bw=4194MiB/s (4398MB/s), 252MiB/s-271MiB/s (265MB/s-284MB/s) #256k
A quick guess would be that the IO is being split above 128K, and hence
the polling only catches one of the parts?
Can you try and see if this makes a difference?


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 571b510ef0e7..cf7599a2c503 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1725,8 +1725,10 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
  		if (ret < 0)
  			break;
+#if 0
  		if (ret && spin)
  			spin = false;
+#endif
  		ret = 0;
  	}
I didn't see a difference.

If the request is split into two bios, is REQ_F_IOPOLL_COMPLETED set only when the 2nd bio completes?

I think you mentioned before that the request is split by __blk_queue_split(), but I haven't yet been able to see exactly how that happens. The request size I see in nvme_queue_rq() is the same as the original (e.g. 256k); is that expected?



