io_uring performance with block sizes > 128k

I'm seeing a sizeable performance drop in polled fio tests once the block size exceeds 128k. The job options are:

filename=/dev/nvme0n1
rw=randread
direct=1
time_based=1
randrepeat=1
gtod_reduce=1

fio --readonly --ioengine=io_uring --iodepth 1024 --fixedbufs --hipri --numjobs=16
fio --readonly --ioengine=pvsync2 --iodepth 1024 --hipri --numjobs=16


Compared with the pvsync2 engine, the only major difference I could see is the direct I/O path: io_uring goes through __blkdev_direct_IO(), while pvsync2 goes through __blkdev_direct_IO_simple(), because of the is_sync_kiocb() check:


static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
        ...
        if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
                return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

        return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}
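
As a rough illustration of the check above, here is a minimal user-space model of the path selection. It is only a sketch: the names model_nr_pages() and model_pick_path() are made up for this example, and it assumes 4 KiB pages, page-aligned buffers, BIO_MAX_PAGES == 256, and that is_sync_kiocb() amounts to "no ki_complete callback is set". Under those assumptions 128k, 144k and 256k need 32, 36 and 64 pages, all well below the limit, so the page count alone would not separate the two paths at these block sizes; only the sync/async distinction does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumptions for this sketch, not taken from the kernel headers in use. */
#define MODEL_PAGE_SIZE     4096u  /* 4 KiB pages */
#define MODEL_BIO_MAX_PAGES 256u   /* assumed value of BIO_MAX_PAGES */

enum dio_path { DIO_SIMPLE, DIO_FULL };

/* Pages needed for a page-aligned buffer of `bytes` bytes (rounds up). */
static size_t model_nr_pages(size_t bytes)
{
        assert(bytes > 0);
        return (bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/*
 * Mirrors the shape of the check in blkdev_direct_IO(): a request with no
 * completion callback is treated as synchronous and, if it is small enough,
 * takes the _simple() path. Everything else takes the full path.
 */
static enum dio_path model_pick_path(bool has_completion_cb, size_t bytes)
{
        bool is_sync = !has_completion_cb;

        if (is_sync && model_nr_pages(bytes) <= MODEL_BIO_MAX_PAGES)
                return DIO_SIMPLE;
        return DIO_FULL;
}
```

For example, model_pick_path(true, 144 * 1024) models the io_uring case and yields DIO_FULL, while model_pick_path(false, 144 * 1024) models pvsync2 and yields DIO_SIMPLE.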

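Just as an experiment, I hacked the io_uring code to force it through the _simple() path by leaving kiocb->ki_complete NULL, which is what is_sync_kiocb() tests. With the hack I get better numbers for bs > 128k, though the variance is fairly high. The baseline drop at bs > 128k seems consistent: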


# baseline
READ: bw=3167MiB/s (3321MB/s), 186MiB/s-208MiB/s (196MB/s-219MB/s)   #128k
READ: bw=898MiB/s (941MB/s), 51.2MiB/s-66.1MiB/s (53.7MB/s-69.3MB/s) #144k
READ: bw=1576MiB/s (1652MB/s), 81.8MiB/s-109MiB/s (85.8MB/s-114MB/s) #256k

# hack
READ: bw=2705MiB/s (2836MB/s), 157MiB/s-174MiB/s (165MB/s-183MB/s) #128k
READ: bw=2901MiB/s (3042MB/s), 174MiB/s-194MiB/s (183MB/s-204MB/s) #144k
READ: bw=4194MiB/s (4398MB/s), 252MiB/s-271MiB/s (265MB/s-284MB/s) #256k
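
One way to read these results is to convert the aggregate bandwidth into IOPS. The helper below is a small sketch for that arithmetic; iops_from_bw() is a name invented here, and it assumes the "#128k" style tags are the block size in KiB and that the first bw= figure is the aggregate across all jobs. With those assumptions the baseline works out to roughly 25.3k IOPS at 128k, 6.4k at 144k and 6.3k at 256k, while the hack gives roughly 21.6k, 20.6k and 16.8k.

```c
#include <assert.h>

/*
 * IOPS implied by an aggregate bandwidth in MiB/s at a block size in KiB.
 * 1 MiB = 1024 KiB, so IOPS = bw * 1024 / bs.
 */
static double iops_from_bw(double bw_mib_per_s, double bs_kib)
{
        assert(bs_kib > 0.0);
        return bw_mib_per_s * 1024.0 / bs_kib;
}
```

Seen this way, the baseline loses about three quarters of its IOPS between 128k and 144k and then stays flat, whereas the hacked version declines gradually with block size.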


--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1972,12 +1972,12 @@ static int io_prep_rw(struct io_kiocb *req, const struct
                        return -EOPNOTSUPP;

                kiocb->ki_flags |= IOCB_HIPRI;
-               kiocb->ki_complete = io_complete_rw_iopoll;
+               kiocb->ki_complete = NULL;
                req->result = 0;
        } else {
                if (kiocb->ki_flags & IOCB_HIPRI)
                        return -EINVAL;
-               kiocb->ki_complete = io_complete_rw;
+               kiocb->ki_complete = NULL;
        }

        req->rw.addr = READ_ONCE(sqe->addr);
@@ -2005,7 +2005,12 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_
                ret = -EINTR;
                /* fall through */
        default:
-               kiocb->ki_complete(kiocb, ret, 0);
+               if (kiocb->ki_complete)
+                       kiocb->ki_complete(kiocb, ret, 0);
+               else if (kiocb->ki_flags & IOCB_HIPRI)
+                       io_complete_rw_iopoll(kiocb, ret, 0);
+               else
+                       io_complete_rw(kiocb, ret, 0);
        }
 }


With the baseline version, perf top shows a significant amount of time spent on lock contention.  I *think* the contended lock is nvmeq->sq_lock.

Does that make sense?  I do realize the hack defeats the purpose of io_uring, but I thought it might provide some clues as to what is going on.  Let me know if there is something else I can try.

Thanks.

--bijan




