io_uring performance with block sizes > 128k

Bijan Mottahedeh <bijan.mottahedeh@xxxxxxxxxx> · Mon, 2 Mar 2020 15:55:46 -0800

I'm seeing a sizeable drop in performance with polled fio tests for
block sizes above 128k:

filename=/dev/nvme0n1
rw=randread
direct=1
time_based=1
randrepeat=1
gtod_reduce=1

fio --readonly --ioengine=io_uring --iodepth 1024 --fixedbufs --hipri \
    --numjobs=16
fio --readonly --ioengine=pvsync2 --iodepth 1024 --hipri --numjobs=16


Compared with the pvsync2 engine, the only major difference I could see 
was the dio path, __blkdev_direct_IO() for io_uring vs. 
__blkdev_direct_IO_simple() for pvsync2 because of the is_sync_kiocb() 
check.


static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
        ...
        if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
                return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

        return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}
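
To make the path selection above concrete, here is a small userspace
model of that check. It is only a sketch: kiocb_model, the model_*
helpers and the 4 KiB page size are assumptions for illustration, not
kernel code. It relies on is_sync_kiocb() treating a kiocb as
synchronous when ki_complete is NULL, and on BIO_MAX_PAGES being 256.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BIO_MAX_PAGES 256   /* BIO_MAX_PAGES in kernels of this era */
#define MODEL_PAGE_SIZE     4096  /* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct kiocb: only the completion callback. */
struct kiocb_model {
        void (*ki_complete)(void);
};

enum dio_path { DIO_SIMPLE, DIO_FULL };

/* Models is_sync_kiocb(): "sync" means no completion callback is set. */
static bool model_is_sync(const struct kiocb_model *iocb)
{
        return iocb->ki_complete == NULL;
}

/* Pages spanned by a page-aligned buffer of bs bytes (assumption). */
static int model_nr_pages(size_t bs)
{
        return (int)((bs + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE);
}

/* Models the branch in blkdev_direct_IO() quoted above. */
static enum dio_path model_pick_path(const struct kiocb_model *iocb, size_t bs)
{
        if (model_is_sync(iocb) && model_nr_pages(bs) <= MODEL_BIO_MAX_PAGES)
                return DIO_SIMPLE;
        return DIO_FULL;
}

/* Placeholder callback standing in for an async completion handler. */
static void model_async_complete(void)
{
}
```

Under these assumptions, 128k, 144k and 256k requests span 32, 36 and 64
pages, all well below 256, so in this model the choice of path depends
only on whether a completion callback is set.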

As an experiment, I hacked the io_uring code to force it through the
_simple() path. The numbers are better for bs > 128k. The variance is
fairly high, but the baseline drop at bs > 128k seems consistent:


# baseline
READ: bw=3167MiB/s (3321MB/s), 186MiB/s-208MiB/s (196MB/s-219MB/s)   #128k
READ: bw=898MiB/s (941MB/s), 51.2MiB/s-66.1MiB/s (53.7MB/s-69.3MB/s) #144k
READ: bw=1576MiB/s (1652MB/s), 81.8MiB/s-109MiB/s (85.8MB/s-114MB/s) #256k

# hack
READ: bw=2705MiB/s (2836MB/s), 157MiB/s-174MiB/s (165MB/s-183MB/s) #128k
READ: bw=2901MiB/s (3042MB/s), 174MiB/s-194MiB/s (183MB/s-204MB/s) #144k
READ: bw=4194MiB/s (4398MB/s), 252MiB/s-271MiB/s (265MB/s-284MB/s) #256k
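
For a quick comparison of the two runs, the aggregate bandwidths above
can be turned into hack/baseline ratios. This is just arithmetic on the
reported numbers; the bw_sample struct and the speedup() helper are
names made up for this sketch.

```c
#include <assert.h>

/* Aggregate READ bandwidth in MiB/s, copied from the fio output above. */
struct bw_sample {
        int    bs_kib;
        double baseline_mibs;
        double hack_mibs;
};

static const struct bw_sample samples[] = {
        { 128, 3167.0, 2705.0 },
        { 144,  898.0, 2901.0 },
        { 256, 1576.0, 4194.0 },
};

/* Ratio of hacked to baseline bandwidth; above 1.0 means the hack is faster. */
static double speedup(const struct bw_sample *s)
{
        return s->hack_mibs / s->baseline_mibs;
}
```

By this measure the hack is about 15% slower at 128k, and roughly 3.2x
and 2.7x faster at 144k and 256k respectively, keeping in mind the run
to run variance.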

--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1972,12 +1972,12 @@ static int io_prep_rw(struct io_kiocb *req, const struct
                        return -EOPNOTSUPP;

                kiocb->ki_flags |= IOCB_HIPRI;
-               kiocb->ki_complete = io_complete_rw_iopoll;
+               kiocb->ki_complete = NULL;
                req->result = 0;
        } else {
                if (kiocb->ki_flags & IOCB_HIPRI)
                        return -EINVAL;
-               kiocb->ki_complete = io_complete_rw;
+               kiocb->ki_complete = NULL;
        }

        req->rw.addr = READ_ONCE(sqe->addr);
@@ -2005,7 +2005,12 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_
                ret = -EINTR;
                /* fall through */
        default:
-               kiocb->ki_complete(kiocb, ret, 0);
+               if (kiocb->ki_complete)
+                       kiocb->ki_complete(kiocb, ret, 0);
+               else if (kiocb->ki_flags & IOCB_HIPRI)
+                       io_complete_rw_iopoll(kiocb, ret, 0);
+               else
+                       io_complete_rw(kiocb, ret, 0);
        }
 }


With the baseline version, perf top shows a significant amount of time
spent on lock contention.  I *think* it is nvmeq->sq_lock.

Does that make sense?  I do realize the hack defeats the purpose of
io_uring, but I thought it might provide some clues as to what is going
on.  Let me know if there is something else I can try.

Thanks.

--bijan