On 7/29/19 12:17 PM, Jens Axboe wrote:
> On 7/29/19 12:15 PM, Greg Kroah-Hartman wrote:
>> On Mon, Jul 29, 2019 at 12:08:28PM -0600, Jens Axboe wrote:
>>> Hi,
>>> I forgot to mark a few patches for io_uring as stable. In order
>>> of how to apply, can you add the following commits for 5.2?
>>> f7b76ac9d17e16e44feebb6d2749fec92bfd6dd4
>>> 0ef67e605d2b1e8300d04fd9134d283bbbf441b9
>> Does not apply :(
>>> c0e48f9dea9129aa11bec3ed13803bcc26e96e49
>> Now queued up.
>>> bd11b3a391e3df6fa958facbe4b3f9f4cca9bd49
>> Does not apply :(
>>> 36703247d5f52a679df9da51192b6950fe81689f
>> Now queued up.
>> You are 2 out of 4 :)
>> Care to send backported versions of the 2 that did not apply? I'll be
>> glad to queue them up then.
> Huh strange, I applied them to our internal 5.2 tree without conflict.
> Maybe I had backported more...
> I'll send versions for 5.2 in a bit for you.
Here you go, those two on top of the others. Ran them through the
regression tests here, works for me.
--
Jens Axboe
From b00254467e0d2fa90a82b5ffb7d8e990f6fee8df Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@xxxxxxxxx>
Date: Sat, 20 Jul 2019 08:37:31 -0600
Subject: [PATCH 2/2] io_uring: don't use iov_iter_advance() for fixed buffers
Hrvoje reports that when a large fixed buffer is registered and IO is
being done to the latter pages of said buffer, the IO submission time
is much worse:
reading to the start of the buffer: 11238 ns
reading to the end of the buffer: 1039879 ns
In fact, it's worse by two orders of magnitude. The reason for that is
how io_uring figures out how to set up the iov_iter. We point the iter
at the first bvec, and then use iov_iter_advance() to fast-forward to
the offset within that buffer we need.
However, that is abysmally slow, as it entails iterating the bvecs
that we set up as part of buffer registration. There's really no need
to use this generic helper, as we know it's a BVEC type iterator, and
we also know that each bvec is PAGE_SIZE in size, apart from possibly
the first and last. Hence we can just use a shift on the offset to
find the right index, and then adjust the iov_iter appropriately.
After this fix, the timings are:
reading to the start of the buffer: 10135 ns
reading to the end of the buffer: 1377 ns
Or about a 755x improvement for the tail page.
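To make the difference between the two lookups concrete, here is a minimal
userspace sketch. It is not the kernel code: the seg_pos struct, the
SK_PAGE_SHIFT constant (4 KiB pages assumed) and both function names are
hypothetical, and a plain array of segment lengths stands in for the bvec
table. It only illustrates that a linear walk and a shift give the same
position when every segment after the first is exactly one page.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12                          /* assumption: 4 KiB pages */
#define SK_PAGE_SIZE  ((size_t)1 << SK_PAGE_SHIFT)

/* Hypothetical result type: segment index plus byte offset inside it. */
struct seg_pos {
    size_t index;
    size_t offset;
};

/* Linear walk over the segments, the way a generic advance helper works. */
struct seg_pos seek_linear(const size_t *seg_len, size_t nr_segs, size_t offset)
{
    struct seg_pos pos = { 0, 0 };

    while (pos.index < nr_segs && offset >= seg_len[pos.index]) {
        offset -= seg_len[pos.index];
        pos.index++;
    }
    assert(pos.index < nr_segs);   /* offset must lie inside the buffer */
    pos.offset = offset;
    return pos;
}

/*
 * Constant-time lookup: skip the (possibly short) first segment, then
 * use a shift and a mask, relying on all later segments being one page.
 */
struct seg_pos seek_shift(size_t first_len, size_t offset)
{
    struct seg_pos pos = { 0, offset };

    if (offset < first_len)
        return pos;

    offset -= first_len;
    pos.index = 1 + (offset >> SK_PAGE_SHIFT);
    pos.offset = offset & (SK_PAGE_SIZE - 1);
    return pos;
}
```

The linear version costs one loop iteration per skipped page, which is
where the time goes for IO near the end of a large registered buffer; the
shift version does the same work in a fixed number of operations.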
Reported-by: Hrvoje Zeba <zeba.hrvoje@xxxxxxxxx>
Tested-by: Hrvoje Zeba <zeba.hrvoje@xxxxxxxxx>
Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
(cherry picked from commit bd11b3a391e3df6fa958facbe4b3f9f4cca9bd49)
---
fs/io_uring.c | 39 +++++++++++++++++++++++++++++++++++++--
1 file changed, 37 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index c47f6bca760f..15e264e57f6c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -999,8 +999,43 @@ static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
*/
offset = buf_addr - imu->ubuf;
iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
- if (offset)
- iov_iter_advance(iter, offset);
+
+ if (offset) {
+ /*
+ * Don't use iov_iter_advance() here, as it's really slow for
+ * using the latter parts of a big fixed buffer - it iterates
+ * over each segment manually. We can cheat a bit here, because
+ * we know that:
+ *
+ * 1) it's a BVEC iter, we set it up
+ * 2) all bvecs are PAGE_SIZE in size, except potentially the
+ * first and last bvec
+ *
+ * So just find our index, and adjust the iterator afterwards.
+ * If the offset is within the first bvec (or the whole first
+ * bvec, just use iov_iter_advance(). This makes it easier
+ * since we can just skip the first segment, which may not
+ * be PAGE_SIZE aligned.
+ */
+ const struct bio_vec *bvec = imu->bvec;
+
+ if (offset <= bvec->bv_len) {
+ iov_iter_advance(iter, offset);
+ } else {
+ unsigned long seg_skip;
+
+ /* skip first vec */
+ offset -= bvec->bv_len;
+ seg_skip = 1 + (offset >> PAGE_SHIFT);
+
+ iter->bvec = bvec + seg_skip;
+ iter->nr_segs -= seg_skip;
+ iter->count -= (seg_skip << PAGE_SHIFT);
+ iter->iov_offset = offset & ~PAGE_MASK;
+ if (iter->iov_offset)
+ iter->count -= iter->iov_offset;
+ }
+ }
/* don't drop a reference to these pages */
iter->type |= ITER_BVEC_FLAG_NO_REF;
--
2.17.1
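For readers following the bookkeeping in the else branch of the patch
above, here is a small userspace sketch that replays the same arithmetic
on plain integers. The sk_iter struct and the function name are
hypothetical stand-ins for the iov_iter fields the patch touches, and
4 KiB pages are assumed. Working it through by hand, the remaining count
comes out as first_len + len - PAGE_SIZE, which equals len when the first
segment is a full page. How often a registered buffer starts with a
shorter first segment is not discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12                          /* assumption: 4 KiB pages */
#define SK_PAGE_SIZE  ((size_t)1 << SK_PAGE_SHIFT)

/* Hypothetical stand-in for the iterator fields the patch adjusts. */
struct sk_iter {
    size_t seg;         /* segments skipped (iter->bvec - imu->bvec) */
    size_t count;       /* bytes remaining (iter->count) */
    size_t seg_offset;  /* offset inside current segment (iter->iov_offset) */
};

/* Replays the else branch: the iterator starts out covering offset + len. */
struct sk_iter sk_skip_as_in_patch(size_t first_len, size_t offset, size_t len)
{
    struct sk_iter it = { 0, offset + len, 0 };
    size_t seg_skip;

    assert(offset > first_len);    /* the patch takes this branch only then */

    offset -= first_len;           /* skip first vec */
    seg_skip = 1 + (offset >> SK_PAGE_SHIFT);

    it.seg = seg_skip;
    it.count -= seg_skip << SK_PAGE_SHIFT;
    it.seg_offset = offset & (SK_PAGE_SIZE - 1);
    it.count -= it.seg_offset;
    return it;
}
```

With first_len = 4096, offset = 12388 and len = 512, this yields segment
3, in-segment offset 100 and a remaining count of 512, i.e. exactly len.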
From 879d1e652332740de25ecc6091e7c1b82e7a3b24 Mon Sep 17 00:00:00 2001
From: Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx>
Date: Tue, 16 Jul 2019 23:26:14 +0800
Subject: [PATCH 1/2] io_uring: fix counter inc/dec mismatch in async_list
A work item can be queued for each req on the defer and link lists
without incrementing async_list->cnt, so the workqueue exit path must
not decrement the counter for a req that never went through the async list.
Thanks to Jens Axboe <axboe@xxxxxxxxx> for his guidance.
Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests")
Signed-off-by: Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx>
Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
(cherry picked from commit f7b76ac9d17e16e44feebb6d2749fec92bfd6dd4)
---
fs/io_uring.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index c6598fb786c3..c47f6bca760f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -330,6 +330,9 @@ struct io_kiocb {
#define REQ_F_SEQ_PREV 8 /* sequential with previous */
#define REQ_F_IO_DRAIN 16 /* drain existing IO first */
#define REQ_F_IO_DRAINED 32 /* drain done */
+#define REQ_F_LINK 64 /* linked sqes */
+#define REQ_F_LINK_DONE 128 /* linked sqes done */
+#define REQ_F_FAIL_LINK 256 /* fail rest of links */
u64 user_data;
u32 error; /* iopoll result from callback */
u32 sequence;
@@ -1696,6 +1699,10 @@ static void io_sq_wq_submit_work(struct work_struct *work)
/* async context always use a copy of the sqe */
kfree(sqe);
+ /* req from defer and link list needn't decrease async cnt */
+ if (req->flags & (REQ_F_IO_DRAINED | REQ_F_LINK_DONE))
+ goto out;
+
if (!async_list)
break;
if (!list_empty(&req_list)) {
@@ -1743,6 +1750,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
}
}
+out:
if (cur_mm) {
set_fs(old_fs);
unuse_mm(cur_mm);
--
2.17.1
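The inc/dec mismatch fixed by patch 1/2 can be shown with a small
userspace model. Everything here is hypothetical: the sk_ names, the
single SK_REQ_DEFERRED flag (standing in for REQ_F_IO_DRAINED and
REQ_F_LINK_DONE) and the plain int counter (the sketch is single-threaded
and ignores atomics). It is only meant to show why the exit path needs a
guard when one queueing path never incremented the counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_REQ_DEFERRED 1u   /* hypothetical: req came from defer/link list */

struct sk_async_list { int cnt; };
struct sk_req { unsigned int flags; };

/* Async path: the request is accounted for in the list counter. */
void sk_queue_async(struct sk_async_list *list, struct sk_req *req)
{
    assert(list != NULL && req != NULL);
    req->flags = 0;
    list->cnt++;
}

/* Defer/link path: work is queued without touching the counter. */
void sk_queue_deferred(struct sk_req *req)
{
    assert(req != NULL);
    req->flags = SK_REQ_DEFERRED;
}

/*
 * Work exit. With guarded == true, requests that never incremented the
 * counter skip the decrement, mirroring the "goto out" in the patch.
 */
void sk_work_done(struct sk_async_list *list, const struct sk_req *req,
                  bool guarded)
{
    assert(list != NULL && req != NULL);
    if (guarded && (req->flags & SK_REQ_DEFERRED))
        return;
    list->cnt--;
}
```

Running one async and one deferred request through the guarded exit
leaves the counter at 0; without the guard the same sequence leaves it
at -1.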