Re: [PATCH 2/3] io_uring: use iov_iter state save/restore helpers

Jens Axboe <axboe@xxxxxxxxx> · Tue, 14 Sep 2021 17:02:27 -0600

On 9/14/21 1:37 PM, Jens Axboe wrote:
> On 9/14/21 12:45 PM, Linus Torvalds wrote:
>> On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>
>>>
>>> +       iov_iter_restore(iter, state);
>>> +
>> ...
>>>                 rw->bytes_done += ret;
>>> +               iov_iter_advance(iter, ret);
>>> +               if (!iov_iter_count(iter))
>>> +                       break;
>>> +               iov_iter_save_state(iter, state);
>>
>> Ok, so now you keep iov_iter and the state always in sync by just
>> always resetting the iter back and then walking it forward explicitly
>> - and re-saving the state.
>>
>> That seems safe, if potentially unnecessarily expensive.
> 
> Right, it's not ideal if it's a big range of IO, then it'll definitely
> be noticeable. But not too worried about it, at least not for now...
> 
>> I guess re-walking lots of iovec entries is actually very unlikely in
>> practice, so maybe this "stupid brute-force" model is the right one.
> 
> Not sure what the alternative is here. We could do something similar to
> __io_import_fixed() as we're only dealing with iter types where we can
> do that, but probably best left as a later optimization if it's deemed
> necessary.
> 
>> I do find the odd "use __state vs rw->state" to be very confusing,
>> though. Particularly in io_read(), where you do this:
>>
>> +       iov_iter_restore(iter, state);
>> +
>>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>         if (ret2)
>>                 return ret2;
>>
>>         iovec = NULL;
>>         rw = req->async_data;
>> -       /* now use our persistent iterator, if we aren't already */
>> -       iter = &rw->iter;
>> +       /* now use our persistent iterator and state, if we aren't already */
>> +       if (iter != &rw->iter) {
>> +               iter = &rw->iter;
>> +               state = &rw->iter_state;
>> +       }
>>
>>         do {
>> -               io_size -= ret;
>>                 rw->bytes_done += ret;
>> +               iov_iter_advance(iter, ret);
>> +               if (!iov_iter_count(iter))
>> +                       break;
>> +               iov_iter_save_state(iter, state);
>>
>>
>> Note how it first does that iov_iter_restore() on iter/state, but
>> then it *replaces* the iter/state pointers, and then it does
>> iov_iter_advance() on the replacement ones.
> 
> We restore the iter so it's the same as before we did the read_iter
> call, and then setup a consistent copy of the iov/iter in case we need
> to punt this request for retry. rw->iter should have the same state as
> iter at this point, and since rw->iter is the copy we'll use going
> forward, we're advancing that one in case ret > 0.
> 
> The other case is that no persistent state is needed, and then iter
> remains the same.
> 
> I'll take a second look at this part and see if I can make it a bit more
> straightforward, or at least comment it properly.
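
For anyone following along, the restore/advance/save sequence discussed
above can be modelled in userspace. This is only a sketch: toy_iter,
toy_iter_state and the toy_* helpers are made-up stand-ins, not the
kernel's struct iov_iter or its helpers, and they assume a plain array
of segments. The point it illustrates is that whatever the lower layer
did to the iterator, rewinding to the saved state and advancing by the
returned byte count leaves iterator and state consistent again.

```c
#include <stddef.h>

/* Toy stand-ins for illustration only; not the kernel definitions. */
struct toy_seg {
	size_t len;
};

struct toy_iter {
	const struct toy_seg *segs;	/* current segment */
	size_t nr_segs;			/* segments left, including current */
	size_t seg_off;			/* offset into the current segment */
	size_t count;			/* total bytes left */
};

struct toy_iter_state {
	size_t seg_off;
	size_t count;
	size_t nr_segs;
};

static void toy_save_state(const struct toy_iter *i, struct toy_iter_state *s)
{
	s->seg_off = i->seg_off;
	s->count = i->count;
	s->nr_segs = i->nr_segs;
}

static void toy_restore(struct toy_iter *i, const struct toy_iter_state *s)
{
	/*
	 * The segment pointer only moves forward, so the difference in
	 * nr_segs says how many segments to step back.
	 */
	i->segs -= s->nr_segs - i->nr_segs;
	i->nr_segs = s->nr_segs;
	i->seg_off = s->seg_off;
	i->count = s->count;
}

static void toy_advance(struct toy_iter *i, size_t bytes)
{
	if (bytes > i->count)
		bytes = i->count;
	i->count -= bytes;

	/* Walk forward from the start of the current segment. */
	bytes += i->seg_off;
	while (i->nr_segs && bytes >= i->segs->len) {
		bytes -= i->segs->len;
		i->segs++;
		i->nr_segs--;
	}
	i->seg_off = bytes;
}

/*
 * One retry step: rewind to the last saved state, account for 'ret'
 * bytes, and re-save if anything is left. Returns the bytes remaining.
 */
static size_t toy_resync(struct toy_iter *i, struct toy_iter_state *s,
			 size_t ret)
{
	toy_restore(i, s);
	toy_advance(i, ret);
	if (i->count)
		toy_save_state(i, s);
	return i->count;
}
```

The cost Linus mentions is visible in toy_advance(): every resync walks
the segments from the saved position again, which only matters when a
single step covers a large number of segments.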

I hacked up something that shortens the iter for the initial IO, so we
could more easily test the retry path and the state. It really is a
hack, but the idea was to issue 64K IO from fio, and then the initial
attempt would be anywhere from 4K-60K truncated. That forces retry.
I ran this with both 16 segments and 8 segments, verifying that it
hits both the UIO_FASTIOV and alloc paths.

I did find one issue with that, see the last hunk in the hack. We
need to increment rw->bytes_done if we don't break, or set ret to
0 if we do. Otherwise that last ret ends up being accounted twice.
But apart from that, it passes data verification runs.
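
To make the double accounting concrete, here is a small userspace model
of the retry loop. It is a sketch under assumptions, not the kernel
code: toy_rw, toy_complete() and toy_read_loop() are hypothetical names,
every retry is assumed to transfer a fixed chunk, and the completion
step is assumed to report the last return value plus bytes_done.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-request retry bookkeeping. */
struct toy_rw {
	size_t remaining;	/* bytes still to transfer */
	size_t bytes_done;	/* bytes accounted by earlier attempts */
};

/* Assumed completion rule: last return value plus earlier progress. */
static size_t toy_complete(const struct toy_rw *rw, size_t ret)
{
	return ret + rw->bytes_done;
}

/*
 * Transfer 'total' bytes in attempts of at most 'chunk' bytes.
 * With account_before_break set, bytes_done is bumped before the
 * "nothing left" check, so the final ret is counted twice.
 */
static size_t toy_read_loop(size_t total, size_t chunk,
			    bool account_before_break)
{
	struct toy_rw rw = { total, 0 };
	size_t ret;

	if (!chunk)
		return 0;	/* a zero-sized attempt would never finish */

	ret = chunk < total ? chunk : total;	/* initial short read */
	for (;;) {
		if (account_before_break)
			rw.bytes_done += ret;
		rw.remaining -= ret;
		if (!rw.remaining)
			break;
		if (!account_before_break)
			rw.bytes_done += ret;
		/* retry with whatever is left, up to one chunk */
		ret = chunk < rw.remaining ? chunk : rw.remaining;
	}
	return toy_complete(&rw, ret);
}
```

With total = 65536 and chunk = 8192, accounting after the check reports
65536, while accounting before it reports 73728, i.e. one extra chunk.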

diff --git a/fs/io_uring.c b/fs/io_uring.c
index dc1ff47e3221..484c86252f9d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -744,6 +744,7 @@ enum {
 	REQ_F_NOWAIT_READ_BIT,
 	REQ_F_NOWAIT_WRITE_BIT,
 	REQ_F_ISREG_BIT,
+	REQ_F_TRUNCATED_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -797,6 +798,7 @@ enum {
 	REQ_F_REFCOUNT		= BIT(REQ_F_REFCOUNT_BIT),
 	/* there is a linked timeout that has to be armed */
 	REQ_F_ARM_LTIMEOUT	= BIT(REQ_F_ARM_LTIMEOUT_BIT),
+	REQ_F_TRUNCATED		= BIT(REQ_F_TRUNCATED_BIT),
 };
 
 struct async_poll {
@@ -3454,11 +3456,12 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw.kiocb;
-	struct iov_iter __iter, *iter = &__iter;
+	struct iov_iter __i, __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	struct iov_iter_state __state, *state;
 	ssize_t ret, ret2;
+	bool do_restore = false;
 
 	if (rw) {
 		iter = &rw->iter;
@@ -3492,8 +3495,25 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret;
 	}
 
+	if (!(req->flags & REQ_F_TRUNCATED) && !(iov_iter_count(iter) & 4095)) {
+		int nr_vecs;
+
+		__i = *iter;
+		nr_vecs = 1 + (prandom_u32() % iter->nr_segs);
+		iter->nr_segs = nr_vecs;
+		iter->count = nr_vecs * 8192;
+		req->flags |= REQ_F_TRUNCATED;
+		do_restore = true;
+	}
+
 	ret = io_iter_do_read(req, iter);
 
+	if (ret == -EAGAIN) {
+		req->flags &= ~REQ_F_TRUNCATED;
+		*iter = __i;
+		do_restore = false;
+	}
+
 	if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
 		req->flags &= ~REQ_F_REISSUE;
 		/* IOPOLL retry should happen for io-wq threads */
@@ -3513,6 +3533,9 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 
 	iov_iter_restore(iter, state);
 
+	if (do_restore)
+		*iter = __i;
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
@@ -3526,10 +3549,10 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	do {
-		rw->bytes_done += ret;
 		iov_iter_advance(iter, ret);
 		if (!iov_iter_count(iter))
 			break;
+		rw->bytes_done += ret;
 		iov_iter_save_state(iter, state);
 
 		/* if we can retry, do so with the callbacks armed */

-- 
Jens Axboe