Re: [PATCH RFC] iomap: only return IO error if no data has been transferred

Jens Axboe <axboe@xxxxxxxxx> · Wed, 18 Nov 2020 14:19:30 -0700

On 11/18/20 2:15 PM, Dave Chinner wrote:
> On Wed, Nov 18, 2020 at 02:00:06PM -0700, Jens Axboe wrote:
>> On 11/18/20 1:37 PM, Dave Chinner wrote:
>>> On Wed, Nov 18, 2020 at 08:26:50AM -0700, Jens Axboe wrote:
>>>> On 11/18/20 12:19 AM, Dave Chinner wrote:
>>>>> On Tue, Nov 17, 2020 at 03:17:18PM -0700, Jens Axboe wrote:
>>>>>> If we've successfully transferred some data in __iomap_dio_rw(),
>>>>>> don't mark an error for a later segment in the dio.
>>>>>>
>>>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Debugging an issue with io_uring, which uses IOCB_NOWAIT for the
>>>>>> IO. If we do parts of an IO, then once that completes, we still
>>>>>> return -EAGAIN if we ran into a problem later on. That seems wrong;
>>>>>> the normal convention would be to return the short IO instead. For the
>>>>>> -EAGAIN case, io_uring will retry later parts without IOCB_NOWAIT
>>>>>> and complete it successfully.
>>>>>
>>>>> So you are getting a write IO that is split across an allocated
>>>>> extent and a hole, and the second mapping is returning EAGAIN
>>>>> because allocation would be required? This sort of split extent IO
>>>>> is fairly common, so I'm not sure that splitting them into two
>>>>> separate IOs is the best approach.
>>>>
>>>> The case I seem to be hitting is this one:
>>>>
>>>> if (iocb->ki_flags & IOCB_NOWAIT) {
>>>> 	if (filemap_range_has_page(mapping, pos, end)) {
>>>> 		ret = -EAGAIN;
>>>> 		goto out_free_dio;
>>>> 	}
>>>> 	flags |= IOMAP_NOWAIT;
>>>> }
>>>>
>>>> in __iomap_dio_rw(), which isn't something we can detect upfront like IO
>>>> over multiple extents...
>>>
>>> This specific situation cannot result in the partial IO behaviour
>>> you described.  It is an -upfront check- that is done before any IO
>>> is mapped or issued so results in the entire IO being skipped and we
>>> don't get anywhere near the code you changed.
>>>
>>> IOWs, this doesn't explain why you saw a partial IO, or why changing
>>> partial IO return values avoids -EAGAIN from a range we apparently
>>> just did a partial IO over and -didn't have page cache pages-
>>> sitting over it.
>>
>> You are right, I double checked and recreated my debugging. What's
>> triggering is that we're hitting this in xfs_direct_write_iomap_begin()
>> after we've already done some IO:
>>
>> allocate_blocks:
>> 	error = -EAGAIN;
>> 	if (flags & IOMAP_NOWAIT)
>> 		goto out_unlock;
> 
> Ok, that's exactly the case the reproducer I wrote triggers.

OK good, then we're on the same page :-)

>>> Can you provide an actual event trace of the IOs in question that
>>> are failing in your tests (e.g. from something like `trace-cmd
>>> record -e xfs_file\* -e xfs_i\* -e xfs_\*write -e iomap\*` over the
>>> sequence that reproduces the issue) so that there's no ambiguity
>>> over how this problem is occurring in your systems?
>>
>> Let me know if you still want this!
> 
> No, it makes sense now :)

What's the next step here? Are you working on an XFS fix for this?

I was looking at other potential -EAGAIN sources in the loop, and it
seems like we'd hit the same thing if xfs_ilock_for_iomap() fails as
well. I'm not sure how that would be solvable without accepting that
IOCB_NOWAIT reads/writes can be short.
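
To make the short-IO convention concrete, here is a rough userspace
sketch. It is not the iomap code: dio_result() and its parameters are
made up for illustration, and it only models the choice of return value
once the per-extent loop has stopped.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not kernel code: pick the value a NOWAIT direct
 * IO reports once the per-extent loop has stopped.
 *
 *   done: bytes already transferred by earlier segments
 *   err:  0, or a negative errno from the segment that stopped the loop
 */
static ssize_t dio_result(size_t done, int err)
{
	/* Some data moved: report a short IO, caller retries the rest. */
	if (done > 0)
		return (ssize_t)done;

	/* Nothing transferred: the error (e.g. -EAGAIN) is the result. */
	return err;
}
```

With that rule, a write that completes 4096 bytes and then hits -EAGAIN
on the next mapping returns 4096, and only an IO that made no progress
at all returns -EAGAIN.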

-- 
Jens Axboe