Re: [PATCH v4 11/11] iomap: Complete partial direct I/O writes synchronously

Andreas Gruenbacher <agruenba@xxxxxxxxxx> · Wed, 16 May 2018 22:27:18 +0200

On 15 May 2018 at 09:24, Christoph Hellwig <hch@xxxxxx> wrote:
> On Mon, May 14, 2018 at 05:36:24PM +0200, Andreas Gruenbacher wrote:
>> According to xfstest generic/240, applications see, to expect direct I/O
>> writes to either complete as a whole or to fail; short direct I/O writes
>> are apparently not appreciated.  This means that when only part of an
>> asynchronous direct I/O write succeeds, we can either fail the entire
>> write, or we can wait wait for the partial write to complete and retry
>> the remaining write using buffered I/O.  The old __blockdev_direct_IO
>> helper has code for waiting for partial writes to complete; the new
>> iomap_dio_rw iomap helper does not.
>>
>> The above mentioned fallback mode is used by gfs2, which doesn't allow
>> block allocations under direct I/O to avoid taking cluster-wide
>> exclusive locks.  As a consequence, an asynchronous direct I/O write to
>> a file range that ends in a hole will result in a short write.  When
>> that happens, we want to retry the remaining write using buffered I/O.
>>
>> To allow that, change iomap_dio_rw to wait for short direct I/O writes
>> like __blockdev_direct_IO does instead of returning -EIOCBQUEUED.
>>
>> This fixes xfstest generic/240 on gfs2.
>
> The code looks pretty racy to me.

I/O completion triggers when dio->ref reaches zero, so
iomap_dio_bio_end_io will never evaluate submit.waiter before the
atomic_dec_and_test in iomap_dio_rw. We probably need an
smp_mb__before_atomic() before that atomic_dec_and_test to make sure
that iomap_dio_bio_end_io sees the right value of submit.waiter
though. Any concerns beyond that?

> Why would gfs2 cause a short direct I/O write to start with? I suspect that is
> where the problem that needs fixing is burried.

This is caused by the fact that gfs2 takes a deferred rather than an
exclusive cluster-wide lock during direct I/O operations, which allows
concurrent operations. This is critical for performance. In that
locking mode, however, it's only possible to write to preallocated
space, and new allocations are not possible. So user-space requests an
asynchronous direct I/O write into a file range that happens to
contain a hole. iomap_dio_rw is walf-way done with the write when
ops->iomap_begin hits the hole. In that state, the only choice is to
wait for the completion of the partial write and to switch locking
modes before completing the rest of the write. That's what we did
before iomap; the old code did support that. Leaving things unchanged
and completing partial direct I/O writes asynchronously breaks xfstest
generic/240, and likely also user-space.

The only other workaround I can think of would be to check all
asynchronous direct I/O write file ranges for holes before calling
iomap_dio_rw. That would be a major nightmare though.

Thanks,
Andreas