On 15 May 2018 at 09:24, Christoph Hellwig <hch@xxxxxx> wrote: > On Mon, May 14, 2018 at 05:36:24PM +0200, Andreas Gruenbacher wrote: >> According to xfstest generic/240, applications see, to expect direct I/O >> writes to either complete as a whole or to fail; short direct I/O writes >> are apparently not appreciated. This means that when only part of an >> asynchronous direct I/O write succeeds, we can either fail the entire >> write, or we can wait wait for the partial write to complete and retry >> the remaining write using buffered I/O. The old __blockdev_direct_IO >> helper has code for waiting for partial writes to complete; the new >> iomap_dio_rw iomap helper does not. >> >> The above mentioned fallback mode is used by gfs2, which doesn't allow >> block allocations under direct I/O to avoid taking cluster-wide >> exclusive locks. As a consequence, an asynchronous direct I/O write to >> a file range that ends in a hole will result in a short write. When >> that happens, we want to retry the remaining write using buffered I/O. >> >> To allow that, change iomap_dio_rw to wait for short direct I/O writes >> like __blockdev_direct_IO does instead of returning -EIOCBQUEUED. >> >> This fixes xfstest generic/240 on gfs2. > > The code looks pretty racy to me. I/O completion triggers when dio->ref reaches zero, so iomap_dio_bio_end_io will never evaluate submit.waiter before the atomic_dec_and_test in iomap_dio_rw. We probably need an smp_mb__before_atomic() before that atomic_dec_and_test to make sure that iomap_dio_bio_end_io sees the right value of submit.waiter though. Any concerns beyond that? > Why would gfs2 cause a short direct I/O write to start with? I suspect that is > where the problem that needs fixing is burried. This is caused by the fact that gfs2 takes a deferred rather than an exclusive cluster-wide lock during direct I/O operations, which allows concurrent operations. This is critical for performance. In that locking mode, however, it's only possible to write to preallocated space, and new allocations are not possible. So user-space requests an asynchronous direct I/O write into a file range that happens to contain a hole. iomap_dio_rw is walf-way done with the write when ops->iomap_begin hits the hole. In that state, the only choice is to wait for the completion of the partial write and to switch locking modes before completing the rest of the write. That's what we did before iomap; the old code did support that. Leaving things unchanged and completing partial direct I/O writes asynchronously breaks xfstest generic/240, and likely also user-space. The only other workaround I can think of would be to check all asynchronous direct I/O write file ranges for holes before calling iomap_dio_rw. That would be a major nightmare though. Thanks, Andreas