Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 2023/9/20 15:29, Dave Chinner wrote:
On Tue, Sep 19, 2023 at 09:47:34PM +0800, Gao Xiang wrote:

(sorry... add Darrick here...)

Hi Jan,

On 2023/9/19 20:05, Jan Kara wrote:
Hello!

On Tue 19-09-23 14:00:04, Gao Xiang wrote:
Our consumer reports a behavior change between pre-iomap and iomap
direct io conversion:

If the system crashes after an appending write to a file open with
O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
O_SYNC was marked before.

It can be reproduced by a test program in the attachment with
gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger

After some analysis, we found that before iomap direct I/O conversion,
the timing was roughly (taking Linux 3.10 codebase as an example):

	..
	- ext4_file_dio_write
	  - __generic_file_aio_write
	      ..
	    - ext4_direct_IO  # generic_file_direct_write
	      - ext4_ext_direct_IO
	        - ext4_ind_direct_IO  # final_size > inode->i_size
	          - ..
	          - ret = blockdev_direct_IO()
	          - i_size_write(inode, end) # orphan && ret > 0 &&
	                                   # end > inode->i_size
	          - ext4_mark_inode_dirty()
	          - ...
	  - generic_write_sync  # handling O_SYNC

So the dirty inode meta will be committed into journal immediately
if O_SYNC is set.  However, After commit 569342dc2485 ("ext4: move
inode extension/truncate code out from ->iomap_end() callback"),
the new behavior seems as below:

	..
	- ext4_dio_write_iter
	  - ext4_dio_write_checks  # extend = 1
	  - iomap_dio_rw
	      - __iomap_dio_rw
	      - iomap_dio_complete
	        - generic_write_sync
	  - ext4_handle_inode_extension  # extend = 1

So that i_size will be recorded only after generic_write_sync() is
called.  So O_SYNC won't flush the update i_size to the disk.

Indeed, that looks like a bug. Thanks for report!

Thanks for the confirmation!


On the other side, after a quick look of XFS side, it will record
i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
have this problem.

Yes, I'm a bit hazy on the details but I think we've decided to call
ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
from ext4_dio_write_end_io() it was difficult to test in a race-free way
whether extending i_size (and i_disksize) is needed or not (we don't
necessarily hold i_rwsem there). I'll think how we could fix the problem
you've reported.

Given that ext4 can run extent conversion in IO completion, it can
run file extension in IO completion. Yes, that might require
additional synchronisation of file size updates to co-ordinate
submission and completion size checks. XFS just uses a spinlock for
this....

Yes, another concern is O_DSYNC, I'm quite not sure if the behavior
is changed too.

For O_DSYNC, the file size change needs to be covered by the
call to generic_write_sync() as well. O_DSYNC should be thought of
as being essentially the same as O_SYNC except for minor details.

I had a rough feeling that currently iomap DIO behaviors on these are
too strict and might not fit in each specific fs detailed
implementation, tho.


In what way?  iomap implements exactly the data integrity semantics
that are required for O_DSYNC and O_SYNC writes, and it requires
filesystem end_io method to finalize all metadata modifications
needed for data integrity purposes.

Keep in mind that iomap is designed around the requirements async IO
(AIO and io_uring) place on individual IOs: there is no waiting
context to "finish" the IO before userspace is signalled that it is
complete. Hence everything related to data integrity needs to be
done by the filesystem in ->end_io before the iomap completion runs
generic_write_sync() and signals IO completion....

Yes, I understand iomap is well-suited in this model.  Anyway, it's
somewhat out of scope on my side.

Thanks,
Gao Xiang


Cheers,

Dave.



[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux