Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 20 Sep 2023 17:29:01 +1000

On Tue, Sep 19, 2023 at 09:47:34PM +0800, Gao Xiang wrote:
> 
> (sorry... add Darrick here...)
> 
> Hi Jan,
> 
> On 2023/9/19 20:05, Jan Kara wrote:
> > Hello!
> > 
> > On Tue 19-09-23 14:00:04, Gao Xiang wrote:
> > > Our consumer reports a behavior change between pre-iomap and iomap
> > > direct io conversion:
> > > 
> > > If the system crashes after an appending write to a file open with
> > > O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
> > > O_SYNC was marked before.
> > > 
> > > It can be reproduced by a test program in the attachment with
> > > gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
> > > 
> > > After some analysis, we found that before iomap direct I/O conversion,
> > > the timing was roughly (taking Linux 3.10 codebase as an example):
> > > 
> > > 	..
> > > 	- ext4_file_dio_write
> > > 	  - __generic_file_aio_write
> > > 	      ..
> > > 	    - ext4_direct_IO  # generic_file_direct_write
> > > 	      - ext4_ext_direct_IO
> > > 	        - ext4_ind_direct_IO  # final_size > inode->i_size
> > > 	          - ..
> > > 	          - ret = blockdev_direct_IO()
> > > 	          - i_size_write(inode, end) # orphan && ret > 0 &&
> > > 	                                   # end > inode->i_size
> > > 	          - ext4_mark_inode_dirty()
> > > 	          - ...
> > > 	  - generic_write_sync  # handling O_SYNC
> > > 
> > > So the dirty inode meta will be committed into journal immediately
> > > if O_SYNC is set.  However, After commit 569342dc2485 ("ext4: move
> > > inode extension/truncate code out from ->iomap_end() callback"),
> > > the new behavior seems as below:
> > > 
> > > 	..
> > > 	- ext4_dio_write_iter
> > > 	  - ext4_dio_write_checks  # extend = 1
> > > 	  - iomap_dio_rw
> > > 	      - __iomap_dio_rw
> > > 	      - iomap_dio_complete
> > > 	        - generic_write_sync
> > > 	  - ext4_handle_inode_extension  # extend = 1
> > > 
> > > So that i_size will be recorded only after generic_write_sync() is
> > > called.  So O_SYNC won't flush the update i_size to the disk.
> > 
> > Indeed, that looks like a bug. Thanks for report!
> 
> Thanks for the confirmation!
> 
> > 
> > > On the other side, after a quick look of XFS side, it will record
> > > i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
> > > have this problem.
> > 
> > Yes, I'm a bit hazy on the details but I think we've decided to call
> > ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
> > from ext4_dio_write_end_io() it was difficult to test in a race-free way
> > whether extending i_size (and i_disksize) is needed or not (we don't
> > necessarily hold i_rwsem there). I'll think how we could fix the problem
> > you've reported.

Given that ext4 can run extent conversion in IO completion, it can
run file extension in IO completion. Yes, that might require
additional synchronisation of file size updates to co-ordinate
submission and completion size checks. XFS just uses a spinlock for
this....

> Yes, another concern is O_DSYNC, I'm quite not sure if the behavior
> is changed too.

For O_DSYNC, the file size change needs to be covered by the
call to generic_write_sync() as well. O_DSYNC should be thought of
as being essentially the same as O_SYNC except for minor details.

> I had a rough feeling that currently iomap DIO behaviors on these are
> too strict and might not fit in each specific fs detailed
> implementation, tho.

In what way?  iomap implements exactly the data integrity semantics
that are required for O_DSYNC and O_SYNC writes, and it requires
filesystem end_io method to finalize all metadata modifications
needed for data integrity purposes.

Keep in mind that iomap is designed around the requirements async IO
(AIO and io_uring) place on individual IOs: there is no waiting
context to "finish" the IO before userspace is signalled that it is
complete. Hence everything related to data integrity needs to be
done by the filesystem in ->end_io before the iomap completion runs
generic_write_sync() and signals IO completion....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx