Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

Jan Kara <jack@xxxxxxx> · Wed, 20 Sep 2023 17:20:05 +0200

On Wed 20-09-23 17:08:19, Ritesh Harjani wrote:
> Jan Kara <jack@xxxxxxx> writes:
> 
> > Hello!
> >
> > On Tue 19-09-23 14:00:04, Gao Xiang wrote:
> >> Our consumer reports a behavior change between pre-iomap and iomap
> >> direct io conversion:
> >> 
> >> If the system crashes after an appending write to a file open with
> >> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
> >> O_SYNC was marked before.
> >> 
> >> It can be reproduced by a test program in the attachment with
> >> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
> >> 
> >> After some analysis, we found that before iomap direct I/O conversion,
> >> the timing was roughly (taking Linux 3.10 codebase as an example):
> >> 
> >> 	..
> >> 	- ext4_file_dio_write
> >> 	  - __generic_file_aio_write
> >> 	      ..
> >> 	    - ext4_direct_IO  # generic_file_direct_write
> >> 	      - ext4_ext_direct_IO
> >> 	        - ext4_ind_direct_IO  # final_size > inode->i_size
> >> 	          - ..
> >> 	          - ret = blockdev_direct_IO()
> >> 	          - i_size_write(inode, end) # orphan && ret > 0 &&
> >> 	                                   # end > inode->i_size
> >> 	          - ext4_mark_inode_dirty()
> >> 	          - ...
> >> 	  - generic_write_sync  # handling O_SYNC
> >> 
> >> So the dirty inode meta will be committed into journal immediately
> >> if O_SYNC is set.  However, After commit 569342dc2485 ("ext4: move
> >> inode extension/truncate code out from ->iomap_end() callback"),
> >> the new behavior seems as below:
> >> 
> >> 	..
> >> 	- ext4_dio_write_iter
> >> 	  - ext4_dio_write_checks  # extend = 1
> >> 	  - iomap_dio_rw
> >> 	      - __iomap_dio_rw
> >> 	      - iomap_dio_complete
> >> 	        - generic_write_sync
> >> 	  - ext4_handle_inode_extension  # extend = 1
> 
> Yes, since ext4_handle_inode_extension() will handle inode i_disksize
> update and mark the inode dirty, generic_write_sync() call should happen
> after that.
> 
> That also means then we don't have any generic FS testcase which can
> validate this behaviour. 

Yeah.

> >> So that i_size will be recorded only after generic_write_sync() is
> >> called.  So O_SYNC won't flush the update i_size to the disk.
> >
> > Indeed, that looks like a bug. Thanks for report!
> >
> >> On the other side, after a quick look of XFS side, it will record
> >> i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
> >> have this problem.
> >
> > Yes, I'm a bit hazy on the details but I think we've decided to call
> > ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
> > from ext4_dio_write_end_io() it was difficult to test in a race-free way
> > whether extending i_size (and i_disksize) is needed or not (we don't
> > necessarily hold i_rwsem there).
> 
> We do hold i_rwsem in exclusive write mode for file extend case.
> (ext4_dio_write_checks()).

Yes, the case I'm a bit concerned about is that for AIO overwrites we don't
hold i_rwsem at all in iomap_dio_complete() but we will still be performing
checks for file extension so we have to be careful they cannot have false
positives (or some other races) in the unlocked case.

> IIUC, ext4_handle_inode_extension() takes "written" and "count" as it's
> argument. This means that "count" bytes were mapped, but only "written"
> bytes were written. This information is used in
> ext4_handle_inode_extension() case for truncating blocks beyond EOF. 
> 
> I also found this discussion here [1].
> 
> [1]:
> https://lore.kernel.org/linux-ext4/20191008151238.GK5078@xxxxxxxxxxxxxx/

Yeah, thanks for finding this.

> From this thread it looks like we decided to move
> ext4_handle_inode_extension() case out of ->end_io callback after v4
> series (in v5) to handle above case. Right?

Yes, the lack of original IO length in the ->end_io callback was the final
problem that made us move the ext4_handle_inode_extension() call out of
->end_io callback.

> > I'll think how we could fix the problem you've reported.
> >
> 
> 1. I was thinking why do we need to truncate those blocks which are beyond
> EOF for DIO case? Wasn't there an argument, that for short DIO writes,
> we can use the remaining blocks allocated to be written by buffered-io,
> right? Do we risk exposing anything in doing so?
> We do fallback to buffered-io for short writes in ext4_dio_write_iter().

Yes, but as I mentioned in the thread you've referenced if we crash at
unfortunate moment, we will have inodes with blocks beyond EOF which is
not nice as we are "leaking" blocks.

> 2. Either ways let's say we still would like to call truncate. Then can we
> move ext4_truncate() logic out of ext4_handle_inode_extension() and call
> ext4_handle_inode_extension() from within ->end_io callback. 
> ext4_truncate() can be then done in the main routine directly i.e. in
> ext4_dio_write_iter() where we do have both "count" and "written" information.

The truncate itself is not a big deal, as you say that can happen later.
The real question for which we need both "count" and "written" is whether
we can remove the inode from the orphan list in ->end_io callback or not.
For the common case of successful write, we want to do the removal from the
orphan list in the same transaction as the i_disksize update for
performance reasons. So that's why we have to do the decision about
truncation at the place where we update i_disksize.

But I think it shouldn't be a big deal to actually propagate the original
IO size to iomap_dio_end() and ->end_io callback. Let's try that.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR