On Tue 23-01-18 16:37:23, Eryu Guan wrote: > Currently in ext4 direct write path, we update i_disksize only when > new eof is greater than i_size, and don't update it even when new > eof is greater than i_disksize but less than i_size. This doesn't > work well with delalloc buffer write, which updates i_size and > i_disksize only when delalloc blocks are resolved (at writeback > time), the i_disksize from direct write can be lost if a previous > buffer write succeeded at write time but failed at writeback time, > then results in corrupted ondisk inode size. > > Consider this case, first buffer write 4k data to a new file at > offset 16k with delayed allocation, then direct write 4k data to the > same file at offset 4k before delalloc blocks are resolved, which > doesn't update i_disksize because it writes within i_size(20k), but > the extent tree metadata has been committed in journal. Then > writeback of the delalloc blocks fails (due to device error etc.), > and i_size/i_disksize from buffer write can't be written to disk > (still zero). A subsequent umount/mount cycle recovers journal and > writes extent tree metadata from direct write to disk, but with > i_disksize being zero. > > Fix it by updating i_disksize too in direct write path when new eof > is greater than i_disksize but less than i_size, so i_disksize is > always consistent with direct write. > > This fixes occasional i_size corruption in fstests generic/475. > > Signed-off-by: Eryu Guan <eguan@xxxxxxxxxx> > --- > I think this matches what XFS does in direct write too. > > I've tested it by looping generic/475 200 times without hitting a > corruption, usually it fails within 5 iterations for me. Also tested by > full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all > results looked good. Thanks for the patch! It looks good to me. Just when looking at these i_disksize updates and thinking about mixing them with page writeback there seems to be another bug that these i_disksize updates are not protected by ei->i_data_sem (which is what protects i_disksize update in the writeback path). So probably that should be fixed up as well as otherwise I'm not sure we cannot corrupt i_disksize in some funny way when writeback and dio write race... Honza > > fs/ext4/inode.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 534a9130f625..2a75b0aafd31 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3668,7 +3668,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter) > int orphan = 0; > handle_t *handle; > > - if (final_size > inode->i_size) { > + if (final_size > inode->i_size || final_size > ei->i_disksize) { > /* Credits for sb + inode write */ > handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); > if (IS_ERR(handle)) { > @@ -3780,9 +3780,10 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter) > ext4_orphan_del(handle, inode); > if (ret > 0) { > loff_t end = offset + ret; > - if (end > inode->i_size) { > + if (end > inode->i_size || end > ei->i_disksize) { > ei->i_disksize = end; > - i_size_write(inode, end); > + if (end > inode->i_size) > + i_size_write(inode, end); > /* > * We're going to return a positive `ret' > * here due to non-zero-length I/O, so there's > -- > 2.14.3 > -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR