Re: [PATCH] ext4: update i_disksize if direct write past ondisk size

Jan Kara <jack@xxxxxxx> · Thu, 8 Mar 2018 10:00:49 +0100

On Thu 08-03-18 08:58:17, Eryu Guan wrote:
> On Wed, Mar 07, 2018 at 11:11:10AM +0100, Jan Kara wrote:
> > On Tue 23-01-18 16:37:23, Eryu Guan wrote:
> > > Currently in ext4 direct write path, we update i_disksize only when
> > > new eof is greater than i_size, and don't update it even when new
> > > eof is greater than i_disksize but less than i_size. This doesn't
> > > work well with delalloc buffer write, which updates i_size and
> > > i_disksize only when delalloc blocks are resolved (at writeback
> > > time), the i_disksize from direct write can be lost if a previous
> > > buffer write succeeded at write time but failed at writeback time,
> > > then results in corrupted ondisk inode size.
> > > 
> > > Consider this case, first buffer write 4k data to a new file at
> > > offset 16k with delayed allocation, then direct write 4k data to the
> > > same file at offset 4k before delalloc blocks are resolved, which
> > > doesn't update i_disksize because it writes within i_size(20k), but
> > > the extent tree metadata has been committed in journal. Then
> > > writeback of the delalloc blocks fails (due to device error etc.),
> > > and i_size/i_disksize from buffer write can't be written to disk
> > > (still zero). A subsequent umount/mount cycle recovers journal and
> > > writes extent tree metadata from direct write to disk, but with
> > > i_disksize being zero.
> > > 
> > > Fix it by updating i_disksize too in direct write path when new eof
> > > is greater than i_disksize but less than i_size, so i_disksize is
> > > always consistent with direct write.
> > > 
> > > This fixes occasional i_size corruption in fstests generic/475.
> > > 
> > > Signed-off-by: Eryu Guan <eguan@xxxxxxxxxx>
> > > ---
> > > I think this matches what XFS does in direct write too.
> > > 
> > > I've tested it by looping generic/475 200 times without hitting a
> > > corruption, usually it fails within 5 iterations for me. Also tested by
> > > full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all
> > > results looked good.
> > 
> > Thanks for the patch! It looks good to me. Just when looking at these
> 
> Thanks for the review!
> 
> > i_disksize updates and thinking about mixing them with page writeback there
> > seems to be another bug that these i_disksize updates are not protected by
> > ei->i_data_sem (which is what protects i_disksize update in the writeback
> > path). So probably that should be fixed up as well as otherwise I'm not
> > sure we cannot corrupt i_disksize in some funny way when writeback and dio
> > write race...
> 
> It's been a while and I'll get myself refamiliar with this code path and
> look into it. Do you think we can apply this patch as is for now and fix
> the race you mentioned in a follow-up patch? Or I should fix both of
> them in v2?

IMO it would be better to fix the locking in patch 1/2 (the best would be
to make that code use ext4_update_i_disksize() which uses proper locking)
and then apply your patch as 2/2. Because your patch makes i_disksize
updates more frequent especially in the problematic situations where both
DIO and writeback code want to update it and so chances these two updates
race in a wrong way are made higher by your fix. Thanks!

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR