On 2018/12/10 13:10, Theodore Y. Ts'o Wrote: > On Sat, Dec 08, 2018 at 10:25:58PM +0800, zhangyi (F) wrote: >> Current ext4 will call ext4_ext_convert_to_initialized() to split and >> initialize an unwritten extent if someone write something to it. It may >> also zeroout the nearby blocks and expand the split extent if the >> allocated extent is fully inside i_size or new_size. But it may lead to >> inode inconsistency when system crash or the power fails. >> >> Consider the following case: >> - Create an empty file and buffer write from block A to D (with delay >> allocate). It will update the i_size to D. >> - Zero range from part of block B to D. It will allocate an unwritten >> extent from B to D. >> - The write back worker write block B and initialize the unwritten >> extent from B to D, and then update the i_disksize to B. >> - System crash. >> - Remount and fsck complain about the extent size exceeds the inode >> size. >> >> This patch add checking i_disksize and chose the small one between >> i_size to make sure it's safe to convert extent to initialized. >> >> --------------------- >> >> This problem can reproduce by xfstests generic/482 with fsstress seed >> 1544025012. > > Hmm, your explanation is great and the patch makes sense. I haven't > been able to reproduce the problem by adding -s 1544025012 to the > fsstress arguments. This may be because fsstress being run with two > processes (-p 2) and the failure may be timing dependent? > Yes, it is timing dependent and not quite easy to make a simple fast reproducer. The default parameter of fsstress (tested by generic/482) on my machine is: ./ltp/fsstress -w -d /mnt/scratch -n 512 -p 8 -s 1544025012 -v > How easily can you replicate the problem? It is about 2~5% probability to replicate this problem under generic/482 on my machine, and it can also appear in generic/019 and generic/455. After reproducing and checking this problem again, I think the above explanation is not accurate. I simplify the running process in generic/482 and the real case could be: - Create an empty file and buffer write from block A to D (with delay allocate). - Buffer write from X to Z, now the i_size of this inode is updated to Z. - Zero range from part of block B to D, it will allocate an unwritten extent from B to D. Note that it also will skip disksize update in ext4_zero_range() -> ext4_update_disksize_before_punch() because the i_size is large than the end of this zero range. - The write back kworker write block B and initialize the whole unwritten extent from B to D, and then update the i_disksize to the end of B. - ext4_journal_stop() - kjournald2 process weakup and call jbd2_journal_commit_transaction() to commit journal and send FUA. - System crash. - System reboot and fsck complain about the extent size exceeds the inode size. Thanks, Yi.