On Tue 05-09-23 15:58:01, Ojaswin Mujoo wrote: > ** Short Version ** > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we > never zero out the range of bh that is not under write, causing whatever stale > data is present in the folio at that time to be written out to disk. To fix this > mark the buffer as new in _ext4_get_block(), in case it is unwritten. > > ----- > ** Long Version ** > > The issue mentioned above was resulting in two different bugs: > > 1. On block size < page size case in ext4, generic/269 was reliably > failing with dioread_nolock. The state of the write was as follows: > > * The write was extending i_size. > * The last block of the file was fallocated and had an unwritten extent > * We were near ENOSPC and hence we were switching to non-delayed alloc > allocation. > > In this case, the back trace that triggers the bug is as follows: > > ext4_da_write_begin() > /* switch to nodelalloc due to low space */ > ext4_write_begin() > ext4_should_dioread_nolock() // true since mount flags still have delalloc > __block_write_begin(..., ext4_get_block_unwritten) > __block_write_begin_int() > for(each buffer head in page) { > /* first iteration, this is bh1 which contains i_size */ > if (!buffer_mapped) > get_block() /* returns bh with only UNWRITTEN and MAPPED */ > /* second iteration, bh2 */ > if (!buffer_mapped) > get_block() /* we fail here, could be ENOSPC */ > } > if (err) > /* > * this would zero out all new buffers and mark them uptodate. > * Since bh1 was never marked new, we skip it here which causes > * the bug later. > */ > folio_zero_new_buffers(); > /* ext4_wrte_begin() error handling */ > ext4_truncate_failed_write() > ext4_truncate() > ext4_block_truncate_page() > __ext4_block_zero_page_range() > if(!buffer_uptodate()) > ext4_read_bh_lock() > ext4_read_bh() -> ... ext4_submit_bh_wbc() > BUG_ON(buffer_unwritten(bh)); /* !!! */ > > 2. The second issue is stale data exposure with page size >= blocksize > with dioread_nolock. The conditions needed for it to happen are same as > the previous issue ie dioread_nolock around ENOSPC condition. The issue > is also similar where in __block_write_begin_int() when we call > ext4_get_block_unwritten() on the buffer_head and the underlying extent > is unwritten, we get an unwritten and mapped buffer head. Since it is > not new, we never zero out the partial range which is not under write, > thus writing stale data to disk. This can be easily observed with the > following reporducer: > > fallocate -l 4k testfile > xfs_io -c "pwrite 2k 2k" testfile > # hexdump output will have stale data in from byte 0 to 2k in testfile > hexdump -C testfile > > NOTE: To trigger this, we need dioread_nolock enabled and write > happening via ext4_write_begin(), which is usually used when we have -o > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only > alternate way to call ext4_write_begin() is to fill make sure dellayed > alloc switches to nodelalloc (ext4_da_write_begin() calls > ext4_write_begin()). This will usually happen when FS is almost full > like the way generic/269 was triggering it in Issue 1 above. This might > make this issue harder to replicate hence for reliable replicate, I used > the below patch to temporarily allow dioread_nolock with nodelalloc and > then mount the disk with -o nodealloc,dioread_nolock. With this you can > hit the stale data issue 100% of times: > > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode) > if (ext4_should_journal_data(inode)) > return 0; > /* temporary fix to prevent generic/422 test failures */ > - if (!test_opt(inode->i_sb, DELALLOC)) > - return 0; > + // if (!test_opt(inode->i_sb, DELALLOC)) > + // return 0; > return 1; > } > > ------- > > After applying this patch to mark buffer as NEW, both the above issues are > fixed. > > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> Good catch! But I'm wondering whether this is really the right fix. For example in ext4_block_truncate_page() shouldn't we rather be checking whether the buffer isn't unwritten and if yes then bail because there's nothing to zero out in the block? That would seem like a more logical and robust solution of the first problem to me. Regarding the second issue I agree that using buffer_new flag makes the most sense. But it would make most sense to me to put this special logic directly into ext4_get_block_unwritten() because it is really special logic when preparing buffered write via unwritten extent (and it relies on __block_write_begin_int() logic to interpret buffer_new flag in the right way). Putting in _ext4_get_block() seems confusing to me because it raises questions like why should we set it for reads? And why not set it already in ext4_map_blocks() which is also used by iomap? Honza > --- > fs/ext4/inode.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 6c490f05e2ba..a30bfec0b525 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -765,6 +765,10 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock, > if (ret > 0) { > map_bh(bh, inode->i_sb, map.m_pblk); > ext4_update_bh_state(bh, map.m_flags); > + > + if (buffer_unwritten(bh)) > + set_buffer_new(bh); > + > bh->b_size = inode->i_sb->s_blocksize * map.m_len; > ret = 0; > } else if (ret == 0) { > -- > 2.31.1 > -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR