On 2024/12/4 19:13, Jan Kara wrote:
> I'm sorry for the huge delay here...
>

It's fine, I know you've probably been busy lately, and this series has
undergone significant modifications, which requires considerable time to
review. Thanks a lot for taking the time to review this series!

> On Tue 22-10-24 19:10:32, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> When zeroing a range of folios on the filesystem which block size is
>> less than the page size, the file's mapped partial blocks within one
>> page will be marked as unwritten, we should remove writable userspace
>> mappings to ensure that ext4_page_mkwrite() can be called during
>> subsequent write access to these folios. Otherwise, data written by
>> subsequent mmap writes may not be saved to disk.
>>
>> $mkfs.ext4 -b 1024 /dev/vdb
>> $mount /dev/vdb /mnt
>> $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
>>         -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
>>         -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
>>
>> $od -Ax -t x1z /mnt/foo
>> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>> *
>> 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
>> *
>> 001000
>>
>> $umount /mnt && mount /dev/vdb /mnt
>> $od -Ax -t x1z /mnt/foo
>> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>> *
>> 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> *
>> 001000
>>
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> This is a great catch! I think this may be source of the sporadic data
> corruption issues we observe with blocksize < pagesize.
>
>> +static inline void ext4_truncate_folio(struct inode *inode,
>> +                                       loff_t start, loff_t end)
>> +{
>> +        unsigned long blocksize = i_blocksize(inode);
>> +        struct folio *folio;
>> +
>> +        if (round_up(start, blocksize) >= round_down(end, blocksize))
>> +                return;
>> +
>> +        folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
>> +        if (IS_ERR(folio))
>> +                return;
>> +
>> +        if (folio_mkclean(folio))
>> +                folio_mark_dirty(folio);
>> +        folio_unlock(folio);
>> +        folio_put(folio);
>
> I don't think this is enough. In your example from the changelog, this would
> leave the page at index 0 dirty and still with 0x5a values in 2048-4096 range.
> Then truncate_pagecache_range() does nothing, ext4_alloc_file_blocks()
> converts blocks under 2048-4096 to unwritten state. But what handles
> zeroing of page cache in 2048-4096 range? ext4_zero_partial_blocks() zeroes
> only partial blocks, not full blocks. Am I missing something?
>

Sorry, I don't understand why truncate_pagecache_range() would do nothing.
In my example, the variable 'start' is 2048, the variable 'end' is 4096,
and the call chain truncate_pagecache_range(inode, 2048, 4096-1)->...->
truncate_inode_partial_folio()->folio_zero_range() does zero the 2048-4096
range. I also tested it as below, and the range was zeroed.

 xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
        -c "mwrite -S 0x5a 2048 2048" \
        -c "fzero 2048 2048" -c "close" /mnt/foo

 od -Ax -t x1z /mnt/foo
 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  >XXXXXXXXXXXXXXXX<
 *
 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
 *
 001000

> If I'm right, I'd keep it simple and just writeout these partial folios with
> filemap_write_and_wait_range() and expand the range
> truncate_pagecache_range() removes to include these partial folios. The
> overhead won't be big and it isn't like this is some very performance
> sensitive path.

What I mean is that truncate_pagecache_range() has already covered the
partial folios, right?
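That said, if we do go with your simpler approach, my understanding of it
is roughly the sketch below. This is only how I read the suggestion, not
code from this series; the helper name, the page-aligned expansion and the
error handling are all my guesses:

/*
 * Sketch only: write the partial folios back first, then expand the range
 * passed to truncate_pagecache_range() to page boundaries, so the partial
 * folios are dropped from the page cache entirely instead of being zeroed
 * in place.
 */
static int ext4_writeout_and_truncate_range(struct inode *inode,
                                            loff_t start, loff_t end)
{
        struct address_space *mapping = inode->i_mapping;
        loff_t aligned_start = round_down(start, PAGE_SIZE);
        loff_t aligned_end = round_up(end, PAGE_SIZE);
        int ret;

        /*
         * Flush the whole expanded range for simplicity; this also covers
         * the partial folios at both ends of [start, end).
         */
        ret = filemap_write_and_wait_range(mapping, aligned_start,
                                           aligned_end - 1);
        if (ret)
                return ret;

        /* Remove the folios covering [aligned_start, aligned_end). */
        truncate_pagecache_range(inode, aligned_start, aligned_end - 1);
        return 0;
}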
>
>> +}
>> +
>> +/*
>> + * When truncating a range of folios, if the block size is less than the
>> + * page size, the file's mapped partial blocks within one page could be
>> + * freed or converted to unwritten. We should call this function to remove
>> + * writable userspace mappings so that ext4_page_mkwrite() can be called
>> + * during subsequent write access to these folios.
>> + */
>> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)
>
> Maybe call this ext4_truncate_page_cache_block_range()? And assert that
> start & end are block aligned. Because this essentially prepares page cache
> for manipulation with a block range.

Ha, it's a good idea. I agree with you about moving truncate_pagecache_range()
and the hunk that does the flushing in journal data mode into this function.
But I don't understand why we should assert that 'start' and 'end' are block
aligned. I think ext4_truncate_page_cache_block_range() should allow passing
unaligned input parameters and align them itself; especially after patches 04
and 05, ext4_zero_range() and ext4_punch_hole() will pass offset and
offset+len directly, which may be block unaligned. I've put a rough sketch of
what I have in mind at the end of this mail.

Thanks,
Yi.

>
>> +{
>> +        unsigned long blocksize = i_blocksize(inode);
>> +
>> +        if (end > inode->i_size)
>> +                end = inode->i_size;
>> +        if (start >= end || blocksize >= PAGE_SIZE)
>> +                return;
>> +
>> +        ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end));
>> +        if (end > round_up(start, PAGE_SIZE))
>> +                ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end);
>> +}
>
> So I'd move the following truncate_pagecache_range() into
> ext4_truncate_folios_range(). And also the preceding:
>
>         /*
>          * For journalled data we need to write (and checkpoint) pages
>          * before discarding page cache to avoid inconsitent data on
>          * disk in case of crash before zeroing trans is committed.
>          */
>         if (ext4_should_journal_data(inode)) {
>                 ret = filemap_write_and_wait_range(mapping, start,
>                                                    end - 1);
>         ...
>
> into this function. So that it can be self-contained "do the right thing
> with page cache to prepare for block range manipulations".
>
>                                                                 Honza
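As mentioned above, here is the rough, untested sketch of the combined
helper I have in mind. It just stitches together the pieces quoted in this
mail and reuses ext4_truncate_folio() from this patch; the int return type,
the exact clamping against i_size and the error handling are only my
current guesses and may change in the next version:

/*
 * Sketch only: prepare the page cache for a block range manipulation on
 * [start, end). 'start' and 'end' may be block unaligned.
 */
static int ext4_truncate_page_cache_block_range(struct inode *inode,
                                                loff_t start, loff_t end)
{
        struct address_space *mapping = inode->i_mapping;
        int ret;

        /*
         * For journalled data we need to write (and checkpoint) pages
         * before discarding the page cache, to avoid inconsistent data
         * on disk in case of a crash before the zeroing transaction is
         * committed.
         */
        if (ext4_should_journal_data(inode)) {
                ret = filemap_write_and_wait_range(mapping, start, end - 1);
                if (ret)
                        return ret;
        }

        /*
         * If the block size is smaller than the page size, remove the
         * writable userspace mappings of the partial folios at both ends
         * of the range, so that ext4_page_mkwrite() is called on the
         * next mmap write to them.
         */
        if (i_blocksize(inode) < PAGE_SIZE) {
                loff_t e = min_t(loff_t, end, i_size_read(inode));

                if (start < e) {
                        ext4_truncate_folio(inode, start,
                                min_t(loff_t, round_up(start, PAGE_SIZE), e));
                        if (e > round_up(start, PAGE_SIZE))
                                ext4_truncate_folio(inode,
                                        round_down(e, PAGE_SIZE), e);
                }
        }

        /* Zero the partial folios and drop the fully covered ones. */
        truncate_pagecache_range(inode, start, end - 1);
        return 0;
}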