I'm sorry for the huge delay here...

On Tue 22-10-24 19:10:32, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> 
> When zeroing a range of folios on a filesystem whose block size is
> less than the page size, the file's mapped partial blocks within one
> page will be marked as unwritten. We should remove writable userspace
> mappings to ensure that ext4_page_mkwrite() can be called during
> subsequent write access to these folios. Otherwise, data written by
> subsequent mmap writes may not be saved to disk.
> 
>  $mkfs.ext4 -b 1024 /dev/vdb
>  $mount /dev/vdb /mnt
>  $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
>          -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
>          -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
> 
>  $od -Ax -t x1z /mnt/foo
>  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>  *
>  000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
>  *
>  001000
> 
>  $umount /mnt && mount /dev/vdb /mnt
>  $od -Ax -t x1z /mnt/foo
>  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>  *
>  000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  *
>  001000
> 
> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>

This is a great catch! I think this may be the source of the sporadic
data corruption issues we observe with blocksize < pagesize.

> +static inline void ext4_truncate_folio(struct inode *inode,
> +				       loff_t start, loff_t end)
> +{
> +	unsigned long blocksize = i_blocksize(inode);
> +	struct folio *folio;
> +
> +	if (round_up(start, blocksize) >= round_down(end, blocksize))
> +		return;
> +
> +	folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
> +	if (IS_ERR(folio))
> +		return;
> +
> +	if (folio_mkclean(folio))
> +		folio_mark_dirty(folio);
> +	folio_unlock(folio);
> +	folio_put(folio);

I don't think this is enough. In your example from the changelog, this
would leave the page at index 0 dirty and still with 0x5a values in the
2048-4096 range. Then truncate_pagecache_range() does nothing, and
ext4_alloc_file_blocks() converts the blocks under 2048-4096 to the
unwritten state. But what handles zeroing of the page cache in the
2048-4096 range? ext4_zero_partial_blocks() zeroes only partial blocks,
not full blocks. Am I missing something?

If I'm right, I'd keep it simple and just write out these partial folios
with filemap_write_and_wait_range() and expand the range that
truncate_pagecache_range() removes to include these partial folios. The
overhead won't be big, and it isn't as if this is a very performance
sensitive path.
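Something like the completely untested sketch below is the shape I have
in mind; the helper name, the page-alignment check, and the error
handling are just placeholders for illustration, not a tested
implementation:

static int ext4_prep_page_cache_for_block_range(struct inode *inode,
						loff_t start, loff_t end)
{
	struct address_space *mapping = inode->i_mapping;
	int ret;

	/*
	 * Write out folios that are only partially covered by the range
	 * so that dirty data outside of [start, end) is safely on disk
	 * before we drop those folios from the page cache.
	 */
	if (!IS_ALIGNED(start | end, PAGE_SIZE)) {
		ret = filemap_write_and_wait_range(mapping,
				round_down(start, PAGE_SIZE),
				round_up(end, PAGE_SIZE) - 1);
		if (ret)
			return ret;
	}
	/*
	 * Remove the whole range, expanded to include the partial folios,
	 * so that any later mmap write has to go through
	 * ext4_page_mkwrite() again.
	 */
	truncate_pagecache_range(inode, round_down(start, PAGE_SIZE),
				 round_up(end, PAGE_SIZE) - 1);
	return 0;
}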
> +}
> +
> +/*
> + * When truncating a range of folios, if the block size is less than the
> + * page size, the file's mapped partial blocks within one page could be
> + * freed or converted to unwritten. We should call this function to remove
> + * writable userspace mappings so that ext4_page_mkwrite() can be called
> + * during subsequent write access to these folios.
> + */
> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)

Maybe call this ext4_truncate_page_cache_block_range()? And assert that
start & end are block aligned, because this essentially prepares the page
cache for manipulation of a block range.

> +{
> +	unsigned long blocksize = i_blocksize(inode);
> +
> +	if (end > inode->i_size)
> +		end = inode->i_size;
> +	if (start >= end || blocksize >= PAGE_SIZE)
> +		return;
> +
> +	ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end));
> +	if (end > round_up(start, PAGE_SIZE))
> +		ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end);
> +}

So I'd move the following truncate_pagecache_range() into
ext4_truncate_folios_range(), and also the preceding:

	/*
	 * For journalled data we need to write (and checkpoint) pages
	 * before discarding page cache to avoid inconsistent data on
	 * disk in case of crash before zeroing trans is committed.
	 */
	if (ext4_should_journal_data(inode)) {
		ret = filemap_write_and_wait_range(mapping, start, end - 1);
		...

into this function, so that it is a self-contained "do the right thing
with the page cache to prepare for block range manipulations".

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR