I'm sorry for the huge delay here...

On Tue 22-10-24 19:10:32, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> 
> When zeroing a range of folios on a filesystem whose block size is
> less than the page size, the file's mapped partial blocks within one
> page will be marked as unwritten. We should remove writable userspace
> mappings to ensure that ext4_page_mkwrite() can be called during
> subsequent write access to these folios. Otherwise, data written by
> subsequent mmap writes may not be saved to disk.
> 
>  $mkfs.ext4 -b 1024 /dev/vdb
>  $mount /dev/vdb /mnt
>  $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
>          -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
>          -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
> 
>  $od -Ax -t x1z /mnt/foo
>  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>  *
>  000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
>  *
>  001000
> 
>  $umount /mnt && mount /dev/vdb /mnt
>  $od -Ax -t x1z /mnt/foo
>  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>  *
>  000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  *
>  001000
> 
> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>

This is a great catch! I think this may be the source of the sporadic
data corruption issues we observe with blocksize < pagesize.

> +static inline void ext4_truncate_folio(struct inode *inode,
> +				       loff_t start, loff_t end)
> +{
> +	unsigned long blocksize = i_blocksize(inode);
> +	struct folio *folio;
> +
> +	if (round_up(start, blocksize) >= round_down(end, blocksize))
> +		return;
> +
> +	folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
> +	if (IS_ERR(folio))
> +		return;
> +
> +	if (folio_mkclean(folio))
> +		folio_mark_dirty(folio);
> +	folio_unlock(folio);
> +	folio_put(folio);

I don't think this is enough. In your example from the changelog, this
would leave the page at index 0 dirty and still with 0x5a values in the
2048-4096 range. Then truncate_pagecache_range() does nothing, and
ext4_alloc_file_blocks() converts the blocks under 2048-4096 to the
unwritten state. But what handles zeroing of the page cache in the
2048-4096 range? ext4_zero_partial_blocks() zeroes only partial blocks,
not full blocks. Am I missing something?

If I'm right, I'd keep it simple and just write out these partial folios
with filemap_write_and_wait_range() and expand the range that
truncate_pagecache_range() removes to include these partial folios. The
overhead won't be big, and it isn't as if this is a very performance
sensitive path.
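Something like the completely untested sketch below is the shape I have
in mind; the helper name, the page-alignment check, and the error
handling are just placeholders for illustration, not a tested
implementation:

static int ext4_prep_page_cache_for_block_range(struct inode *inode,
						loff_t start, loff_t end)
{
	struct address_space *mapping = inode->i_mapping;
	int ret;

	/*
	 * Write out folios that are only partially covered by the range
	 * so that dirty data outside of [start, end) is safely on disk
	 * before we drop those folios from the page cache.
	 */
	if (!IS_ALIGNED(start | end, PAGE_SIZE)) {
		ret = filemap_write_and_wait_range(mapping,
				round_down(start, PAGE_SIZE),
				round_up(end, PAGE_SIZE) - 1);
		if (ret)
			return ret;
	}
	/*
	 * Remove the whole range, expanded to include the partial folios,
	 * so that any later mmap write has to go through
	 * ext4_page_mkwrite() again.
	 */
	truncate_pagecache_range(inode, round_down(start, PAGE_SIZE),
				 round_up(end, PAGE_SIZE) - 1);
	return 0;
}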
> +}
> +
> +/*
> + * When truncating a range of folios, if the block size is less than the
> + * page size, the file's mapped partial blocks within one page could be
> + * freed or converted to unwritten. We should call this function to remove
> + * writable userspace mappings so that ext4_page_mkwrite() can be called
> + * during subsequent write access to these folios.
> + */
> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)

Maybe call this ext4_truncate_page_cache_block_range()? And assert that
start & end are block aligned, because this essentially prepares the page
cache for manipulation of a block range.

> +{
> +	unsigned long blocksize = i_blocksize(inode);
> +
> +	if (end > inode->i_size)
> +		end = inode->i_size;
> +	if (start >= end || blocksize >= PAGE_SIZE)
> +		return;
> +
> +	ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end));
> +	if (end > round_up(start, PAGE_SIZE))
> +		ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end);
> +}

So I'd move the following truncate_pagecache_range() into
ext4_truncate_folios_range(), and also the preceding:

	/*
	 * For journalled data we need to write (and checkpoint) pages
	 * before discarding page cache to avoid inconsistent data on
	 * disk in case of crash before zeroing trans is committed.
	 */
	if (ext4_should_journal_data(inode)) {
		ret = filemap_write_and_wait_range(mapping, start, end - 1);
		...

into this function, so that it is a self-contained "do the right thing
with the page cache to prepare for block range manipulations".

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR