Re: [PATCH 14/16] xfs: align writepages to large block sizes

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 14 Nov 2018 09:19:26 -0500

On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> For data integrity purposes, we need to write back the entire
> filesystem block when asked to sync a sub-block range of the file.
> When the filesystem block size is larger than the page size, this
> means we need to convert single page integrity writes into whole
> block integrity writes. We do this by extending the writepage range
> to filesystem block granularity and alignment.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_aops.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index f6ef9e0a7312..5334f16be166 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -900,6 +900,7 @@ xfs_vm_writepages(
>  		.io_type = XFS_IO_HOLE,
>  	};
>  	int			ret;
> +	unsigned		bsize =	i_blocksize(mapping->host);
>  
>  	/*
>  	 * Refuse to write pages out if we are called from reclaim context.
> @@ -922,6 +923,19 @@ xfs_vm_writepages(
>  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>  		return 0;
>  
> +	/*
> +	 * If the block size is larger than page size, extent the incoming write
> +	 * request to fsb granularity and alignment. This is a requirement for
> +	 * data integrity operations and it doesn't hurt for other write
> +	 * operations, so do it unconditionally.
> +	 */
> +	if (wbc->range_start)
> +		wbc->range_start = round_down(wbc->range_start, bsize);
> +	if (wbc->range_end != LLONG_MAX)
> +		wbc->range_end = round_up(wbc->range_end, bsize);
> +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> +

This latter bit causes endless writeback loops in tests such as
generic/475 (I think I reproduced it with xfs/141 as well). The
writeback infrastructure samples ->nr_to_write before and after
->writepages() calls to identify progress. Unconditionally bumping it to
something larger than the original value can lead to an underflow in the
writeback code that seems to throw things off. E.g., see the following
wb tracepoints (w/ 4k block and page size):

   kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
   kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295

The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
the same basic calculation for 'wrote.'

BTW, I haven't gone through the broader set, but just looking at this
bit what's the purpose of rounding ->nr_to_write (which is a page count)
to a block size in the first place?

Brian

>  	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
>  	ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
>  	if (wpc.ioend)
> -- 
> 2.19.1
>