Re: [PATCH v4] xfs: rework zero range to prevent invalid i_size updates

Brian Foster <bfoster@xxxxxxxxxx> · Fri, 17 Oct 2014 16:20:20 -0400

On Mon, Oct 13, 2014 at 01:11:25PM -0400, Brian Foster wrote:
> The zero range operation is analogous to fallocate with the exception of
> converting the range to zeroes. E.g., it attempts to allocate zeroed
> blocks over the range specified by the caller. The XFS implementation
> kills all delalloc blocks currently over the aligned range, converts the
> range to allocated zero blocks (unwritten extents) and handles the
> partial pages at the ends of the range by sending writes through the
> pagecache.
> 
> The current implementation suffers from several problems associated with
> inode size. If the aligned range covers an extending I/O, said I/O is
> discarded and an inode size update from a previous write never makes it
> to disk. Further, if an unaligned zero range extends beyond eof, the
> page write induced for the partial end page can itself increase the
> inode size, even if the zero range request is not supposed to update
> i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
> 
> The latter behavior not only incorrectly increases the inode size, but
> can lead to stray delalloc blocks on the inode. Typically, post-eof
> preallocation blocks are either truncated on release or inode eviction
> or explicitly written to by xfs_zero_eof() on natural file size
> extension. If the inode size increases due to zero range, however,
> associated blocks leak into the address space having never been
> converted or mapped to pagecache pages. A direct I/O to such an
> uncovered range cannot convert the extent via writeback and will BUG().
> For example:
> 
> $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
> ...
> $ xfs_io -d -c "pread 128k 128k" <file>
> <BUG>
> 
> If the entire delalloc extent happens to not have page coverage
> whatsoever (e.g., delalloc conversion couldn't find a large enough free
> space extent), even a full file writeback won't convert what's left of
> the extent and we'll assert on inode eviction.
> 
> Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
> Use the existing hole punch and prealloc mechanisms as primitives for
> zero range. We punch out the pagecache beforehand to eliminate
> unnecessary writeback. The hole punch mechanism handles partial block
> zeroing for us and facilitates the use of a single prealloc call over
> the entire range, which increases the odds of contiguous allocation.
> 
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
> 
> This is v4, but effectively an alternate implementation to v3 that
> reduces zero range to a hole punch and preallocate operation. I
> originally reproduced a hang with v3 + my unwritten conversion helper
> patch that I thought somehow related to said helper, but I have seen
> what appears to be the same thing with this patch after a couple hundred
> cycles of generic/269. That leads me to believe that there is a more
> general problem perhaps related to the increased extent manipulation
> going on in either implementation (whether for conversion or removal). I
> don't reproduce such a failure with focused hole punch testing on tot,
> but I do reproduce other assert failures that I suspect just mask the
> ability to reproduce this one.
> 
> In short, all three of these configurations pass basic correctness tests
> for me. v3 has been the most reliable in stress tests. v3+unwritten
> conversion and v4 seem to reproduce this hang after a while. v4 is
> probably the most simple implementation of the three.
> 
> Brian
> 
> v4:
> - Simplify the implementation to use hole punch.
> v3: http://oss.sgi.com/archives/xfs/2014-10/msg00149.html
> - Pass length to xfs_alloc_file_space() rather than end offset.
> - Split up start/end page writeback branches.
> - Fix up a bunch of comments.
> v2: http://oss.sgi.com/archives/xfs/2014-10/msg00138.html
> - Refactor the logic to punch out pagecache/delalloc first and do
>   allocation last to prevent stray delalloc on ENOSPC.
> v1: http://oss.sgi.com/archives/xfs/2014-10/msg00052.html
> 
>  fs/xfs/xfs_bmap_util.c | 95 +++++++++++++++++++++++++++++---------------------
>  1 file changed, 55 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 92e8f99..8d178fc 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1338,7 +1338,10 @@ xfs_free_file_space(
>  	goto out;
>  }
>  
> -
> +/*
> + * Preallocate and zero a range of a file. This mechanism has the allocation
> + * semantics of fallocate and in addition converts data in the range to zeroes.
> + */
>  int
>  xfs_zero_file_space(
>  	struct xfs_inode	*ip,
> @@ -1346,65 +1349,77 @@ xfs_zero_file_space(
>  	xfs_off_t		len)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> -	uint			granularity;
> +	uint			blksize;
>  	xfs_off_t		start_boundary;
>  	xfs_off_t		end_boundary;
>  	int			error;
> +	loff_t			eof;
>  
>  	trace_xfs_zero_file_space(ip);
>  
> -	granularity = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
> +	blksize = 1 << mp->m_sb.sb_blocklog;
>  
>  	/*
> -	 * Round the range of extents we are going to convert inwards.  If the
> -	 * offset is aligned, then it doesn't get changed so we zero from the
> -	 * start of the block offset points to.
> +	 * Align the range inward to page size. This represents the range of
> +	 * pages that can be tossed, even if dirty.
>  	 */
> -	start_boundary = round_up(offset, granularity);
> -	end_boundary = round_down(offset + len, granularity);
> +	start_boundary = round_up(offset, PAGE_CACHE_SIZE);
> +	end_boundary = round_down(offset + len, PAGE_CACHE_SIZE);
>  
>  	ASSERT(start_boundary >= offset);
>  	ASSERT(end_boundary <= offset + len);
>  
> -	if (start_boundary < end_boundary - 1) {
> +	/*
> +	 * If the range covers one or more full pages, punch out the pagecache
> +	 * and any delalloc blocks over the range. This is an optimization to
> +	 * prevent writeback and delalloc extent conversion over the zeroed
> +	 * range via the hole punch or prealloc calls.
> +	 *
> +	 * We only handle the page aligned range here because this function does
> +	 * not handle the partial block zeroing necessary to keep the cache and
> +	 * on-disk data consistent. That is the responsibility of the
> +	 * xfs_free_file_space() call below.
> +	 */
> +	if (end_boundary > start_boundary) {
>  		/*
> -		 * Writeback the range to ensure any inode size updates due to
> -		 * appending writes make it to disk (otherwise we could just
> -		 * punch out the delalloc blocks).
> +		 * Flush the eof page first if it falls within the range so we
> +		 * do not lose i_size updates.
>  		 */
> -		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> -				start_boundary, end_boundary - 1);
> -		if (error)
> -			goto out;
> +		eof = round_down(i_size_read(VFS_I(ip)) - 1, PAGE_CACHE_SIZE);
> +		if (i_size_read(VFS_I(ip)) > ip->i_d.di_size &&
> +		    eof >= start_boundary && eof <= end_boundary)
> +			filemap_write_and_wait_range(VFS_I(ip)->i_mapping, eof,
> +						     -1);
>  		truncate_pagecache_range(VFS_I(ip), start_boundary,
>  					 end_boundary - 1);
>  
> -		/* convert the blocks */
> -		error = xfs_alloc_file_space(ip, start_boundary,
> -					end_boundary - start_boundary - 1,
> -					XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT);
> -		if (error)
> -			goto out;
> -
> -		/* We've handled the interior of the range, now for the edges */
> -		if (start_boundary != offset) {
> -			error = xfs_iozero(ip, offset, start_boundary - offset);
> -			if (error)
> -				goto out;
> -		}
> -
> -		if (end_boundary != offset + len)
> -			error = xfs_iozero(ip, end_boundary,
> -					   offset + len - end_boundary);
> -
> -	} else {
> -		/*
> -		 * It's either a sub-granularity range or the range spanned lies
> -		 * partially across two adjacent blocks.
> -		 */
> -		error = xfs_iozero(ip, offset, len);
> +		xfs_ilock(ip, XFS_ILOCK_EXCL);
> +		error = xfs_bmap_punch_delalloc_range(ip,
> +				XFS_B_TO_FSBT(mp, start_boundary),
> +				XFS_B_TO_FSB(mp, end_boundary - start_boundary));
> +		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	}

It appears that the tp->t_blk_res overrun assert is also related to
delalloc punching. I'm basically seeing a writepage attempted for a page
over a delalloc extent (imap in writepage reflects this), but at some
point before we get into the allocation (I suspect between where we
release and reacquire ilock in xfs_map_blocks() to
iomap_write_allocate()) the delalloc blocks that relate this particular
page to the extent disappear. An in-core extent dump after doing the
allocation but before inserting into the extent tree shows the range to
be allocated as a hole. Indeed, the allocator saw a hole (wasdel == 0)
and ended up doing the full allocation while the transaction had a
reservation for a delalloc conversion, which we end up overrunning.

It's not quite clear to me how this is happening. I would expect
truncate_pagecache_range() to serialize against writeback via page lock.
Regardless, the scenario of writing back pages into unallocated space
that writeback thinks is covered by delayed allocation suggests
something racing with delalloc punching. The problem seems to disappear
if I comment it out from zero range, so we might want to kill the above
hunk entirely for the time being.

Brian

>  
> +	/*
> +	 * Punch a hole and prealloc the range. We use hole punch rather than
> +	 * unwritten extent conversion for two reasons:
> +	 *
> +	 * 1.) Hole punch handles partial block zeroing for us. Note that we've
> +	 * already tossed pagecache over the aligned range so we won't write
> +	 * back too much data (only unaligned start/end pages to avoid races
> +	 * with uncached I/O).
> +	 *
> +	 * 2.) If prealloc returns ENOSPC, the file range is still zero-valued
> +	 * by virtue of the hole punch.
> +	 */
> +	error = xfs_free_file_space(ip, offset, len);
> +	if (error)
> +		goto out;
> +
> +	error = xfs_alloc_file_space(ip, round_down(offset, blksize),
> +				     round_up(offset + len, blksize) -
> +				     round_down(offset, blksize),
> +				     XFS_BMAPI_PREALLOC);
>  out:
>  	return error;
>  
> -- 
> 1.8.3.1
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs