On Fri, Apr 27, 2012 at 07:45:21PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> When a partial write inside EOF fails, it can leave delayed
> allocation blocks lying around because they don't get punched back
> out. This leads to assert failures like:
>
> XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847
>
> when evicting inodes from the cache. This can be trivially triggered
> by xfstests 083, which takes between 5 and 15 executions on a 512
> byte block size filesystem to trip over this. Debugging shows a
> failed write due to ENOSPC calling xfs_vm_write_failed such as:
>
> [ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae
>
> and no action is taken on it. This leaves behind a delayed
> allocation extent that has no page covering it and no data in it:
>
> [ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0
> [ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1
> [ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b
> [ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1
>                                 ^^^^^^^^^^^^^^^
> [ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd
> [ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa
>
> So the delayed allocation extent is one block long at offset
> 0x16c00. Tracing shows that a bigger write:
>
> xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags
>
> allocates the block, and then fails with ENOSPC trying to allocate
> the last block on the page, leading to a failed write with stale
> delalloc blocks on it.
>
> Because we've had an ENOSPC when trying to allocate 0x16e00, it
> means that we are never goinge to call ->write_end on the page and

                          going

> so the allocated new buffer will not get marked dirty or have the
> buffer_new state cleared.
> In other words, what the above write is
> supposed to end up with is this mapping for the page:
>
>  +------+------+------+------+------+------+------+------+
>    UMA    UMA    UMA    UMA    UMA    UMA    UND    FAIL
>
> where:	U    = uptodate
>	M    = mapped
>	N    = new
>	A    = allocated
>	D    = delalloc
>	FAIL = block we ENOSPC'd on.
>
> and the key point being the buffer_new() state for the newly
> allocated delayed allocation block. Except it doesn't - we're not
> marking buffers new correctly.
>
> That buffer_new() problem goes back to the xfs_iomap removal days,
> where xfs_iomap() used to return a "new" status for any map with
> newly allocated blocks, so that __xfs_get_blocks() could call
> set_buffer_new() on it. We still have the "new" variable and the
> check for it in the set_buffer_new() logic - except we never set it
> now!
>
> Hence that newly allocated delalloc block doesn't have the new flag
> set on it, so when the write fails we cannot tell which blocks we
> are supposed to punch out. WHy do we need the buffer_new flag? Well,

                             Why

> that's because we can have this case:
>
>  +------+------+------+------+------+------+------+------+
>    UMD    UMD    UMD    UMD    UMD    UMD    UND    FAIL
>
> where all the UMD buffers contain valid data from a previously
> successful write() system call. We only want to punch the UND buffer
> because that's the only one that we added in this write and it was
> only this write that failed.
>
> That implies that even the old buffer_new() logic was wrong -
> because it would result in all those UMD buffers on the page having
> set_buffer_new() called on them even though they aren't new. Hence
> we shoul donly be calling set_buffer_new() for delalloc buffers that

     should only

> were allocated (i.e. were a hole before xfs_iomap_write_delay() was
> called).
>
> So, fix this set_buffer_new logic according to how we need it to
> work for handling failed writes correctly.
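To make sure I followed the punch policy described above, here is a
stand-alone sketch in plain C (not kernel code; `should_punch` and
`struct buf_state` are made-up names standing in for the buffer_head
flag tests the commit message describes):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the buffer_head state bits discussed above. */
struct buf_state {
	bool delay;	/* buffer_delay(): delalloc reservation */
	bool bnew;	/* buffer_new(): reserved by this write */
};

/*
 * Punch a buffer only if it is delalloc AND either it was newly
 * reserved by the failed write (buffer_new) or it sits beyond EOF.
 * UMD buffers inside EOF from earlier successful writes survive.
 */
static bool
should_punch(const struct buf_state *bh, long long block_offset,
	     long long isize)
{
	if (!bh->delay)
		return false;		/* real/unmapped block: not ours to punch */
	if (bh->bnew)
		return true;		/* the UND buffer added by this write */
	return block_offset >= isize;	/* stale delalloc beyond EOF */
}
```

So in the UMD/UND picture, only the UND buffer and any delalloc buffers
beyond EOF get punched, which matches the fix.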
> Also, restore the new
> buffer logic handling for blocks allocated via
> xfs_iomap_write_direct(), because it should still set the buffer_new
> flag appropriately for newly allocated blocks, too.
>
> So, now we have the buffer_new() being set appropriately in
> __xfs_get_blocks(), we can detect the exact delalloc ranges that
> we allocated in a failed write, and hence can now do a walk of the
> buffers on a page to find them.
>
> Except, it's not that easy. When block_write_begin() fails, it
> unlocks and releases the page that we just had an error on, so we
> can't use that page to handle errors anymore. We have to get access
> to the page while it is still locked to walk the buffers. Hence we
> have to open code block_write_begin() in xfs_vm_write_begin() to be
> able to insert xfs_vm_write_failed() in the right place.
>
> With that, we can pass the page and write range to
> xfs_vm_write_failed() and walk the buffers on the page, looking for
> delalloc buffers that are either new or beyond EOF and punch them
> out. Handling buffers beyond EOF ensures we still handle the
> existing case that xfs_vm_write_failed() handles.
>
> Of special note is the truncate_pagecache() handling - that should
> only be done for pages outside EOF - pages within EOF can still
> contain valid, dirty data so we must not punch them out of the
> cache.
>
> That just leaves the xfs_vm_write_end() failure handling.
> The only failure case here is that we didn't copy the entire range,
> and generic_write_end() handles that by zeroing the region of the
> page that wasn't copied,

Are you referring to the call chain xfs_vm_write_end ->
generic_write_end -> block_write_end -> page_zero_new_buffers?

> we don't have to punch out blocks within
> the file because they are guaranteed to contain zeros. Hence we only
> have to handle the existing "beyond EOF" case and don't need access
> to the buffers on the page. Hence it remains largely unchanged.
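As a sanity check of the byte-to-FSB rounding the punch range relies on,
here is a userspace sketch (`b_to_fsb` and `blocks_to_punch` are my
names; `b_to_fsb` mimics XFS_B_TO_FSB's round-up, with blocklog 9 for
the 512 byte block size filesystem in the reproducer):

```c
#include <assert.h>

/*
 * Userspace mock of XFS_B_TO_FSB(): convert a byte offset to a
 * filesystem block number, rounding up. blocklog plays the role
 * of sb_blocklog (9 for 512 byte blocks).
 */
static long long
b_to_fsb(long long bytes, int blocklog)
{
	long long mask = (1LL << blocklog) - 1;

	return (bytes + mask) >> blocklog;
}

/*
 * The guard in the punch helper: with both ends rounded up, a failed
 * range that lies entirely inside one block rounds to an empty FSB
 * range and nothing is punched.
 */
static int
blocks_to_punch(long long start, long long end, int blocklog)
{
	long long start_fsb = b_to_fsb(start, blocklog);
	long long end_fsb = b_to_fsb(end, blocklog);

	return end_fsb <= start_fsb ? 0 : (int)(end_fsb - start_fsb);
}
```

Reassuringly, byte offset 0x16c00 rounds to FSB 0xb6, which is exactly
where the stale extent (ext 2) showed up in the debug output above.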
>
> Note that xfs_getbmap() can still trip over delalloc blocks beyond
> EOF that are left there by speculative delayed allocation. Hence
> this bug fix does not solve all known issues with bmap vs delalloc,
> but it does fix all the known accidental occurrences of the
> problem.
>
> Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
> ---
>  fs/xfs/xfs_aops.c |  173 +++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 127 insertions(+), 46 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 64ed87a..ae31c31 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1184,11 +1184,18 @@ __xfs_get_blocks(
> 					&imap, nimaps);
> 		if (error)
> 			return -error;
> +		new = 1;
> 	} else {
> 		/*
> 		 * Delalloc reservations do not require a transaction,
> -		 * we can go on without dropping the lock here.
> +		 * we can go on without dropping the lock here. If we
> +		 * are allocating a new delalloc block, make sure that
> +		 * we set the new flag so that we mark the buffer new so
> +		 * that we know that it is newly allocated if the write
> +		 * fails.
> 		 */
> +		if (nimaps && imap.br_startblock == HOLESTARTBLOCK)
> +			new = 1;
> 		error = xfs_iomap_write_delay(ip, offset, size, &imap);
> 		if (error)
> 			goto out_unlock;
> @@ -1405,52 +1412,91 @@ out_destroy_ioend:
> 	return ret;
> }
>
> +/*
> + * Punch out the delalloc blocks we have already allocated.

This language is confusing. I suggest that delay blocks are reserved
and real blocks are allocated. Tomato, tomato.

> + *
> + * Don't bother with xfs_setattr given that nothing can have made it to disk yet
> + * as the page is still locked at this point.
> + */
> +STATIC void
> +xfs_vm_kill_delalloc_range(
> +	struct inode		*inode,
> +	loff_t			start,
> +	loff_t			end)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	xfs_fileoff_t		start_fsb;
> +	xfs_fileoff_t		end_fsb;
> +	int			error;
> +
> +	start_fsb = XFS_B_TO_FSB(ip->i_mount, start);
> +	end_fsb = XFS_B_TO_FSB(ip->i_mount, end);
> +	if (end_fsb <= start_fsb)
> +		return;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> +						end_fsb - start_fsb);
> +	if (error) {
> +		/* something screwed, just bail */
> +		if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> +			xfs_alert(ip->i_mount,
> +		"xfs_vm_write_failed: unable to clean up ino %lld",

Consider updating the function name in this error message and printing
out the value of error.

> +				ip->i_ino);
> +		}
> +	}
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +}
> +
>  STATIC void
>  xfs_vm_write_failed(
> -	struct address_space	*mapping,
> -	loff_t			to)
> +	struct inode		*inode,
> +	struct page		*page,
> +	loff_t			pos,
> +	unsigned		len)
>  {
> -	struct inode		*inode = mapping->host;
> +	loff_t			block_offset = pos & PAGE_MASK;
> +	loff_t			block_start;
> +	loff_t			block_end;
> +	loff_t			from = pos & (PAGE_CACHE_SIZE - 1);
> +	loff_t			to = from + len;
> +	struct buffer_head	*bh, *head;
>
> -	if (to > inode->i_size) {
> -		/*
> -		 * Punch out the delalloc blocks we have already allocated.
> -		 *
> -		 * Don't bother with xfs_setattr given that nothing can have
> -		 * made it to disk yet as the page is still locked at this
> -		 * point.
> -		 */
> -		struct xfs_inode *ip = XFS_I(inode);
> -		xfs_fileoff_t	start_fsb;
> -		xfs_fileoff_t	end_fsb;
> -		int		error;
> +	ASSERT(block_offset + from == pos);
>
> -		truncate_pagecache(inode, to, inode->i_size);
> +	head = page_buffers(page);
> +	block_start = 0;
> +	for (bh = head; bh != head || !block_start;
> +	     bh = bh->b_this_page, block_start = block_end,
> +	     block_offset += bh->b_size) {
> +		block_end = block_start + bh->b_size;
>
> -		/*
> -		 * Check if there are any blocks that are outside of i_size
> -		 * that need to be trimmed back.
> -		 */
> -		start_fsb = XFS_B_TO_FSB(ip->i_mount, inode->i_size);
> -		end_fsb = XFS_B_TO_FSB(ip->i_mount, to);
> -		if (end_fsb <= start_fsb)
> -			return;
> -
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> -		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> -						end_fsb - start_fsb);
> -		if (error) {
> -			/* something screwed, just bail */
> -			if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> -				xfs_alert(ip->i_mount,
> -			"xfs_vm_write_failed: unable to clean up ino %lld",
> -					ip->i_ino);
> -			}
> -		}
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +		/* skip buffers before the write */
> +		if (block_end <= from)
> +			continue;
> +
> +		/* if the buffer is after the write, we're done */
> +		if (block_start >= to)

*blink* I was looking pretty hard at that for an off-by-one. Mark
straightened me out. Eesh.

> +			break;
> +
> +		if (!buffer_delay(bh))
> +			continue;
> +
> +		if (!buffer_new(bh) && block_offset < i_size_read(inode))
> +			continue;
> +
> +		xfs_vm_kill_delalloc_range(inode, block_offset,
> +					   block_offset + bh->b_size);
> 	}
> +
> }
>
> +/*
> + * This used to call block_write_begin(), but it unlocks and releases the page
> + * on error, and we need that page to be able to punch stale delalloc blocks out
> + * on failure. hence we copy-n-waste it here and call xfs_vm_write_failed() at
> + * the appropriate point.
> + */
>  STATIC int
>  xfs_vm_write_begin(
> 	struct file		*file,
> @@ -1461,15 +1507,40 @@ xfs_vm_write_begin(
> 	struct page		**pagep,
> 	void			**fsdata)
>  {
> -	int			ret;
> +	pgoff_t			index = pos >> PAGE_CACHE_SHIFT;
> +	struct page		*page;
> +	int			status;
>
> -	ret = block_write_begin(mapping, pos, len, flags | AOP_FLAG_NOFS,
> -				pagep, xfs_get_blocks);
> -	if (unlikely(ret))
> -		xfs_vm_write_failed(mapping, pos + len);
> -	return ret;
> +	ASSERT(len <= PAGE_CACHE_SIZE);
> +
> +	page = grab_cache_page_write_begin(mapping, index,
> +					   flags | AOP_FLAG_NOFS);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	status = __block_write_begin(page, pos, len, xfs_get_blocks);
> +	if (unlikely(status)) {
> +		struct inode	*inode = mapping->host;
> +
> +		xfs_vm_write_failed(inode, page, pos, len);
> +		unlock_page(page);

Consistent with block_write_begin.

> +
> +		if (pos + len > i_size_read(inode))
> +			truncate_pagecache(inode, pos + len, i_size_read(inode));
> +
> +		page_cache_release(page);
> +		page = NULL;
> +	}
> +
> +	*pagep = page;
> +	return status;
>  }
>
> +/*
> + * On failure, we only need to kill delalloc blocks beyond EOF because they
> + * will never be written. For blocks within EOF, generic_write_end() zeros them
> + * so they are safe to leave alone and be written with all the other valid data.
> + */
>  STATIC int
>  xfs_vm_write_end(
> 	struct file		*file,
> @@ -1482,9 +1553,19 @@ xfs_vm_write_end(
>  {
> 	int			ret;
>
> +	ASSERT(len <= PAGE_CACHE_SIZE);
> +
> 	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> -	if (unlikely(ret < len))
> -		xfs_vm_write_failed(mapping, pos + len);
> +	if (unlikely(ret < len)) {
> +		struct inode	*inode = mapping->host;
> +		size_t		isize = i_size_read(inode);
> +		loff_t		to = pos + len;
> +
> +		if (to > isize) {
> +			truncate_pagecache(inode, to, isize);
> +			xfs_vm_kill_delalloc_range(inode, isize, to);
> +		}
> +	}
> 	return ret;
>  }

Aside from a few nits this is looking good.
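For what it's worth, the skip/stop tests in the buffer walk check out
once you treat each buffer as the half-open byte range
[block_start, block_end) within the page. A minimal userspace model
(the enum and `classify` are my names, not the patch's):

```c
#include <assert.h>

/* Outcome of the two range tests at the top of the buffer walk. */
enum walk_action { WALK_SKIP, WALK_STOP, WALK_CONSIDER };

/*
 * Model of the walk's range checks: the buffer covers bytes
 * [block_start, block_end) of the page, the failed write covers
 * [from, to). Half-open intervals make both comparisons exact,
 * with no off-by-one.
 */
static enum walk_action
classify(long long block_start, long long block_end,
	 long long from, long long to)
{
	if (block_end <= from)
		return WALK_SKIP;	/* entirely before the write */
	if (block_start >= to)
		return WALK_STOP;	/* entirely after the write */
	return WALK_CONSIDER;		/* overlaps the failed write */
}
```

A buffer whose end merely touches `from`, or whose start merely touches
`to`, does not overlap the write, so `<=` and `>=` are the right tests.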
Reviewed-by: Ben Myers <bpm@xxxxxxx>

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs