Re: [PATCH v2] xfs: always log the inode on unwritten extent conversion

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 24 Apr 2015 08:32:58 +1000

On Thu, Apr 23, 2015 at 12:42:44PM -0400, Brian Foster wrote:
> The fsync() requirements for crash consistency on XFS are to flush file
> data and force any in-core inode updates to the log. We currently check
> whether the inode is pinned to identify whether the log needs to be
> forced, since a non-zero pin count generally represents an inode that
> has transactions awaiting a flush to the on-disk log.
> 
> This is not sufficient in all cases, however. Reports of xfstests test
> generic/311 failures on ppc64/s390x hosts have identified failures to
> fsync outstanding inode modifications due to the inode not being pinned
> at the time of the fsync. This occurs because certain bmap updates can
> complete by logging bmapbt buffers but without ever dirtying (and thus
> pinning) the core inode. The following is a specific incarnation of this
> problem:
> 
> $ mount $dev /mnt -o noatime,nobarrier
> $ for i in $(seq 0 2 31); do \
>         xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
> 	done
> $ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
> 	hexdump /mnt/file; \
> 	./xfstests-dev/src/godown /mnt
> ...
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0014000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 00f8000
> $ umount /mnt; mount ...
> $ hexdump /mnt/file
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 00f8000
> 
> In short, the unwritten extent conversion for the last write is lost
> despite the fact that an fsync executed before the filesystem was
> shutdown. Note that this is impossible to reproduce on v5 supers due to
> unconditional time callbacks for di_changecount and highly difficult to
> reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
> frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
> reduces timer granularity enough to increase the odds that time updates
> are skipped and allows this to reproduce within a handful of attempts.
> 
> To deal with this problem, make sure that the inode is logged in the
> unwritten extent conversion path. Fix up the logflags, if necessary,
> after the extent conversion to keep the extent update code consistent
> with the other extent update helpers. This fixup is not necessary for
> the other (hole, delay) extent helpers because they execute in the block
> allocation codepath, which already logs the inode for other reasons
> (e.g., for di_nblocks).
> 
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
> 
> v2:
> - Log inode unconditionally on unwritten extent conversion and retain
>   the fsync pincount check.
> v1: http://oss.sgi.com/pipermail/xfs/2015-April/041468.html
> 
>  fs/xfs/libxfs/xfs_bmap.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index aeffeaa..e74e42bf 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4417,6 +4417,21 @@ xfs_bmapi_convert_unwritten(
>  	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
>  			&bma->cur, mval, bma->firstblock, bma->flist,
>  			&tmp_logflags);
> +	/*
> +	 * Unwritten extent conversion might not have dirtied the inode
> +	 * depending on the extent state. Unlike block allocation (e.g.,
> +	 * di_nblocks), there may be no other reason to log the inode in the
> +	 * unwritten extent conversion path.
> +	 *
> +	 * We need to make sure the inode is dirty in the transaction for the
> +	 * sake of fsync(), which will not force the log for this transaction
> +	 * unless it sees the inode pinned. This can only happen for btree
> +	 * format inodes so use XFS_ILOG_CORE.
> +	 */
> +	if (!error && !tmp_logflags) {
> +		ASSERT(bma->cur);
> +		tmp_logflags |= XFS_ILOG_CORE;
> +	}
>  	bma->logflags |= tmp_logflags;
>  	if (error)
>  		return error;

I'd just do:

	bma->logflags |= tmp_logflags | XFS_ILOG_CORE;

Because it really doesn't matter if we log an unchanged inode core
or not - it's likely already in the CIL or AIL given we are doing
unwritten extent conversion, so it is unlikely to introduce
significant new overhead from doing this....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs