On Thu, Apr 23, 2015 at 12:42:44PM -0400, Brian Foster wrote:
> The fsync() requirements for crash consistency on XFS are to flush file
> data and force any in-core inode updates to the log. We currently check
> whether the inode is pinned to identify whether the log needs to be
> forced, since a non-zero pin count generally represents an inode that
> has transactions awaiting a flush to the on-disk log.
>
> This is not sufficient in all cases, however. Reports of xfstests test
> generic/311 failures on ppc64/s390x hosts have identified failures to
> fsync outstanding inode modifications due to the inode not being pinned
> at the time of the fsync. This occurs because certain bmap updates can
> complete by logging bmapbt buffers but without ever dirtying (and thus
> pinning) the core inode. The following is a specific incarnation of this
> problem:
>
> $ mount $dev /mnt -o noatime,nobarrier
> $ for i in $(seq 0 2 31); do \
>     xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
>   done
> $ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
>     hexdump /mnt/file; \
>     ./xfstests-dev/src/godown /mnt
> ...
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0014000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 00f8000
> $ umount /mnt; mount ...
> $ hexdump /mnt/file
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 00f8000
>
> In short, the unwritten extent conversion for the last write is lost
> despite the fact that an fsync executed before the filesystem was
> shutdown. Note that this is impossible to reproduce on v5 supers due to
> unconditional time callbacks for di_changecount and highly difficult to
> reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
> frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
> reduces timer granularity enough to increase the odds that time updates
> are skipped and allows this to reproduce within a handful of attempts.
>
> To deal with this problem, make sure that the inode is logged in the
> unwritten extent conversion path. Fix up the logflags, if necessary,
> after the extent conversion to keep the extent update code consistent
> with the other extent update helpers. This fixup is not necessary for
> the other (hole, delay) extent helpers because they execute in the block
> allocation codepath, which already logs the inode for other reasons
> (e.g., for di_nblocks).
>
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
>
> v2:
> - Log inode unconditionally on unwritten extent conversion and retain
>   the fsync pincount check.
> v1: http://oss.sgi.com/pipermail/xfs/2015-April/041468.html
>
>  fs/xfs/libxfs/xfs_bmap.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index aeffeaa..e74e42bf 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4417,6 +4417,21 @@ xfs_bmapi_convert_unwritten(
>  	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
>  			&bma->cur, mval, bma->firstblock, bma->flist,
>  			&tmp_logflags);
> +	/*
> +	 * Unwritten extent conversion might not have dirtied the inode
> +	 * depending on the extent state. Unlike block allocation (e.g.,
> +	 * di_nblocks), there may be no other reason to log the inode in the
> +	 * unwritten extent conversion path.
> +	 *
> +	 * We need to make sure the inode is dirty in the transaction for the
> +	 * sake of fsync(), which will not force the log for this transaction
> +	 * unless it sees the inode pinned. This can only happen for btree
> +	 * format inodes so use XFS_ILOG_CORE.
> +	 */
> +	if (!error && !tmp_logflags) {
> +		ASSERT(bma->cur);
> +		tmp_logflags |= XFS_ILOG_CORE;
> +	}
>  	bma->logflags |= tmp_logflags;
>  	if (error)
>  		return error;

I'd just do:

	bma->logflags |= tmp_logflags | XFS_ILOG_CORE;

Because it really doesn't matter if we log an unchanged inode core
or not - it's likely already in the CIL or AIL given we are doing
unwritten extent conversion, so it is unlikely to introduce
significant new overhead from doing this....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
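
For readers comparing the two variants discussed in this thread, the difference in the resulting log flags can be sketched in standalone userspace C. This is only an illustrative model, not kernel code: the function names, the `cur` parameter, and the `ILOG_*` values are hypothetical stand-ins, and the real definitions live in the XFS headers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in flag values for illustration only; not taken from the XFS headers. */
#define ILOG_CORE 0x1
#define ILOG_DEXT 0x4

/*
 * Model of the variant in the quoted patch: add the core flag only when
 * the conversion succeeded and set no log flags of its own. The patch
 * asserts that a btree cursor exists in that case; `cur` models it.
 */
static int logflags_conditional(int error, int tmp_logflags, const void *cur)
{
	if (!error && !tmp_logflags) {
		assert(cur != NULL);
		tmp_logflags |= ILOG_CORE;
	}
	return tmp_logflags;
}

/*
 * Model of the variant suggested in the reply: always add the core flag,
 * regardless of the error value or the flags the conversion returned.
 */
static int logflags_unconditional(int tmp_logflags)
{
	return tmp_logflags | ILOG_CORE;
}
```

In this model the two variants agree when the conversion succeeds and returns no flags. They differ when the conversion already set some other flag, or when it failed: the unconditional form still adds the core flag there, which the reply argues costs little in practice.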