[PATCH] xfs: timestamp updates cause excessive fdatasync log traffic

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 28 Aug 2015 11:23:10 +1000

From: Dave Chinner <dchinner@xxxxxxxxxx>

Sage Weil reported that a ceph test workload was writing to the
log on every fdatasync during an overwrite workload. Event tracing
showed that the only metadata modification being made was the
timestamp updates during the write(2) syscall, but fdatasync(2)
is supposed to ignore them. The key observation was that the
transactions in the log all looked like this:

INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32

And contained a flags field of 0x45 or 0x85, and had data and
attribute forks following the inode core. This means that the
timestamp updates were triggering dirty relogging of previously
logged parts of the inode that hadn't yet been flushed back to
disk.

This is caused by xfs_trans_log_inode(), where it folds the dirty
fields that have previously been logged back into the current
transaction dirty fields from the inode item ili_last_fields. The
issue is that ili_last_fields only contains a non-zero value when
the inode is in the process of being flushed. The typical state
progression is this, using a core field update as the modification
occuring:

state			ili_fields		ili_last_fields
clean			    0			      0
modified, logged	XFS_ILOG_CORE		      0
flushed to buffer	    0			XFS_ILOG_CORE
<buffer submitted>
buffer IO done (clean)	    0			      0

However, if we run a new transaction after it has been flushed to
buffer but before the buffer IO is done, we don't know if the
modifications in the inode buffer (i.e. what is in ili_last_fields)
will reach the disk before the new transaction reaches the log.
Hence to keep transactional ordering correct in recovery, we need to
ensure the new transaction re-logs the modifications that are being
flushed to disk.

By relogging, we ensure that if the transaction makes it to disk and
the buffer doesn't, then recovery replays all the changes upto that
point correctly. If the transaction does not make it disk, but the
buffer does, then recovery will see that the inode on disk is more
recent than in the log and won't overwrite it with changes that it
already contains.

The upshot of this is that while the inode buffer sits in memory
with the inode in the "flushed" state, fdatasync is going to see
the relogged state in the ili_fields. i.e:

What is happening here is this:

state			ili_fields		ili_last_fields
clean			    0			      0
modified, logged	   CORE			      0
flushed to buffer	    0			     CORE
<write>
timestamp update	 TIMESTAMP		     CORE
<xfs_trans_inode_log pulls in ili_last_fields>
		       CORE|TIMESTAMP		     CORE
<fdatasync>
  sees field other than TIMESTAMP in ili_fields,
  triggers xfs_log_force_lsn to flush inode

<buffer submittted>
buffer IO done (clean) CORE|TIMESTAMP		      0
.....
<write>
timestamp update       CORE|TIMESTAMP		      0
<fdatasync>
  sees field other than TIMESTAMP in ili_fields,
  triggers xfs_log_force_lsn to flush inode
.....
<write>
timestamp update       CORE|TIMESTAMP		      0
<fdatasync>
  sees field other than TIMESTAMP in ili_fields,
  triggers xfs_log_force_lsn to flush inode

So, essentially, once a race condition on the buffer flush has
occurred, the condition is not cleared until the inode is flushed to
the buffer again and it is written without racing against the inode
being dirtied again.

The behaviour we really want here is to capture the timestamp
update transactionally, but not trigger fdatasync because of the
relogged fields that haven't been modified since the inode was
flushed to the buffer. We still need to relog them, but we don't
need to force the log in the fdatasync case.

To do this, don't fold the ili_last_fields value into ili_fields
when logging the inode. Instead, fold it into the fields that get
logged during formatting of the inode item. This means that we will
stop logging those fields the moment we know that there is a more
recent version of the inode on disk than we have in the log and so
we don't need to log those fields anymore as the next transaction on
disk doesn't need to replay them.

Doing this separates changes that are in memory but are not being
flushed from those that have been flushed. Hence we can now look at
ili_fields and hence see what fields have been modified since the
last flush, and hence whether fdatasync needs to force the log or
not. If non-timestamp changes have been made since the inode was
flushed to the backing buffer, then fdatasync will do exactly the
right thing (i.e. force the log).

Reported-by: Sage Weil <sage@xxxxxxxxxx>
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/xfs_inode_item.c | 161 ++++++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_inode_item.h |   1 +
 2 files changed, 115 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 62bd80f..be04eb2 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -29,6 +29,63 @@
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
 
+/*
+ * Notes on ili_fields, ili_format_fields and ili_last_fields.
+ *
+ * ili_fields contains the flags that reflect the current changes that are in
+ * memory and have been logged, but have not been flushed to the backing buffer
+ * for writeback yet. When a transaction is done, the fields that are modified
+ * are added to ili_fields and all those fields are logged. This results in
+ * repeated transactions correctly relogging all the other changes in memory and
+ * allows the inode to be re-ordered (moved forward) in the AIL safely.
+ * ili_fields is copied to ili_last_fields when the inode is flushed to the
+ * backing buffer and is then cleared, indicating that the inode in memory is
+ * now clean from a transactional change point of view and does not need to
+ * relog those fields anymore on tranaction commit.
+ *
+ * ili_last_fields, therefore, is only set while there is an unresolved inode
+ * flush being done (i.e. flush lock is held). We need to keep this state
+ * available to avoid a transaction recovery vs inode buffer IO completion race
+ * if we crash. If the inode is logged again makes it to disk before the inode
+ * buffer IO complete, log recovery will replay the latest inode transaction and
+ * can lose the changes that were in the inode buffer. Hence we still need to
+ * log the changes that have been flushed so that transactions issued during the
+ * inode buffer write still contain the modifications being flushed. Once the
+ * inode buffer IO completes, we no longer need to log those changes as the
+ * inode on disk the same as the inode in memory apart from the changes made
+ * since the inode was flushed (i.e. those recorded in ili_fields).
+ *
+ * ili_format_fields is only used during transaction commit. It is the
+ * aggregation of ili_fields and ili_last_fields, as sampled during the sizing
+ * of the inode item to be logged. We need this field definition to be constant
+ * between sizing and formatting, but inode buffer IO can complete
+ * asynchronously with transaction commit and so we must only read it once so
+ * that the different stages of the item formatting work correctly. We don't
+ * care about the async completion clearing ili_last_fields after we've sampled
+ * it - if we log too much then so be it - we won't log it next time. Once we've
+ * formatted the inode item, we need to propagate the ili_format_fields value to
+ * the on-disk inode log item format field, and then use it to clear all the
+ * fields that we marked for logging but were not dirty from ili_fields.
+ *
+ * This separation of changes allows us to accurately determine what fields in
+ * the inode have changed since it was last flushed to disk. This is important
+ * for fdatasync() performance as there are certain fields that, if modified,
+ * should not trigger log forces because they contain metadata that fdatasync()
+ * does not need to guarantee is safe on disk. This is, currently, just
+ * XFS_ILOG_TIMESTAMP, though may be extended in future to cover other metadata.
+ *
+ * If we simply fold ili_last_fields back into ili_fields when we log the inode
+ * in a transaction, then if we have a flush/modification race it results in
+ * every timestamp update causing fdatasync to flush the log because ili_fields
+ * has fields other than XFS_ILOG_TIMESTAMP set in it. This behaviour will
+ * persist until the inode is next flushed to it's backing buffer and that
+ * buffer is written and completed without another flush/modification race
+ * occuring.  Hence we keep the change state separate and only combine them when
+ * formatting the inode into the log.
+ *
+ * For more information, there's another big comment in xfs_iflush_int() about
+ * this flush/modification race condition.
+ */
 
 kmem_zone_t	*xfs_ili_zone;		/* inode log item zone */
 
@@ -40,6 +97,7 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
 STATIC void
 xfs_inode_item_data_fork_size(
 	struct xfs_inode_log_item *iip,
+	unsigned int		fields,
 	int			*nvecs,
 	int			*nbytes)
 {
@@ -47,7 +105,7 @@ xfs_inode_item_data_fork_size(
 
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
+		if ((fields & XFS_ILOG_DEXT) &&
 		    ip->i_d.di_nextents > 0 &&
 		    ip->i_df.if_bytes > 0) {
 			/* worst case, doesn't subtract delalloc extents */
@@ -56,14 +114,14 @@ xfs_inode_item_data_fork_size(
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
+		if ((fields & XFS_ILOG_DBROOT) &&
 		    ip->i_df.if_broot_bytes > 0) {
 			*nbytes += ip->i_df.if_broot_bytes;
 			*nvecs += 1;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
+		if ((fields & XFS_ILOG_DDATA) &&
 		    ip->i_df.if_bytes > 0) {
 			*nbytes += roundup(ip->i_df.if_bytes, 4);
 			*nvecs += 1;
@@ -82,6 +140,7 @@ xfs_inode_item_data_fork_size(
 STATIC void
 xfs_inode_item_attr_fork_size(
 	struct xfs_inode_log_item *iip,
+	unsigned int		fields,
 	int			*nvecs,
 	int			*nbytes)
 {
@@ -89,7 +148,7 @@ xfs_inode_item_attr_fork_size(
 
 	switch (ip->i_d.di_aformat) {
 	case XFS_DINODE_FMT_EXTENTS:
-		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
+		if ((fields & XFS_ILOG_AEXT) &&
 		    ip->i_d.di_anextents > 0 &&
 		    ip->i_afp->if_bytes > 0) {
 			/* worst case, doesn't subtract unused space */
@@ -98,14 +157,14 @@ xfs_inode_item_attr_fork_size(
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
+		if ((fields & XFS_ILOG_ABROOT) &&
 		    ip->i_afp->if_broot_bytes > 0) {
 			*nbytes += ip->i_afp->if_broot_bytes;
 			*nvecs += 1;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
+		if ((fields & XFS_ILOG_ADATA) &&
 		    ip->i_afp->if_bytes > 0) {
 			*nbytes += roundup(ip->i_afp->if_bytes, 4);
 			*nvecs += 1;
@@ -133,13 +192,17 @@ xfs_inode_item_size(
 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 	struct xfs_inode	*ip = iip->ili_inode;
 
+	iip->ili_format_fields = iip->ili_fields | iip->ili_last_fields;
+
 	*nvecs += 2;
 	*nbytes += sizeof(struct xfs_inode_log_format) +
 		   xfs_icdinode_size(ip->i_d.di_version);
 
-	xfs_inode_item_data_fork_size(iip, nvecs, nbytes);
+	xfs_inode_item_data_fork_size(iip, iip->ili_format_fields,
+				      nvecs, nbytes);
 	if (XFS_IFORK_Q(ip))
-		xfs_inode_item_attr_fork_size(iip, nvecs, nbytes);
+		xfs_inode_item_attr_fork_size(iip, iip->ili_format_fields,
+					      nvecs, nbytes);
 }
 
 STATIC void
@@ -151,14 +214,14 @@ xfs_inode_item_format_data_fork(
 {
 	struct xfs_inode	*ip = iip->ili_inode;
 	size_t			data_bytes;
+	unsigned int		fields = iip->ili_format_fields;
 
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEV | XFS_ILOG_UUID);
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEV | XFS_ILOG_UUID);
 
-		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
+		if ((fields & XFS_ILOG_DEXT) &&
 		    ip->i_d.di_nextents > 0 &&
 		    ip->i_df.if_bytes > 0) {
 			struct xfs_bmbt_rec *p;
@@ -175,15 +238,14 @@ xfs_inode_item_format_data_fork(
 			ilf->ilf_dsize = data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_DEXT;
+			fields &= ~XFS_ILOG_DEXT;
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DEXT |
-			  XFS_ILOG_DEV | XFS_ILOG_UUID);
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT |
+			    XFS_ILOG_DEV | XFS_ILOG_UUID);
 
-		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
+		if ((fields & XFS_ILOG_DBROOT) &&
 		    ip->i_df.if_broot_bytes > 0) {
 			ASSERT(ip->i_df.if_broot != NULL);
 			xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IBROOT,
@@ -192,16 +254,15 @@ xfs_inode_item_format_data_fork(
 			ilf->ilf_dsize = ip->i_df.if_broot_bytes;
 			ilf->ilf_size++;
 		} else {
-			ASSERT(!(iip->ili_fields &
-				 XFS_ILOG_DBROOT));
-			iip->ili_fields &= ~XFS_ILOG_DBROOT;
+			ASSERT(!(fields & XFS_ILOG_DBROOT));
+			fields &= ~XFS_ILOG_DBROOT;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		iip->ili_fields &=
-			~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEV | XFS_ILOG_UUID);
-		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
+		fields &= ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEV | XFS_ILOG_UUID);
+
+		if ((fields & XFS_ILOG_DDATA) &&
 		    ip->i_df.if_bytes > 0) {
 			/*
 			 * Round i_bytes up to a word boundary.
@@ -218,27 +279,28 @@ xfs_inode_item_format_data_fork(
 			ilf->ilf_dsize = (unsigned)data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_DDATA;
+			fields &= ~XFS_ILOG_DDATA;
 		}
 		break;
 	case XFS_DINODE_FMT_DEV:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEXT | XFS_ILOG_UUID);
-		if (iip->ili_fields & XFS_ILOG_DEV)
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEXT | XFS_ILOG_UUID);
+		if (fields & XFS_ILOG_DEV)
 			ilf->ilf_u.ilfu_rdev = ip->i_df.if_u2.if_rdev;
 		break;
 	case XFS_DINODE_FMT_UUID:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEXT | XFS_ILOG_DEV);
-		if (iip->ili_fields & XFS_ILOG_UUID)
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEXT | XFS_ILOG_DEV);
+		if (fields & XFS_ILOG_UUID)
 			ilf->ilf_u.ilfu_uuid = ip->i_df.if_u2.if_uuid;
 		break;
 	default:
 		ASSERT(0);
 		break;
 	}
+
+	/* reflect the logged fields back in ili_format_fields */
+	iip->ili_format_fields = fields;
 }
 
 STATIC void
@@ -250,13 +312,13 @@ xfs_inode_item_format_attr_fork(
 {
 	struct xfs_inode	*ip = iip->ili_inode;
 	size_t			data_bytes;
+	unsigned int		fields = iip->ili_format_fields;
 
 	switch (ip->i_d.di_aformat) {
 	case XFS_DINODE_FMT_EXTENTS:
-		iip->ili_fields &=
-			~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT);
+		fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT);
 
-		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
+		if ((fields & XFS_ILOG_AEXT) &&
 		    ip->i_d.di_anextents > 0 &&
 		    ip->i_afp->if_bytes > 0) {
 			struct xfs_bmbt_rec *p;
@@ -272,14 +334,13 @@ xfs_inode_item_format_attr_fork(
 			ilf->ilf_asize = data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_AEXT;
+			fields &= ~XFS_ILOG_AEXT;
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		iip->ili_fields &=
-			~(XFS_ILOG_ADATA | XFS_ILOG_AEXT);
+		fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT);
 
-		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
+		if ((fields & XFS_ILOG_ABROOT) &&
 		    ip->i_afp->if_broot_bytes > 0) {
 			ASSERT(ip->i_afp->if_broot != NULL);
 
@@ -289,14 +350,13 @@ xfs_inode_item_format_attr_fork(
 			ilf->ilf_asize = ip->i_afp->if_broot_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_ABROOT;
+			fields &= ~XFS_ILOG_ABROOT;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		iip->ili_fields &=
-			~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT);
+		fields &= ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT);
 
-		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
+		if ((fields & XFS_ILOG_ADATA) &&
 		    ip->i_afp->if_bytes > 0) {
 			/*
 			 * Round i_bytes up to a word boundary.
@@ -313,13 +373,16 @@ xfs_inode_item_format_attr_fork(
 			ilf->ilf_asize = (unsigned)data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_ADATA;
+			fields &= ~XFS_ILOG_ADATA;
 		}
 		break;
 	default:
 		ASSERT(0);
 		break;
 	}
+
+	/* reflect the logged fields back in ili_format_fields */
+	iip->ili_format_fields = fields;
 }
 
 /*
@@ -359,12 +422,16 @@ xfs_inode_item_format(
 	if (XFS_IFORK_Q(ip)) {
 		xfs_inode_item_format_attr_fork(iip, ilf, lv, &vecp);
 	} else {
-		iip->ili_fields &=
+		iip->ili_format_fields &=
 			~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT | XFS_ILOG_AEXT);
 	}
 
 	/* update the format with the exact fields we actually logged */
-	ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP);
+	ilf->ilf_fields |= (iip->ili_format_fields & ~XFS_ILOG_TIMESTAMP);
+
+	/* clear any fields we didn't log from ili_fields */
+	iip->ili_fields &= iip->ili_format_fields;
+	iip->ili_format_fields = 0;
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 488d812..43a9e1c 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item {
 	unsigned short		ili_logged;	   /* flushed logged data */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
+	unsigned int		ili_format_fields; /* combined log fields */
 } xfs_inode_log_item_t;
 
 static inline int xfs_inode_clean(xfs_inode_t *ip)
-- 
2.5.0

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs