Hi Dave, On Fri, 28 Aug 2015, Dave Chinner wrote: > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > Sage Weil reported that a ceph test workload was writing to the > log on every fdatasync during an overwrite workload. Event tracing > showed that the only metadata modification being made was the > timestamp updates during the write(2) syscall, but fdatasync(2) > is supposed to ignore them. The key observation was that the > transactions in the log all looked like this: > > INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32 > > And contained a flags field of 0x45 or 0x85, and had data and > attribute forks following the inode core. This means that the > timestamp updates were triggering dirty relogging of previously > logged parts of the inode that hadn't yet been flushed back to > disk. > > This is caused by xfs_trans_log_inode(), where it folds the dirty > fields that have previously been logged back into the current > transaction dirty fields from the inode item ili_last_fields. The > issue is that ili_last_fields only contains a non-zero value when > the inode is in the process of being flushed. The typical state > progression is this, using a core field update as the modification > occuring: > > state ili_fields ili_last_fields > clean 0 0 > modified, logged XFS_ILOG_CORE 0 > flushed to buffer 0 XFS_ILOG_CORE > <buffer submitted> > buffer IO done (clean) 0 0 > > However, if we run a new transaction after it has been flushed to > buffer but before the buffer IO is done, we don't know if the > modifications in the inode buffer (i.e. what is in ili_last_fields) > will reach the disk before the new transaction reaches the log. > Hence to keep transactional ordering correct in recovery, we need to > ensure the new transaction re-logs the modifications that are being > flushed to disk. > > By relogging, we ensure that if the transaction makes it to disk and > the buffer doesn't, then recovery replays all the changes upto that > point correctly. If the transaction does not make it disk, but the > buffer does, then recovery will see that the inode on disk is more > recent than in the log and won't overwrite it with changes that it > already contains. > > The upshot of this is that while the inode buffer sits in memory > with the inode in the "flushed" state, fdatasync is going to see > the relogged state in the ili_fields. i.e: > > What is happening here is this: > > state ili_fields ili_last_fields > clean 0 0 > modified, logged CORE 0 > flushed to buffer 0 CORE > <write> > timestamp update TIMESTAMP CORE > <xfs_trans_inode_log pulls in ili_last_fields> > CORE|TIMESTAMP CORE > <fdatasync> > sees field other than TIMESTAMP in ili_fields, > triggers xfs_log_force_lsn to flush inode > > <buffer submittted> > buffer IO done (clean) CORE|TIMESTAMP 0 > ..... > <write> > timestamp update CORE|TIMESTAMP 0 > <fdatasync> > sees field other than TIMESTAMP in ili_fields, > triggers xfs_log_force_lsn to flush inode > ..... > <write> > timestamp update CORE|TIMESTAMP 0 > <fdatasync> > sees field other than TIMESTAMP in ili_fields, > triggers xfs_log_force_lsn to flush inode > > So, essentially, once a race condition on the buffer flush has > occurred, the condition is not cleared until the inode is flushed to > the buffer again and it is written without racing against the inode > being dirtied again. > > The behaviour we really want here is to capture the timestamp > update transactionally, but not trigger fdatasync because of the > relogged fields that haven't been modified since the inode was > flushed to the buffer. We still need to relog them, but we don't > need to force the log in the fdatasync case. > > To do this, don't fold the ili_last_fields value into ili_fields > when logging the inode. Instead, fold it into the fields that get > logged during formatting of the inode item. This means that we will > stop logging those fields the moment we know that there is a more > recent version of the inode on disk than we have in the log and so > we don't need to log those fields anymore as the next transaction on > disk doesn't need to replay them. > > Doing this separates changes that are in memory but are not being > flushed from those that have been flushed. Hence we can now look at > ili_fields and hence see what fields have been modified since the > last flush, and hence whether fdatasync needs to force the log or > not. If non-timestamp changes have been made since the inode was > flushed to the backing buffer, then fdatasync will do exactly the > right thing (i.e. force the log). > > Reported-by: Sage Weil <sage@xxxxxxxxxx> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > Version 2: > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed > when committing the patch locally the first time. I gave this a go on my machine but I'm still seeing the same symptom. I've gathered the trace, strace, and other useful bits at http://newdream.net/~sage/drop/rocksdb.2/ This is pretty easy to reproduce with the ceph_test_keyvaluedb binary (built on fedora 22), also in that dir: rm -rf kv_test_temp_dir/ ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1 Thanks for looking into this! sage > > fs/xfs/xfs_inode_item.c | 161 +++++++++++++++++++++++++++++++++-------------- > fs/xfs/xfs_inode_item.h | 1 + > fs/xfs/xfs_trans_inode.c | 8 --- > 3 files changed, 115 insertions(+), 55 deletions(-) > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 62bd80f..be04eb2 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -29,6 +29,63 @@ > #include "xfs_trans_priv.h" > #include "xfs_log.h" > > +/* > + * Notes on ili_fields, ili_format_fields and ili_last_fields. > + * > + * ili_fields contains the flags that reflect the current changes that are in > + * memory and have been logged, but have not been flushed to the backing buffer > + * for writeback yet. When a transaction is done, the fields that are modified > + * are added to ili_fields and all those fields are logged. This results in > + * repeated transactions correctly relogging all the other changes in memory and > + * allows the inode to be re-ordered (moved forward) in the AIL safely. > + * ili_fields is copied to ili_last_fields when the inode is flushed to the > + * backing buffer and is then cleared, indicating that the inode in memory is > + * now clean from a transactional change point of view and does not need to > + * relog those fields anymore on tranaction commit. > + * > + * ili_last_fields, therefore, is only set while there is an unresolved inode > + * flush being done (i.e. flush lock is held). We need to keep this state > + * available to avoid a transaction recovery vs inode buffer IO completion race > + * if we crash. If the inode is logged again makes it to disk before the inode > + * buffer IO complete, log recovery will replay the latest inode transaction and > + * can lose the changes that were in the inode buffer. Hence we still need to > + * log the changes that have been flushed so that transactions issued during the > + * inode buffer write still contain the modifications being flushed. Once the > + * inode buffer IO completes, we no longer need to log those changes as the > + * inode on disk the same as the inode in memory apart from the changes made > + * since the inode was flushed (i.e. those recorded in ili_fields). > + * > + * ili_format_fields is only used during transaction commit. It is the > + * aggregation of ili_fields and ili_last_fields, as sampled during the sizing > + * of the inode item to be logged. We need this field definition to be constant > + * between sizing and formatting, but inode buffer IO can complete > + * asynchronously with transaction commit and so we must only read it once so > + * that the different stages of the item formatting work correctly. We don't > + * care about the async completion clearing ili_last_fields after we've sampled > + * it - if we log too much then so be it - we won't log it next time. Once we've > + * formatted the inode item, we need to propagate the ili_format_fields value to > + * the on-disk inode log item format field, and then use it to clear all the > + * fields that we marked for logging but were not dirty from ili_fields. > + * > + * This separation of changes allows us to accurately determine what fields in > + * the inode have changed since it was last flushed to disk. This is important > + * for fdatasync() performance as there are certain fields that, if modified, > + * should not trigger log forces because they contain metadata that fdatasync() > + * does not need to guarantee is safe on disk. This is, currently, just > + * XFS_ILOG_TIMESTAMP, though may be extended in future to cover other metadata. > + * > + * If we simply fold ili_last_fields back into ili_fields when we log the inode > + * in a transaction, then if we have a flush/modification race it results in > + * every timestamp update causing fdatasync to flush the log because ili_fields > + * has fields other than XFS_ILOG_TIMESTAMP set in it. This behaviour will > + * persist until the inode is next flushed to it's backing buffer and that > + * buffer is written and completed without another flush/modification race > + * occuring. Hence we keep the change state separate and only combine them when > + * formatting the inode into the log. > + * > + * For more information, there's another big comment in xfs_iflush_int() about > + * this flush/modification race condition. > + */ > > kmem_zone_t *xfs_ili_zone; /* inode log item zone */ > > @@ -40,6 +97,7 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip) > STATIC void > xfs_inode_item_data_fork_size( > struct xfs_inode_log_item *iip, > + unsigned int fields, > int *nvecs, > int *nbytes) > { > @@ -47,7 +105,7 @@ xfs_inode_item_data_fork_size( > > switch (ip->i_d.di_format) { > case XFS_DINODE_FMT_EXTENTS: > - if ((iip->ili_fields & XFS_ILOG_DEXT) && > + if ((fields & XFS_ILOG_DEXT) && > ip->i_d.di_nextents > 0 && > ip->i_df.if_bytes > 0) { > /* worst case, doesn't subtract delalloc extents */ > @@ -56,14 +114,14 @@ xfs_inode_item_data_fork_size( > } > break; > case XFS_DINODE_FMT_BTREE: > - if ((iip->ili_fields & XFS_ILOG_DBROOT) && > + if ((fields & XFS_ILOG_DBROOT) && > ip->i_df.if_broot_bytes > 0) { > *nbytes += ip->i_df.if_broot_bytes; > *nvecs += 1; > } > break; > case XFS_DINODE_FMT_LOCAL: > - if ((iip->ili_fields & XFS_ILOG_DDATA) && > + if ((fields & XFS_ILOG_DDATA) && > ip->i_df.if_bytes > 0) { > *nbytes += roundup(ip->i_df.if_bytes, 4); > *nvecs += 1; > @@ -82,6 +140,7 @@ xfs_inode_item_data_fork_size( > STATIC void > xfs_inode_item_attr_fork_size( > struct xfs_inode_log_item *iip, > + unsigned int fields, > int *nvecs, > int *nbytes) > { > @@ -89,7 +148,7 @@ xfs_inode_item_attr_fork_size( > > switch (ip->i_d.di_aformat) { > case XFS_DINODE_FMT_EXTENTS: > - if ((iip->ili_fields & XFS_ILOG_AEXT) && > + if ((fields & XFS_ILOG_AEXT) && > ip->i_d.di_anextents > 0 && > ip->i_afp->if_bytes > 0) { > /* worst case, doesn't subtract unused space */ > @@ -98,14 +157,14 @@ xfs_inode_item_attr_fork_size( > } > break; > case XFS_DINODE_FMT_BTREE: > - if ((iip->ili_fields & XFS_ILOG_ABROOT) && > + if ((fields & XFS_ILOG_ABROOT) && > ip->i_afp->if_broot_bytes > 0) { > *nbytes += ip->i_afp->if_broot_bytes; > *nvecs += 1; > } > break; > case XFS_DINODE_FMT_LOCAL: > - if ((iip->ili_fields & XFS_ILOG_ADATA) && > + if ((fields & XFS_ILOG_ADATA) && > ip->i_afp->if_bytes > 0) { > *nbytes += roundup(ip->i_afp->if_bytes, 4); > *nvecs += 1; > @@ -133,13 +192,17 @@ xfs_inode_item_size( > struct xfs_inode_log_item *iip = INODE_ITEM(lip); > struct xfs_inode *ip = iip->ili_inode; > > + iip->ili_format_fields = iip->ili_fields | iip->ili_last_fields; > + > *nvecs += 2; > *nbytes += sizeof(struct xfs_inode_log_format) + > xfs_icdinode_size(ip->i_d.di_version); > > - xfs_inode_item_data_fork_size(iip, nvecs, nbytes); > + xfs_inode_item_data_fork_size(iip, iip->ili_format_fields, > + nvecs, nbytes); > if (XFS_IFORK_Q(ip)) > - xfs_inode_item_attr_fork_size(iip, nvecs, nbytes); > + xfs_inode_item_attr_fork_size(iip, iip->ili_format_fields, > + nvecs, nbytes); > } > > STATIC void > @@ -151,14 +214,14 @@ xfs_inode_item_format_data_fork( > { > struct xfs_inode *ip = iip->ili_inode; > size_t data_bytes; > + unsigned int fields = iip->ili_format_fields; > > switch (ip->i_d.di_format) { > case XFS_DINODE_FMT_EXTENTS: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > - XFS_ILOG_DEV | XFS_ILOG_UUID); > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > + XFS_ILOG_DEV | XFS_ILOG_UUID); > > - if ((iip->ili_fields & XFS_ILOG_DEXT) && > + if ((fields & XFS_ILOG_DEXT) && > ip->i_d.di_nextents > 0 && > ip->i_df.if_bytes > 0) { > struct xfs_bmbt_rec *p; > @@ -175,15 +238,14 @@ xfs_inode_item_format_data_fork( > ilf->ilf_dsize = data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_DEXT; > + fields &= ~XFS_ILOG_DEXT; > } > break; > case XFS_DINODE_FMT_BTREE: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT | > - XFS_ILOG_DEV | XFS_ILOG_UUID); > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT | > + XFS_ILOG_DEV | XFS_ILOG_UUID); > > - if ((iip->ili_fields & XFS_ILOG_DBROOT) && > + if ((fields & XFS_ILOG_DBROOT) && > ip->i_df.if_broot_bytes > 0) { > ASSERT(ip->i_df.if_broot != NULL); > xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IBROOT, > @@ -192,16 +254,15 @@ xfs_inode_item_format_data_fork( > ilf->ilf_dsize = ip->i_df.if_broot_bytes; > ilf->ilf_size++; > } else { > - ASSERT(!(iip->ili_fields & > - XFS_ILOG_DBROOT)); > - iip->ili_fields &= ~XFS_ILOG_DBROOT; > + ASSERT(!(fields & XFS_ILOG_DBROOT)); > + fields &= ~XFS_ILOG_DBROOT; > } > break; > case XFS_DINODE_FMT_LOCAL: > - iip->ili_fields &= > - ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | > - XFS_ILOG_DEV | XFS_ILOG_UUID); > - if ((iip->ili_fields & XFS_ILOG_DDATA) && > + fields &= ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | > + XFS_ILOG_DEV | XFS_ILOG_UUID); > + > + if ((fields & XFS_ILOG_DDATA) && > ip->i_df.if_bytes > 0) { > /* > * Round i_bytes up to a word boundary. > @@ -218,27 +279,28 @@ xfs_inode_item_format_data_fork( > ilf->ilf_dsize = (unsigned)data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_DDATA; > + fields &= ~XFS_ILOG_DDATA; > } > break; > case XFS_DINODE_FMT_DEV: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > - XFS_ILOG_DEXT | XFS_ILOG_UUID); > - if (iip->ili_fields & XFS_ILOG_DEV) > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > + XFS_ILOG_DEXT | XFS_ILOG_UUID); > + if (fields & XFS_ILOG_DEV) > ilf->ilf_u.ilfu_rdev = ip->i_df.if_u2.if_rdev; > break; > case XFS_DINODE_FMT_UUID: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > - XFS_ILOG_DEXT | XFS_ILOG_DEV); > - if (iip->ili_fields & XFS_ILOG_UUID) > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > + XFS_ILOG_DEXT | XFS_ILOG_DEV); > + if (fields & XFS_ILOG_UUID) > ilf->ilf_u.ilfu_uuid = ip->i_df.if_u2.if_uuid; > break; > default: > ASSERT(0); > break; > } > + > + /* reflect the logged fields back in ili_format_fields */ > + iip->ili_format_fields = fields; > } > > STATIC void > @@ -250,13 +312,13 @@ xfs_inode_item_format_attr_fork( > { > struct xfs_inode *ip = iip->ili_inode; > size_t data_bytes; > + unsigned int fields = iip->ili_format_fields; > > switch (ip->i_d.di_aformat) { > case XFS_DINODE_FMT_EXTENTS: > - iip->ili_fields &= > - ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT); > + fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT); > > - if ((iip->ili_fields & XFS_ILOG_AEXT) && > + if ((fields & XFS_ILOG_AEXT) && > ip->i_d.di_anextents > 0 && > ip->i_afp->if_bytes > 0) { > struct xfs_bmbt_rec *p; > @@ -272,14 +334,13 @@ xfs_inode_item_format_attr_fork( > ilf->ilf_asize = data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_AEXT; > + fields &= ~XFS_ILOG_AEXT; > } > break; > case XFS_DINODE_FMT_BTREE: > - iip->ili_fields &= > - ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT); > + fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT); > > - if ((iip->ili_fields & XFS_ILOG_ABROOT) && > + if ((fields & XFS_ILOG_ABROOT) && > ip->i_afp->if_broot_bytes > 0) { > ASSERT(ip->i_afp->if_broot != NULL); > > @@ -289,14 +350,13 @@ xfs_inode_item_format_attr_fork( > ilf->ilf_asize = ip->i_afp->if_broot_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_ABROOT; > + fields &= ~XFS_ILOG_ABROOT; > } > break; > case XFS_DINODE_FMT_LOCAL: > - iip->ili_fields &= > - ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT); > + fields &= ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT); > > - if ((iip->ili_fields & XFS_ILOG_ADATA) && > + if ((fields & XFS_ILOG_ADATA) && > ip->i_afp->if_bytes > 0) { > /* > * Round i_bytes up to a word boundary. > @@ -313,13 +373,16 @@ xfs_inode_item_format_attr_fork( > ilf->ilf_asize = (unsigned)data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_ADATA; > + fields &= ~XFS_ILOG_ADATA; > } > break; > default: > ASSERT(0); > break; > } > + > + /* reflect the logged fields back in ili_format_fields */ > + iip->ili_format_fields = fields; > } > > /* > @@ -359,12 +422,16 @@ xfs_inode_item_format( > if (XFS_IFORK_Q(ip)) { > xfs_inode_item_format_attr_fork(iip, ilf, lv, &vecp); > } else { > - iip->ili_fields &= > + iip->ili_format_fields &= > ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT | XFS_ILOG_AEXT); > } > > /* update the format with the exact fields we actually logged */ > - ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP); > + ilf->ilf_fields |= (iip->ili_format_fields & ~XFS_ILOG_TIMESTAMP); > + > + /* clear any fields we didn't log from ili_fields */ > + iip->ili_fields &= iip->ili_format_fields; > + iip->ili_format_fields = 0; > } > > /* > diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h > index 488d812..43a9e1c 100644 > --- a/fs/xfs/xfs_inode_item.h > +++ b/fs/xfs/xfs_inode_item.h > @@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item { > unsigned short ili_logged; /* flushed logged data */ > unsigned int ili_last_fields; /* fields when flushed */ > unsigned int ili_fields; /* fields to be logged */ > + unsigned int ili_format_fields; /* combined log fields */ > } xfs_inode_log_item_t; > > static inline int xfs_inode_clean(xfs_inode_t *ip) > diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c > index 17280cd..77f1e8a 100644 > --- a/fs/xfs/xfs_trans_inode.c > +++ b/fs/xfs/xfs_trans_inode.c > @@ -123,13 +123,5 @@ xfs_trans_log_inode( > tp->t_flags |= XFS_TRANS_DIRTY; > ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY; > > - /* > - * Always OR in the bits from the ili_last_fields field. > - * This is to coordinate with the xfs_iflush() and xfs_iflush_done() > - * routines in the eventual clearing of the ili_fields bits. > - * See the big comment in xfs_iflush() for an explanation of > - * this coordination mechanism. > - */ > - flags |= ip->i_itemp->ili_last_fields; > ip->i_itemp->ili_fields |= flags; > } > > _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs