On Wed, May 14, 2014 at 11:46:19AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > The addition of direct formatting of log items into the CIL > linear buffer added alignment restrictions that the start of each > vector needed to be 64 bit aligned. Hence padding was added in > xlog_finish_iovec() to round up the vector length to ensure the next > vector started with the correct alignment. > > This adds a small number of bytes to the size of > the linear buffer that is otherwise unused. The issue is that we > then use the linear buffer size to determine the log space used by > the log item, and this includes the unused space. Hence when we > account for space used by the log item, it's more than is actually > written into the iclogs, and hence we slowly leak this space. > > This results on log hangs when reserving space, with threads getting > stuck with these stack traces: > > Call Trace: > [<ffffffff81d15989>] schedule+0x29/0x70 > [<ffffffff8150d3a2>] xlog_grant_head_wait+0xa2/0x1a0 > [<ffffffff8150d55d>] xlog_grant_head_check+0xbd/0x140 > [<ffffffff8150ee33>] xfs_log_reserve+0x103/0x220 > [<ffffffff814b7f05>] xfs_trans_reserve+0x2f5/0x310 > ..... > > > The 4 bytes is significant. Brain Foster did all the hard work in > tracking down a reproducable leak to inode chunk allocation (it went > away with the ikeep mount option). His rough numbers were that > creating 50,000 inodes leaked 11 log blocks. This turns out to be > roughly 800 inode chunks or 1600 inode cluster buffers. That > works out at roughly 4 bytes per cluster buffer logged, and at that > I started looking for a 4 byte leak in the buffer logging code. > > What I found was that a struct xfs_buf_log_format structure for an > inode cluster buffer is 28 bytes in length. This gets rounded up to > 32 bytes, but the vector length remains 28 bytes. Hence the CIL > ticket reservation is decremented by 32 bytes (via lv->lv_buf_len) > for that vector rather than 28 bytes which are written into the log. > > The fix for this problem is to separately track the bytes used by > the log vectors in the item and use that instead of the buffer > length when accounting for the log space that will be used by the > formatted log item. > > Again, thanks to Brian Foster for doing all the hard work and long > hours to isolate this leak and make finding the bug relatively > simple. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/xfs_log.h | 19 +++++++++++++------ > fs/xfs/xfs_log_cil.c | 7 ++++--- > 2 files changed, 17 insertions(+), 9 deletions(-) > > diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h > index 2c40044..84e0deb 100644 > --- a/fs/xfs/xfs_log.h > +++ b/fs/xfs/xfs_log.h > @@ -24,7 +24,8 @@ struct xfs_log_vec { > struct xfs_log_iovec *lv_iovecp; /* iovec array */ > struct xfs_log_item *lv_item; /* owner */ > char *lv_buf; /* formatted buffer */ > - int lv_buf_len; /* size of formatted buffer */ > + int lv_bytes; /* accounted space in buffer */ > + int lv_buf_len; /* aligned size of buffer */ > int lv_size; /* size of allocated lv */ > }; > > @@ -52,15 +53,21 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp, > return vec->i_addr; > } > > +/* > + * We need to make sure the next buffer is naturally aligned for the biggest > + * basic data type we put into it. We already accounted for this padding when > + * sizing the buffer. > + * > + * However, this padding does not get written into the log, and hence we have to > + * track the space used by the log vectors separately to prevent log space hangs > + * due to inaccurate accounting (i.e. a leak) of the used log space through the > + * CIL context ticket. > + */ > static inline void > xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len) > { > - /* > - * We need to make sure the next buffer is naturally aligned for the > - * biggest basic data type we put into it. We already accounted for > - * this when sizing the buffer. > - */ > lv->lv_buf_len += round_up(len, sizeof(uint64_t)); > + lv->lv_bytes += len; > vec->i_len = len; > } > Ok, so we basically keep separate tracking for the memory buffer vs. used log space. The former is now only used simply to align the start of the next buffer. The latter is aggregated and accounted against the transaction as used log space. This now ensures that the reserve grant head will be restored to the point that the tail will eventually hit when the associated log items are cycled through the AIL. In other words, all unused log space that the transaction has reserved is returned to the reserve head. This passes my isolated inode creation/removal and fsstress tests. After running some operations and issuing a freeze, the reserve grant head matches up exactly with the current log head (i.e., no lingering reservations). I'm going to run a bit more testing, including the longer test that originally reproduced this problem (over several days). But otherwise this looks pretty good to me. Thanks for tracking this down. Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx> I'm also going to look into exporting some of the data that was used to identify this problem. To provide more background, this originally manifested as an fs instance severely choking after several days of a high stress (fsstress) workload, and even then, was not straightforward to diagnose. While somewhat prominent, the smallish nature of the leak is difficult to detect under a normal workload, and may cause performance degradation long before an actual hang. Dumping some of the internal log state for debugging purposes actually made this problem obvious. For example, sampling the current log head and reserve grant head after some number of fsstress ops (with the fs quiesced) shows a clear trend of leaking the reserve grant head. I think exporting some of this internal data (e.g., through /sysfs) is generally beneficial and also should facilitate the addition of an xfstests test to explicitly detect such leaks in the future. Brian > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 039c873..b3425b3 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -97,7 +97,7 @@ xfs_cil_prepare_item( > { > /* Account for the new LV being passed in */ > if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) { > - *diff_len += lv->lv_buf_len; > + *diff_len += lv->lv_bytes; > *diff_iovecs += lv->lv_niovecs; > } > > @@ -111,7 +111,7 @@ xfs_cil_prepare_item( > else if (old_lv != lv) { > ASSERT(lv->lv_buf_len != XFS_LOG_VEC_ORDERED); > > - *diff_len -= old_lv->lv_buf_len; > + *diff_len -= old_lv->lv_bytes; > *diff_iovecs -= old_lv->lv_niovecs; > kmem_free(old_lv); > } > @@ -239,7 +239,7 @@ xlog_cil_insert_format_items( > * that the space reservation accounting is correct. > */ > *diff_iovecs -= lv->lv_niovecs; > - *diff_len -= lv->lv_buf_len; > + *diff_len -= lv->lv_bytes; > } else { > /* allocate new data chunk */ > lv = kmem_zalloc(buf_size, KM_SLEEP|KM_NOFS); > @@ -259,6 +259,7 @@ xlog_cil_insert_format_items( > > /* The allocated data region lies beyond the iovec region */ > lv->lv_buf_len = 0; > + lv->lv_bytes = 0; > lv->lv_buf = (char *)lv + buf_size - nbytes; > ASSERT(IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t))); > > -- > 1.9.0 > > _______________________________________________ > xfs mailing list > xfs@xxxxxxxxxxx > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs