On Wed, Jul 20, 2016 at 10:24:45AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> One of the problems we currently have with delayed logging is that
> under serious memory pressure we can deadlock memory reclaim. This
> occurs when memory reclaim (such as run by kswapd) is reclaiming XFS
> inodes and issues a log force to unpin inodes that are dirty in the
> CIL.
> 
> The CIL is pushed, but this will only occur once it gets the CIL
> context lock to ensure that all committing transactions are complete
> and no new transactions start being committed to the CIL while the
> push switches to a new context.
> 
> The deadlock occurs when the CIL context lock is held by a
> committing process that is doing memory allocation for log vector
> buffers, and that allocation is then blocked on memory reclaim
> making progress. Memory reclaim, however, is blocked waiting for
> a log force to make progress, and so we effectively deadlock at this
> point.
> 
> To solve this problem, we have to move the CIL log vector buffer
> allocation outside of the context lock so that memory reclaim can
> always make progress when it needs to force the log. The problem
> with doing this is that a CIL push can take place while we are
> determining if we need to allocate a new log vector buffer for
> an item and hence the current log vector may go away without
> warning. That means we cannot rely on the existing log vector being
> present when we finally grab the context lock, and so we must have a
> replacement buffer ready to go at all times.
> 
> To ensure this, introduce a "shadow log vector" buffer that is
> always guaranteed to be present when we gain the CIL context lock
> and format the item. This shadow buffer may or may not be used
> during the formatting, but if the log item does not have an existing
> log vector buffer or that buffer is too small for the new
> modifications, we swap it for the new shadow buffer and format
> the modifications into that new log vector buffer.
> 
> The result of this is that for any object we modify more than once
> in a given CIL checkpoint, we double the memory required
> to track dirty regions in the log. For single modifications we
> consume the shadow log vector we allocate on commit, and that gets
> consumed by the checkpoint. However, if we make multiple
> modifications, then the second transaction commit will allocate a
> shadow log vector and hence we will end up with double the memory
> usage as only one of the log vectors is consumed by the CIL
> checkpoint. The remaining shadow vector will be freed when the log
> item is freed.
> 
> This can probably be optimised in future - access to the shadow log
> vector is serialised by the object lock (as opposed to the active
> log vector, which is controlled by the CIL context lock) and so we
> can probably free the shadow log vector from some objects when the
> log item is marked clean on removal from the AIL.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_buf_item.c     |   1 +
>  fs/xfs/xfs_dquot.c        |   1 +
>  fs/xfs/xfs_dquot_item.c   |   2 +
>  fs/xfs/xfs_extfree_item.c |   2 +
>  fs/xfs/xfs_inode_item.c   |   1 +
>  fs/xfs/xfs_log_cil.c      | 258 ++++++++++++++++++++++++++++++++++------------
>  fs/xfs/xfs_trans.h        |   1 +
>  7 files changed, 202 insertions(+), 64 deletions(-)
> 
...
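The double buffering took me a minute to get straight, so for my own notes,
the per-item decision boils down to something like the following. This is a
stripped-down userspace sketch only, with a made-up pick_format_buffer()
helper and stand-in structures; the real logic lives in
xlog_cil_insert_format_items() and the shadow (re)allocation in
xlog_cil_alloc_shadow_bufs(). Accounting, ordered items and the size
rounding are all left out:

#include <stdio.h>

/* stand-ins for the kernel structures, only the fields the sketch needs */
struct xfs_log_vec {
	int	lv_size;	/* size of allocated buffer */
};

struct xfs_log_item {
	struct xfs_log_vec	*li_lv;		/* active log vector */
	struct xfs_log_vec	*li_lv_shadow;	/* standby vector */
};

/*
 * Hypothetical helper: pick the buffer to format the new modification
 * into. Reuse the active lv if it is still large enough (the common
 * overwrite case), otherwise switch to the preallocated shadow. The old
 * active lv is not freed here; the CIL checkpoint still owns it.
 */
static struct xfs_log_vec *
pick_format_buffer(struct xfs_log_item *lip)
{
	struct xfs_log_vec *shadow = lip->li_lv_shadow;

	if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size)
		return lip->li_lv;
	return shadow;
}

int main(void)
{
	struct xfs_log_vec active = { .lv_size = 256 };
	struct xfs_log_vec shadow = { .lv_size = 128 };
	struct xfs_log_item lip = { .li_lv = &active, .li_lv_shadow = &shadow };

	/* shadow fits inside the active buffer -> overwrite in place */
	printf("reuse active lv: %d\n", pick_format_buffer(&lip) == &active);

	/* a larger modification grows the shadow -> swap it in */
	shadow.lv_size = 512;
	printf("swap to shadow: %d\n", pick_format_buffer(&lip) == &shadow);
	return 0;
}

With that model the commit message reads straightforwardly to me: the shadow
is always present by the time we hold xc_ctx_lock, so the format step never
has to allocate under the lock.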
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 5e54e79..90ebd92 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -78,6 +78,157 @@ xlog_cil_init_post_recovery(
>  	log->l_cilp->xc_ctx->sequence = 1;
>  }
>  
> +static inline int
> +xlog_cil_iovec_space(
> +	uint	niovecs)
> +{
> +	return round_up((sizeof(struct xfs_log_vec) +
> +			 niovecs * sizeof(struct xfs_log_iovec)),
> +			sizeof(uint64_t));
> +}
> +
> +/*
> + * Allocate or pin log vector buffers for CIL insertion.
> + *
> + * The CIL currently uses disposable buffers for copying a snapshot of the
> + * modified items into the log during a push. The biggest problem with this is
> + * the requirement to allocate the disposable buffer during the commit if:
> + *	a) does not exist; or
> + *	b) it is too small
> + *
> + * If we do this allocation within xlog_cil_insert_format_items(), it is done
> + * under the xc_ctx_lock, which means that a CIL push cannot occur during
> + * the memory allocation. This means that we have a potential deadlock situation
> + * under low memory conditions when we have lots of dirty metadata pinned in
> + * the CIL and we need a CIL commit to occur to free memory.
> + *
> + * To avoid this, we need to move the memory allocation outside the
> + * xc_ctx_lock(), but because the log vector buffers are disposable, that opens

                 ^ no params here

> + * up a TOCTOU race condition w.r.t. the CIL commiting and removing the log

                                                 committing

> + * vector buffers between the check and the formatting of the item into the
> + * log vector buffer within the xc_ctx_lock.
> + *
> + * Because the log vector buffer needs to be unchanged during the CIL push
> + * process, we cannot share the buffer between the transaction commit (which
> + * modifies the buffer) and the CIL push context that is writing the changes
> + * into the log. This means skipping preallocation of buffer space is
> + * unreliable, but we most definitely do not want to be allocating and freeing
> + * buffers unnecessarily during commits when overwrites can be done safely.
> + *
> + * The simplest solution to this problem is to allocate a shadow buffer when a
> + * log item is committed for the second time, and then to only use this buffer
> + * if necessary. The buffer can remain attached to the log item until such time
> + * it is needed, and this is the buffer that is reallocated to match the size of
> + * the incoming modification. Then during the formatting of the item we can swap
> + * the active buffer with the new one if we can't reuse the existing buffer. We
> + * don't free the old buffer as it may be reused on the next modification if
> + * it's size is right, otherwise we'll free and reallocate it at that point.
> + *
> + * This function builds a vector for the changes in each log item in the
> + * transaction. It then works out the length of the buffer needed for each log
> + * item, allocates them and attaches the vector to the log item in preparation
> + * for the formatting step which occurs under the xc_ctx_lock.
> + *
> + * While this means the memory footprint goes up, it avoids the repeated
> + * alloc/free pattern that repeated modifications of an item would otherwise
> + * cause, and hence minimises the CPU overhead of such behaviour.
> + */
...
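Also, just to sanity check the new buffer layout in my head:
xlog_cil_iovec_space() rounds the vector header plus iovec array up to a
64-bit boundary and the data region starts immediately after it, so lv_buf
stays 8-byte aligned and (if I'm reading the snipped alloc code right)
lv_size ends up as xlog_cil_iovec_space(niovecs) plus the already-rounded
nbytes. Quick userspace check with stand-in structures and a generic
round_up(); the struct sizes differ from the kernel, the arithmetic doesn't:

#include <stdio.h>
#include <stdint.h>

/* stand-ins for the kernel structures; only the layout math matters here */
struct xfs_log_iovec { void *i_addr; uint32_t i_len; uint32_t i_type; };
struct xfs_log_vec {
	struct xfs_log_vec	*lv_next;
	int			lv_niovecs;
	struct xfs_log_iovec	*lv_iovecp;
	void			*lv_item;
	char			*lv_buf;
	int			lv_bytes;
	int			lv_buf_len;
	int			lv_size;
};

/* generic round_up; the kernel macro wants a power-of-two alignment */
#define round_up(x, y)	((((x) + (y) - 1) / (y)) * (y))

static int xlog_cil_iovec_space(unsigned int niovecs)
{
	return round_up(sizeof(struct xfs_log_vec) +
			niovecs * sizeof(struct xfs_log_iovec),
			sizeof(uint64_t));
}

int main(void)
{
	unsigned int niovecs = 3;
	unsigned int nbytes = 100;	/* pretend ->iop_size reported this */

	int hdr = xlog_cil_iovec_space(niovecs);
	int lv_size = hdr + (int)round_up(nbytes, sizeof(uint64_t));

	printf("header + iovec array: %d bytes\n", hdr);
	printf("lv_buf offset: %d (64-bit aligned: %s)\n",
	       hdr, hdr % 8 == 0 ? "yes" : "no");
	printf("lv_size: %d\n", lv_size);
	return 0;
}

So the data region no longer sits at the tail of the allocation (the old
"(char *)lv + buf_size - nbytes") but directly after the iovec array, which
is what the new lv_buf assignment in the hunk below does.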
> @@ -170,59 +326,29 @@ xlog_cil_insert_format_items(
>  	list_for_each_entry(lidp, &tp->t_items, lid_trans) {
>  		struct xfs_log_item *lip = lidp->lid_item;
>  		struct xfs_log_vec *lv;
> -		struct xfs_log_vec *old_lv;
> -		int	niovecs = 0;
> -		int	nbytes = 0;
> -		int	buf_size;
> +		struct xfs_log_vec *old_lv = NULL;
> +		struct xfs_log_vec *shadow;
>  		bool	ordered = false;
>  
>  		/* Skip items which aren't dirty in this transaction. */
>  		if (!(lidp->lid_flags & XFS_LID_DIRTY))
>  			continue;
>  
> -		/* get number of vecs and size of data to be stored */
> -		lip->li_ops->iop_size(lip, &niovecs, &nbytes);
> -
> -		/* Skip items that do not have any vectors for writing */
> -		if (!niovecs)
> -			continue;
> -
>  		/*
> -		 * Ordered items need to be tracked but we do not wish to write
> -		 * them. We need a logvec to track the object, but we do not
> -		 * need an iovec or buffer to be allocated for copying data.
> +		 * The formatting size information is already attached to
> +		 * the shadow lv on the log item.
>  		 */
> -		if (niovecs == XFS_LOG_VEC_ORDERED) {
> +		shadow = lip->li_lv_shadow;
> +		if (shadow->lv_buf_len == XFS_LOG_VEC_ORDERED)
>  			ordered = true;
> -			niovecs = 0;
> -			nbytes = 0;
> -		}
>  
> -		/*
> -		 * We 64-bit align the length of each iovec so that the start
> -		 * of the next one is naturally aligned. We'll need to
> -		 * account for that slack space here. Then round nbytes up
> -		 * to 64-bit alignment so that the initial buffer alignment is
> -		 * easy to calculate and verify.
> -		 */
> -		nbytes += niovecs * sizeof(uint64_t);
> -		nbytes = round_up(nbytes, sizeof(uint64_t));
> -
> -		/* grab the old item if it exists for reservation accounting */
> -		old_lv = lip->li_lv;
> -
> -		/*
> -		 * The data buffer needs to start 64-bit aligned, so round up
> -		 * that space to ensure we can align it appropriately and not
> -		 * overrun the buffer.
> -		 */
> -		buf_size = nbytes +
> -			   round_up((sizeof(struct xfs_log_vec) +
> -				     niovecs * sizeof(struct xfs_log_iovec)),
> -				    sizeof(uint64_t));
> +		/* Skip items that do not have any vectors for writing */
> +		if (!shadow->lv_niovecs && !ordered)
> +			continue;

It would be nice if we didn't need to introduce an allocation to
communicate this particular case, but probably not worth messing with at
this point. Otherwise looks good:

Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx>

>  
>  		/* compare to existing item size */
> -		if (lip->li_lv && buf_size <= lip->li_lv->lv_size) {
> +		old_lv = lip->li_lv;
> +		if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size) {
>  			/* same or smaller, optimise common overwrite case */
>  			lv = lip->li_lv;
>  			lv->lv_next = NULL;
> @@ -236,32 +362,29 @@ xlog_cil_insert_format_items(
>  			 */
>  			*diff_iovecs -= lv->lv_niovecs;
>  			*diff_len -= lv->lv_bytes;
> +
> +			/* Ensure the lv is set up according to ->iop_size */
> +			lv->lv_niovecs = shadow->lv_niovecs;
> +
> +			/* reset the lv buffer information for new formatting */
> +			lv->lv_buf_len = 0;
> +			lv->lv_bytes = 0;
> +			lv->lv_buf = (char *)lv +
> +					xlog_cil_iovec_space(lv->lv_niovecs);
>  		} else {
> -			/* allocate new data chunk */
> -			lv = kmem_zalloc(buf_size, KM_SLEEP|KM_NOFS);
> +			/* switch to shadow buffer! */
> +			lv = shadow;
>  			lv->lv_item = lip;
> -			lv->lv_size = buf_size;
>  			if (ordered) {
>  				/* track as an ordered logvec */
>  				ASSERT(lip->li_lv == NULL);
> -				lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
>  				goto insert;
>  			}
> -			lv->lv_iovecp = (struct xfs_log_iovec *)&lv[1];
>  		}
>  
> -		/* Ensure the lv is set up according to ->iop_size */
> -		lv->lv_niovecs = niovecs;
> -
> -		/* The allocated data region lies beyond the iovec region */
> -		lv->lv_buf_len = 0;
> -		lv->lv_bytes = 0;
> -		lv->lv_buf = (char *)lv + buf_size - nbytes;
>  		ASSERT(IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)));
> -
>  		lip->li_ops->iop_format(lip, lv);
>  insert:
> -		ASSERT(lv->lv_buf_len <= nbytes);
>  		xfs_cil_prepare_item(log, lv, old_lv, diff_len, diff_iovecs);
>  	}
>  }
> @@ -783,6 +906,13 @@ xfs_log_commit_cil(
>  	struct xlog		*log = mp->m_log;
>  	struct xfs_cil		*cil = log->l_cilp;
>  
> +	/*
> +	 * Do all necessary memory allocation before we lock the CIL.
> +	 * This ensures the allocation does not deadlock with a CIL
> +	 * push in memory reclaim (e.g. from kswapd).
> +	 */
> +	xlog_cil_alloc_shadow_bufs(log, tp);
> +
>  	/* lock out background commit */
>  	down_read(&cil->xc_ctx_lock);
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 9a462e8..9b2b9fa 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -52,6 +52,7 @@ typedef struct xfs_log_item {
>  	/* delayed logging */
>  	struct list_head		li_cil;		/* CIL pointers */
>  	struct xfs_log_vec		*li_lv;		/* active log vector */
> +	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
>  	xfs_lsn_t			li_seq;		/* CIL commit seq */
>  } xfs_log_item_t;
>  
> -- 
> 2.8.0.rc3
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs