On 10/8/13 7:31 PM, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > Recent analysis of a deadlocked XFS filesystem from a kernel > crash dump indicated that the filesystem was stuck waiting for log > space. The short story of the hang on the RHEL6 kernel is this: > > - the tail of the log is pinned by an inode > - the inode has been pushed by the xfsaild > - the inode has been flushed to it's backing buffer and is > currently flush locked and hence waiting for backing > buffer IO to complete and remove it from the AIL > - the backing buffer is marked for write - it is on the > delayed write queue > - the inode buffer has been modified directly and logged > recently due to unlinked inode list modification > - the backing buffer is pinned in memory as it is in the > active CIL context. > - the xfsbufd won't start buffer writeback because it is > pinned > - xfssyncd won't force the log because it sees the log as > needing to be covered and hence wants to issue a dummy > transaction to move the log covering state machine along. > > Hence there is no trigger to force the CIL to the log and hence > unpin the inode buffer and therefore complete the inode IO, remove > it from the AIL and hence move the tail of the log along, allowing > transactions to start again. > > Mainline kernels also have the same deadlock, though the signature > is slightly different - the inode buffer never reaches the delayed > write lists because xfs_buf_item_push() sees that it is pinned and > hence never adds it to the delayed write list that the xfsaild > flushes. > > There are two possible solutions here. The first is to simply force > the log before trying to cover the log and so ensure that the CIL is > emptied before we try to reserve space for the dummy transaction in > the xfs_log_worker(). While this might work most of the time, it is > still racy and is no guarantee that we don't get stuck in > xfs_trans_reserve waiting for log space to come free. Hence it's not > the best way to solve the problem. > > The second solution is to modify xfs_log_need_covered() to be aware > of the CIL. We only should be attempting to cover the log if there > is no current activity in the log - covering the log is the process > of ensuring that the head and tail in the log on disk are identical > (i.e. the log is clean and at idle). Hence, by definition, if there > are items in the CIL then the log is not at idle and so we don't > need to attempt to cover it. > > When we don't need to cover the log because it is active or idle, we > issue a log force from xfs_log_worker() - if the log is idle, then > this does nothing. However, if the log is active due to there being > items in the CIL, it will force the items in the CIL to the log and > unpin them. > > In the case of the above deadlock scenario, instead of > xfs_log_worker() getting stuck in xfs_trans_reserve() attempting to > cover the log, it will instead force the log, thereby unpinning the > inode buffer, allowing IO to be issued and complete and hence > removing the inode that was pinning the tail of the log from the > AIL. At that point, everything will start moving along again. i.e. > the xfs_log_worker turns back into a watchdog that can alleviate > deadlocks based around pinned items that prevent the tail of the log > from being moved... Seems reasonable to me. Minor cosmetic things below, take them or leave them... > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/xfs_log.c | 48 +++++++++++++++++++++++++++++------------------- > fs/xfs/xfs_log_cil.c | 14 ++++++++++++++ > fs/xfs/xfs_log_priv.h | 10 ++++------ > 3 files changed, 47 insertions(+), 25 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index a2dea108..613ed94 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -1000,27 +1000,34 @@ xfs_log_space_wake( > } > > /* > - * Determine if we have a transaction that has gone to disk > - * that needs to be covered. To begin the transition to the idle state > - * firstly the log needs to be idle (no AIL and nothing in the iclogs). > - * If we are then in a state where covering is needed, the caller is informed > - * that dummy transactions are required to move the log into the idle state. > + * Determine if we have a transaction that has gone to disk that needs to be > + * covered. To begin the transition to the idle state firstly the log needs to > + * be idle. That means the CIL, the AIL and the iclogs needs to be empty before > + * we start attempting to cover the log. > * > - * Because this is called as part of the sync process, we should also indicate > - * that dummy transactions should be issued in anything but the covered or > - * idle states. This ensures that the log tail is accurately reflected in > - * the log at the end of the sync, hence if a crash occurrs avoids replay > - * of transactions where the metadata is already on disk. > + * Only if we are then in a state where covering is needed, the caller is > + * informed that dummy transactions are required to move the log into the idle > + * state. > + * > + * If there are any items in the AIl or CIL, then we do not want to attempt to > + * cover the log as we may be in a situation where there isn't log space > + * available to run a dummy transaction and this can lead to deadlocks when the > + * tail of the log is pinned by an item that is modified in the CIL. Hence > + * there's no point in running a dummy transaction at this point because we > + * can't start trying to idle the log until both the CIL and AIL are empty. > */ > int > xfs_log_need_covered(xfs_mount_t *mp) > { > - int needed = 0; > struct xlog *log = mp->m_log; > + int needed = 0; > > if (!xfs_fs_writable(mp)) > return 0; > > + if (!xlog_cil_empty(log)) > + return 0; > + > spin_lock(&log->l_icloglock); > switch (log->l_covered_state) { > case XLOG_STATE_COVER_DONE: > @@ -1029,14 +1036,17 @@ xfs_log_need_covered(xfs_mount_t *mp) This hunk is all cosmetic, right? (nice cosmetic, but cosmetic). Kinda wish this were in a patch 2/2 just for clarity. > break; > case XLOG_STATE_COVER_NEED: > case XLOG_STATE_COVER_NEED2: > - if (!xfs_ail_min_lsn(log->l_ailp) && > - xlog_iclogs_empty(log)) { > - if (log->l_covered_state == XLOG_STATE_COVER_NEED) > - log->l_covered_state = XLOG_STATE_COVER_DONE; > - else > - log->l_covered_state = XLOG_STATE_COVER_DONE2; > - } > - /* FALLTHRU */ > + if (xfs_ail_min_lsn(log->l_ailp)) > + break; > + if (!xlog_iclogs_empty(log)) > + break; > + > + needed = 1; > + if (log->l_covered_state == XLOG_STATE_COVER_NEED) > + log->l_covered_state = XLOG_STATE_COVER_DONE; > + else > + log->l_covered_state = XLOG_STATE_COVER_DONE2; > + break; > default: > needed = 1; > break; > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index cfe9797..da8524e77 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -711,6 +711,20 @@ xlog_cil_push_foreground( > xlog_cil_push(log); > } > > +bool > +xlog_cil_empty( > + struct xlog *log) > +{ > + struct xfs_cil *cil = log->l_cilp; > + bool empty = false; > + > + spin_lock(&cil->xc_push_lock); > + if (list_empty(&cil->xc_cil)) > + empty = true; > + spin_unlock(&cil->xc_push_lock); > + return empty; > +} maybe just: xlog_cil_empty( struct xlog *log) { struct xfs_cil *cil = log->l_cilp; bool empty; spin_lock(&cil->xc_push_lock); empty = list_empty(&cil->xc_cil); spin_unlock(&cil->xc_push_lock); return empty; } but *shrug* (That was Zach's idea) ;) -Eric > + > /* > * Commit a transaction with the given vector to the Committed Item List. > * > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index 136654b..de24ffb 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -514,12 +514,10 @@ xlog_assign_grant_head(atomic64_t *head, int cycle, int space) > /* > * Committed Item List interfaces > */ > -int > -xlog_cil_init(struct xlog *log); > -void > -xlog_cil_init_post_recovery(struct xlog *log); > -void > -xlog_cil_destroy(struct xlog *log); > +int xlog_cil_init(struct xlog *log); > +void xlog_cil_init_post_recovery(struct xlog *log); > +void xlog_cil_destroy(struct xlog *log); > +bool xlog_cil_empty(struct xlog *log) > > /* > * CIL force routines > _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs