On Thu, Oct 17, 2013 at 10:54:29AM -0500, Eric Sandeen wrote:
> On 10/14/13 5:17 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >
> > Recent analysis of a deadlocked XFS filesystem from a kernel
> > crash dump indicated that the filesystem was stuck waiting for log
> > space. The short story of the hang on the RHEL6 kernel is this:
> >
> >   - the tail of the log is pinned by an inode
> >   - the inode has been pushed by the xfsaild
> >   - the inode has been flushed to its backing buffer and is
> >     currently flush locked and hence waiting for backing
> >     buffer IO to complete and remove it from the AIL
> >   - the backing buffer is marked for write - it is on the
> >     delayed write queue
> >   - the inode buffer has been modified directly and logged
> >     recently due to unlinked inode list modification
> >   - the backing buffer is pinned in memory as it is in the
> >     active CIL context.
> >   - the xfsbufd won't start buffer writeback because it is
> >     pinned
> >   - xfssyncd won't force the log because it sees the log as
> >     needing to be covered and hence wants to issue a dummy
> >     transaction to move the log covering state machine along.
> >
> > Hence there is no trigger to force the CIL to the log and hence
> > unpin the inode buffer and therefore complete the inode IO, remove
> > it from the AIL and hence move the tail of the log along, allowing
> > transactions to start again.
> >
> > Mainline kernels also have the same deadlock, though the signature
> > is slightly different - the inode buffer never reaches the delayed
> > write lists because xfs_buf_item_push() sees that it is pinned and
> > hence never adds it to the delayed write list that the xfsaild
> > flushes.
> >
> > There are two possible solutions here. The first is to simply force
> > the log before trying to cover the log and so ensure that the CIL is
> > emptied before we try to reserve space for the dummy transaction in
> > the xfs_log_worker().
> > While this might work most of the time, it is
> > still racy and is no guarantee that we don't get stuck in
> > xfs_trans_reserve waiting for log space to come free. Hence it's not
> > the best way to solve the problem.
> >
> > The second solution is to modify xfs_log_need_covered() to be aware
> > of the CIL. We only should be attempting to cover the log if there
> > is no current activity in the log - covering the log is the process
> > of ensuring that the head and tail in the log on disk are identical
> > (i.e. the log is clean and at idle). Hence, by definition, if there
> > are items in the CIL then the log is not at idle and so we don't
> > need to attempt to cover it.
> >
> > When we don't need to cover the log because it is active or idle, we
> > issue a log force from xfs_log_worker() - if the log is idle, then
> > this does nothing. However, if the log is active due to there being
> > items in the CIL, it will force the items in the CIL to the log and
> > unpin them.
> >
> > In the case of the above deadlock scenario, instead of
> > xfs_log_worker() getting stuck in xfs_trans_reserve() attempting to
> > cover the log, it will instead force the log, thereby unpinning the
> > inode buffer, allowing IO to be issued and complete and hence
> > removing the inode that was pinning the tail of the log from the
> > AIL. At that point, everything will start moving along again. i.e.
> > the xfs_log_worker turns back into a watchdog that can alleviate
> > deadlocks based around pinned items that prevent the tail of the log
> > from being moved...
> >
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Reviewed-by: Eric Sandeen <sandeen@xxxxxxxxxx>

Applied.
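
For anyone reading this thread later, the decision described in the second solution can be sketched as a small userspace C model. This is only an illustration of the logic in the commit message, not the patch itself: every name below (struct model_log, model_log_need_covered(), model_log_force(), model_log_worker()) is a hypothetical stand-in, and the real kernel structures and the covering state machine are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the log state. */
struct model_log {
	int  cil_items;      /* items committed to the CIL, not yet in the log */
	bool buffer_pinned;  /* inode buffer pinned by the active CIL context */
	bool cover_needed;   /* covering state machine wants a dummy transaction */
};

enum worker_action { WORKER_COVER, WORKER_FORCE };

/*
 * CIL-aware check: if there are items in the CIL the log is, by
 * definition, not idle, so there is nothing to cover yet.
 */
bool model_log_need_covered(const struct model_log *log)
{
	if (log->cil_items > 0)
		return false;
	return log->cover_needed;
}

/*
 * Model of a log force: push the CIL contents to the log and unpin
 * them. On an idle log (empty CIL, nothing pinned) this changes nothing.
 */
void model_log_force(struct model_log *log)
{
	log->cil_items = 0;
	log->buffer_pinned = false;
}

/*
 * Model of the periodic worker: cover the log only when it is idle,
 * otherwise force it. In the deadlock scenario this takes the force
 * path instead of blocking on a reservation for the dummy transaction.
 */
enum worker_action model_log_worker(struct model_log *log)
{
	if (model_log_need_covered(log))
		return WORKER_COVER;	/* would issue the dummy transaction */

	model_log_force(log);
	return WORKER_FORCE;
}
```

Starting from the deadlocked state (one item in the CIL, the buffer pinned, covering wanted), the first call to model_log_worker() takes the force path and unpins the buffer. Only a later call, with the CIL empty, attempts to cover the log.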