xfs_reclaim_inodes_ag() switches to a blocking, unbounded wait on
pag_ici_reclaim_lock for its second pass over the AGs, even when
SYNC_TRYLOCK is set. The second pass runs only if an AG was skipped in
the first pass, SYNC_WAIT is set and *nr_to_scan is still positive.
That can deadlock in the following situation:

1) Under heavy memory pressure, process A is in direct memory reclaim.
   It holds the pag_ici_reclaim_lock mutex and waits for
   xfs_inode.i_pincount to drop to zero.

2) i_pincount was raised when the xfs_inode was added to a journal
   transaction, and it is expected to drop when the I/O for that
   transaction completes. Step 1) happens after i_pincount is raised
   and before the transaction I/O is issued.

3) Process B now issues the transaction I/O. The I/O path can be more
   complex than one might expect, and memory allocation and direct
   reclaim happen in it too. During the second pass over the AGs,
   process B blocks on pag_ici_reclaim_lock, which process A holds.

So process A holds pag_ici_reclaim_lock while waiting for the I/O to
complete, and process B, which has to issue that I/O, is blocked on
pag_ici_reclaim_lock. That is a deadlock.

The fix: do not switch to a blocking wait when SYNC_TRYLOCK is set. To
avoid spinning for a long time, walk each AG only once.

Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx>
---
 fs/xfs/xfs_icache.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8dc2e5414276..e2a6ab04db3d 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1245,11 +1245,8 @@ xfs_reclaim_inodes_ag(
 	int last_error = 0;
 	xfs_agnumber_t ag;
 	int trylock = flags & SYNC_TRYLOCK;
-	int skipped;

-restart:
 	ag = 0;
-	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long first_index = 0;
 		int done = 0;
@@ -1259,7 +1256,6 @@ xfs_reclaim_inodes_ag(

 		if (trylock) {
 			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
-				skipped++;
 				xfs_perag_put(pag);
 				continue;
 			}
@@ -1340,17 +1336,6 @@ xfs_reclaim_inodes_ag(
 		xfs_perag_put(pag);
 	}

-	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
-	 * another pass this time using blocking reclaim semantics (i.e
-	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
-	 * ensure that when we get more reclaimers than AGs we block rather
-	 * than spin trying to execute reclaim.
-	 */
-	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
-		goto restart;
-	}
 	return last_error;
 }

--
2.21.0
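
For illustration only, the single-pass trylock walk that remains after this change can be sketched in user space with pthreads. This is not the kernel code: struct ag, reclaim_one_pass() and the reclaimable counter are hypothetical stand-ins for struct xfs_perag, xfs_reclaim_inodes_ag() and the per-AG reclaim work, and a pthread mutex stands in for pag_ici_reclaim_lock. The point it tries to show is that a contended AG is skipped and never waited on, so a caller that already sits in an I/O path cannot block behind another reclaimer.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct xfs_perag: one reclaim lock per AG. */
struct ag {
	pthread_mutex_t reclaim_lock;	/* plays the role of pag_ici_reclaim_lock */
	int reclaimable;		/* inodes waiting for reclaim in this AG */
};

/*
 * Walk every AG exactly once. If an AG's lock is contended, skip that AG
 * instead of waiting for it; there is no second, blocking pass.
 * Returns the number of inodes reclaimed and decrements *nr_to_scan.
 */
int reclaim_one_pass(struct ag *ags, int nr_ags, int *nr_to_scan)
{
	int reclaimed = 0;

	assert(ags != NULL && nr_to_scan != NULL);

	for (int i = 0; i < nr_ags && *nr_to_scan > 0; i++) {
		struct ag *ag = &ags[i];

		/* Non-zero return means the lock is busy: skip, never block. */
		if (pthread_mutex_trylock(&ag->reclaim_lock) != 0)
			continue;

		while (ag->reclaimable > 0 && *nr_to_scan > 0) {
			ag->reclaimable--;
			(*nr_to_scan)--;
			reclaimed++;
		}
		pthread_mutex_unlock(&ag->reclaim_lock);
	}
	return reclaimed;
}
```

The trade-off in this sketch is the same one the patch accepts: work in a busy AG is left for a later call rather than being waited for.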