On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> On Fri, Jul 29, 2011 at 10:21:21PM +0800, Wu Fengguang wrote:
> > I cannot reproduce the bug. However looking through the code, I find
> > the only possible place that may keep wb_writeback() looping with
> > wb->list_lock grabbed is the below requeue_io() call.
> >
> > Would you try the patch? Note that even if it fixed the soft lockup,
> > it may not be suitable as the final fix.
>
> This patch fixes the hang for me.

Great. It means grab_super_passive() always returns false for up to
22s, due to

a) list_empty(&sb->s_instances), which is very unlikely

b) failure to grab &sb->s_umount

So the chances are s_umount is mostly taken by others during the 22s.
Maybe some task other than the flusher is actively doing writeback.

These callers are unlikely candidates, since they only do _small_
writes that hardly take one second:

bdi_forker_thread:
        writeback_inodes_wb(&bdi->wb, 1024);
balance_dirty_pages:
        writeback_inodes_wb(&bdi->wb, write_chunk);

However the writeback_inodes_sb*() and sync_inodes_sb() functions will
very likely take dozens of seconds to complete. They have the same
pattern of

        down_read(&sb->s_umount);
        bdi_queue_work(sb->s_bdi, &work);
        wait_for_completion(&done);
        up_read(&sb->s_umount);

Note that s_umount is grabbed as early as bdi_queue_work() time, while
the flusher may still be actively working on some other work. And since
the for_background/for_kupdate works will quit on seeing other pending
works, the soft lockup should only happen when the flusher is executing
some nr_pages=LARGE work when there comes a sync() which calls
writeback_inodes_sb() for the wait=0 sync stage.
If we simply apply the change

                if (!grab_super_passive(sb)) {
-                       requeue_io(inode, wb);
+                       redirty_tail(inode, wb);
                        continue;
                }

some inodes' writeback will be undesirably delayed; however, they are
likely to be picked up by later writeback_inodes_sb*(), sync_inodes_sb()
and writeback_inodes_wb() calls, which don't check inode dirty age, so
it should not be a big problem. The change is also safe for sync*()
because they won't go through the conditional grab_super_passive() code
path at all.

So I'd propose this patch as the reasonable fix for 3.1. In the long
term, we may further consider making the nr_pages works give up
temporarily when there comes a sync work, which could eliminate lots of
redirty_tail()s at this point.

---
Subject: redirty the inode on failed grab_super_passive()
Date: Fri Jul 29 22:14:35 CST 2011

This fixes a soft lockup, which happens when

a) the flusher is working on a work queued by __bdi_start_writeback(),
   while

b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
   grab sb->s_umount and enqueue a new work for the flusher to execute

The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
Then if the inode is requeued, wb_writeback() will busy retry on it.
As a result, wb_writeback() loops forever without releasing
wb->list_lock, which further blocks other tasks.

Fix the busy loop by redirtying the inode. This may undesirably delay
the writeback of the inode, however most likely it will be picked up
soon by the queued work from writeback_inodes_sb*(), sync_inodes_sb()
or even writeback_inodes_wb().
Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
---
 fs/fs-writeback.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- linux.orig/fs/fs-writeback.c	2011-07-29 22:14:18.000000000 +0800
+++ linux/fs/fs-writeback.c	2011-07-31 17:04:25.000000000 +0800
@@ -618,7 +618,12 @@ static long __writeback_inodes_wb(struct
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			requeue_io(inode, wb);
+			/*
+			 * grab_super_passive() may fail consistently due to
+			 * s_umount being grabbed by someone else. So redirty
+			 * the inode to avoid busy loop.
+			 */
+			redirty_tail(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);