On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> On Fri, Jul 29, 2011 at 10:21:21PM +0800, Wu Fengguang wrote:
> > I cannot reproduce the bug. However looking through the code, I find
> > the only possible place that may keep wb_writeback() looping with
> > wb->list_lock grabbed is the below requeue_io() call.
> >
> > Would you try the patch? Note that even if it fixed the soft lockup,
> > it may not be suitable as the final fix.
>
> This patch fixes the hang for me.

Great. It means grab_super_passive() always returns false for up to
22s, due to

a) list_empty(&sb->s_instances), which is very unlikely

b) failure to grab &sb->s_umount

So the chances are s_umount is mostly taken by others during the 22s.
Maybe some task other than the flusher is actively doing writeback.

These callers are unlikely candidates, since they only do _small_
writes that hardly take one second:

bdi_forker_thread:
        writeback_inodes_wb(&bdi->wb, 1024);
balance_dirty_pages:
        writeback_inodes_wb(&bdi->wb, write_chunk);

However the writeback_inodes_sb*() and sync_inodes_sb() functions will
very likely take dozens of seconds to complete. They have the same
pattern of

        down_read(&sb->s_umount);
        bdi_queue_work(sb->s_bdi, &work);
        wait_for_completion(&done);
        up_read(&sb->s_umount);

Note that s_umount is grabbed as early as bdi_queue_work() time, while
the flusher may still be actively working on some other work. And since
the for_background/for_kupdate works will quit on seeing other pending
works, the soft lockup should only happen when the flusher is executing
some nr_pages=LARGE work when there comes a sync() which calls
writeback_inodes_sb() for the wait=0 sync stage.
If we simply apply the change

                if (!grab_super_passive(sb)) {
-                       requeue_io(inode, wb);
+                       redirty_tail(inode, wb);
                        continue;
                }

some inodes' writeback will be undesirably delayed; however, they are
likely to be picked up by later writeback_inodes_sb*(), sync_inodes_sb()
and writeback_inodes_wb() calls, which don't check inode dirty age, so
it should not be a big problem. The change is also safe for sync*()
because they won't go through the conditional grab_super_passive() code
path at all.

So I'd propose this patch as the reasonable fix for 3.1. In the long
term, we may further consider making the nr_pages works give up
temporarily when there comes a sync work, which could eliminate lots of
redirty_tail()s at this point.

---
Subject: redirty the inode on failed grab_super_passive()
Date: Fri Jul 29 22:14:35 CST 2011

This fixes a soft lockup, which happens when

a) the flusher is working on a work queued by __bdi_start_writeback(),
   while

b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
   grab sb->s_umount and enqueue a new work for the flusher to execute

The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
Then if the inode is requeued, wb_writeback() will busy retry on it.
As a result, wb_writeback() loops forever without releasing
wb->list_lock, which further blocks other tasks.

Fix the busy loop by redirtying the inode. This may undesirably delay
the writeback of the inode, however most likely it will be picked up
soon by the queued work from writeback_inodes_sb*(), sync_inodes_sb()
or even writeback_inodes_wb().
Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
---
 fs/fs-writeback.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- linux.orig/fs/fs-writeback.c	2011-07-29 22:14:18.000000000 +0800
+++ linux/fs/fs-writeback.c	2011-07-31 17:04:25.000000000 +0800
@@ -618,7 +618,12 @@ static long __writeback_inodes_wb(struct
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			requeue_io(inode, wb);
+			/*
+			 * grab_super_passive() may fail consistently due to
+			 * s_umount being grabbed by someone else. So redirty
+			 * the inode to avoid busy loop.
+			 */
+			redirty_tail(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);