On Mon, Aug 01, 2011 at 10:09:51AM +0800, Dave Chinner wrote:
> On Sun, Jul 31, 2011 at 03:40:20PM -1000, Linus Torvalds wrote:
> > On Sun, Jul 31, 2011 at 3:28 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > IOWs, what I'm asking is whether this "just move the inodes one at a
> > > time to a different queue" is just a bandaid for a particular
> > > symptom of a deeper problem we haven't realised existed....
> >
> > Deeper problems in writeback? Unpossible.
>
> Heh.
>
> But that's exactly why I'd like to understand the problem fully.
>
> > The writeback code has pretty much always been just a collection of
> > "bandaids for particular symptoms of deeper problems". So let's just
> > say I'd not be shocked. But what else would you suggest? You could
> > just break out of the loop if you can't get the read lock, but while
> > the *common* case is likely that a lot of the inodes are on the same
> > filesystem, that's certainly not the only possible case.
>
> Right, but in this specific case of executing writeback_inodes_wb(),
> we can only be operating on a specific bdi without being told which
> sb to flush. If we are told which sb, then we go through
> __writeback_inodes_sb() and avoid the grab_super_passive()
> altogether because some other thread holds the s_umount lock.
>
> These no-specific-sb cases can come only from
> wb_check_background_flush() or wb_check_old_data_flush() which, by
> definition, are opportunistic background asynchronous writeback
> executed only when there is no other work to do. Further, if there
> is new work queued while they are running, they abort.

There is another type of work that won't abort: the one started by
__bdi_start_writeback(). I'd call it "nr_pages" work, since its
termination condition is simply nr_pages and nothing more. Unlike the
for_background or for_kupdate works, it does not abort as soon as
other works are queued.

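To make the distinction concrete, here is a toy Python model (not kernel
code) of the termination rules described above. The names Work,
should_stop() and run() are made up for illustration; only nr_pages,
for_background and for_kupdate come from the discussion, and the real
flusher checks more conditions than this sketch does.

```python
from dataclasses import dataclass


@dataclass
class Work:
    """Toy stand-in for a writeback work item (illustrative only)."""
    nr_pages: int
    for_background: bool = False
    for_kupdate: bool = False


def should_stop(work: Work, other_work_queued: bool) -> bool:
    """Return True when the flusher should stop processing this work."""
    # Every kind of work stops once its page budget is used up.
    if work.nr_pages <= 0:
        return True
    # Background and kupdate works also yield to newly queued work.
    if (work.for_background or work.for_kupdate) and other_work_queued:
        return True
    # A plain "nr_pages" work keeps going regardless of the queue.
    return False


def run(work: Work, pages_per_pass: int, new_work_at_pass: int) -> int:
    """Run passes until the work terminates; return the pass count.

    new_work_at_pass is the (hypothetical) pass index from which some
    other work, e.g. a sync, is sitting in the queue.
    """
    passes = 0
    while not should_stop(work, other_work_queued=passes >= new_work_at_pass):
        work.nr_pages -= pages_per_pass
        passes += 1
    return passes


if __name__ == "__main__":
    print(run(Work(1024, for_background=True), 64, 3))  # yields early
    print(run(Work(1024), 64, 3))                       # runs to the end
```

In this model the background work gives up as soon as other work shows
up, while the nr_pages work runs until its budget is exhausted no matter
what is queued behind it.
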

Here I listed the two conditions for the deadlock (missing the 3rd
one: the read-write-read lock):

        http://lkml.org/lkml/2011/7/31/63

In particular, the deadlock, once triggered, does not depend on how
large nr_pages is. It can be fixed by either of

1) the flusher aborts the work early
2) the flusher doesn't busy-retry the inode(s)

In the other email, I proposed to fix (2) for now and then do (1) in
the future:

: So I'd propose this patch as the reasonable fix for 3.1. In long term,
: we may further consider make the nr_pages works give up temporarily
: when there comes a sync work, which could eliminate lots of
: redirty_tail()s at this point.

> Hence if we can't grab the superblock here, it is simply another
> case of a "new work pending" interrupt, right? And so aborting the
> work is the correct thing to do? Especially as it avoids all the
> ordering problems of redirtying inodes and allows the writeback work
> to restart (from whatever context it is started from next time) where
> it stopped.

The long term solution (1) I proposed is actually the same as your
proposal to abort the work :)

Thanks,
Fengguang