On Sun, Jul 31, 2011 at 02:25:27PM -1000, Linus Torvalds wrote: > On Sun, Jul 31, 2011 at 1:47 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > Yes, I already have, a couple of hours before you sent this: > > > > http://www.spinics.net/lists/linux-fsdevel/msg47357.html > > > > We haven't found the root cause of the problem, and writeback cannot > > hold off grab_super_passive() because writeback only holds read > > locks on s_umount, just like grab_super_passive. > > With read-write semaphores, even read-vs-read recursion is a deadlock > possibility. > > Why? Because if a writer comes in on another thread, while the read > lock is initially held, then the writer will now block. And due to > fairness, now a subsequent reader will *also* block. Yes, I know. > So no, nesting readers is *not* allowed for rw_semaphores even if > naively you'd think it should work. Right, hence my question of where the deadlock was. I can see how we get a temporary holdoff due to background flushing holding the s_umount lock (and hence why this change prevents that holdoff) but that should not deadlock even if there is a pending write lock. > So if xfstests 073 does mount/umount testing, then it is entirely > possible that a reader blocks another reader due to a pending writer. And I know that remount,ro will take the lock in write mode, hence cause such a holdoff to occur. Test 073 does this, and it is likely to be the cause of the problem. However, if we can't grab the superblock, it implies it that there is something major happening on the superblock. It's likely to be going away, or in the case of a remount, something significant is changing. In both cases, writeback of dirty data is going to be handled by the umount/remount process, and so whatever writeback that is currently in progress is just getting in the way of executing the higher level, more important operation. AFAICT, there's a bunch of non-trivial ways that we can get read-write-read deadlocks that the bdi-flusher "avoids" by ignoring inodes when it can't get the superblock. However, if it can't get the superblock reference, why shouldn't we just abort whatever writeback that is going through writeback_inodes_wb? That writeback is guaranteed to be unable to complete until the reference to s_umount is dropped, so why do we continue to try one inode at a time? IOWs, what I'm asking is whether this "just move the inodes one at a time to a different queue" is just a bandaid for a particular symptom of a deeper problem we haven't realised existed.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html