On Wed 06-04-11 15:40:05, Dave Chinner wrote: > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote: > > On Fri 01-04-11 10:40:50, Dave Chinner wrote: > > > On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote: > > > > On Thu, 17 Feb 2011 11:45:52 +0100 > > > > Jan Kara <jack@xxxxxxx> wrote: > > > > > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote: > > > > > > (2011/02/16 23:56), Jan Kara wrote: > > > > > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote: > > > > > > >>On Tue, 15 Feb 2011 18:29:54 +0100 > > > > > > >>Jan Kara<jack@xxxxxxx> wrote: > > > > > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote: > > > > > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote: > > > > > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO > > > > > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you > > > > > > >>>>>describe above. > > > > > > >>>> > > > > > > >>>>One of the fundamental problems here is that the freeze and thaw > > > > > > >>>>routines are using down_write(&sb->s_umount) for two purposes. The > > > > > > >>>>first is to prevent the resume/thaw from racing with a umount (which > > > > > > >>>>it could do just as well by taking a read lock), but the second is to > > > > > > >>>>prevent the resume/thaw code from racing with itself. That's the core > > > > > > >>>>fundamental problem here. > > > > > > >>>> > > > > > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and > > > > > > >>>>having the the resume/thaw first take the s_freeze mutex and then > > > > > > >>>>second take a read lock on the s_umount. > > > > > > >>> Sadly this does not quite work because even down_read(&sb->s_umount) > > > > > > >>>in thaw_super() can block if there is another process that tries to acquire > > > > > > >>>s_umount for writing - a situation like: > > > > > > >>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze) > > > > > > >>>down_read(&sb->s_umount) > > > > > > >>> block on s_frozen > > > > > > >>> down_write(&sb->s_umount) > > > > > > >>> -blocked > > > > > > >>> down_read(&sb->s_umount) > > > > > > >>> -blocked > > > > > > >>>behind the write access... > > > > > > >>> > > > > > > >>>The only working solution I see is to check for frozen filesystem before > > > > > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if > > > > > > >>>we did so in some well described wrapper). > > > > > > >>I created the patch that you imagine yesterday. > > > > > > >> > > > > > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel > > > > > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened. > > > > > > >> > > > > > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen > > > > > > >>after 12 hours passed. > > > > > > >> > > > > > > >>The patch for linux-2.6.38-rc4 is as follows: > > > > > > >>--- > > > > > > >> fs/fs-writeback.c | 2 +- > > > > > > >> 1 files changed, 1 insertions(+), 1 deletions(-) > > > > > > >> > > > > > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > > > > > > >>index 59c6e49..1c9a05e 100644 > > > > > > >>--- a/fs/fs-writeback.c > > > > > > >>+++ b/fs/fs-writeback.c > > > > > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb) > > > > > > >> spin_unlock(&sb_lock); > > > > > > >> > > > > > > >> if (down_read_trylock(&sb->s_umount)) { > > > > > > >>- if (sb->s_root) > > > > > > >>+ if (sb->s_frozen == SB_UNFROZEN&& sb->s_root) > > > > > > >> return true; > > > > > > >> up_read(&sb->s_umount); > > > > > > > > > > > > > So this is something along the lines I thought but it actually won't work > > > > > > >for example if sync(1) is run while the filesystem is frozen (that takes > > > > > > >s_umount semaphore in a different place). And generally, I'm not convinced > > > > > > >there are not other places that try to do IO while holding s_umount > > > > > > >semaphore... > > > > > > OK. I understand. > > > > > > > > > > > > This code only fixes the case for the following path: > > > > > > writeback_inodes_wb > > > > > > -> ext4_da_writepages > > > > > > -> ext4_journal_start_sb > > > > > > -> vfs_check_frozen > > > > > > But, the code doesn't fix the other cases. > > > > > > > > > > > > We must modify the local filesystem part in order to fix all cases...? > > > > > Yes, possibly. But most importantly we should first find clear locking > > > > > rules for frozen filesystem that avoid deadlocks like the one above. And > > > > > the freezing / unfreezing code might become subtle for that reason, that's > > > > > fine, but it would be really good to avoid any complicated things for the > > > > > code in the rest of the VFS / filesystems. > > > > I have deeply continued to examined the root cause of this problem, then > > > > I found it. > > > > > > > > It is that we can write a memory which is mmaped to a file. Then the memory > > > > becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to > > > > "writeback" the memory. > > > > > > Then surely the issue is that .page_mkwrite is not checking that the > > > filesystem is frozen before allowing the page fault to continue and > > > dirty the page? > > And is this a bug? That isn't clear to me... > > Given the semantics of a frozen filesystem, letting any object be > dirtied while frozen (be it an inode, a page, a metadata block, etc) > is definitely a bug. > > The way the freeze code is architected is that incoming dirtying > events are prevented so that the writeback side does not need to > care about the frozen state of the filesystem at all. The freeze > operation is supposed to block new dirtiers, then flush all dirty > objects resulting in everything being clean in the filesystem. > > Hence if no objects are being dirtied, then there should never be > any need to block writeback threads due to the filesytem being > frozen because, by definition, there should be no work for them to > do. Hence if objects are being dirtied while the filesystem is > frozen, then that is a bug. OK, after some thought I start to agree with you that it would be nice if we didn't allow the pages to be dirtied at the first place. Otherwise things get a bit fragile as writing a data block does *not* need a transaction start as such (we just happen to do it in all code paths)... > > > > I think the best approach to fix this problem is to let users not to write > > > > memory which is mapped to a certain file while the filesystem is freezing. > > > > However, it is very difficult to control users not to write memory which has > > > > been already mapped to the file. > > > > > > If you don't allow the page to be dirtied in the fist place, then > > > nothing needs to be done to the writeback path because there is > > > nothing dirty for it to write back. > > Sure but that's only the problem he was able to hit. But generally, > > there's a problem with needing s_umount for unfreezing because it isn't > > clear there aren't other code paths which can block with s_umount held > > waiting for fs to get unfrozen. And these code paths would cause the same > > deadlock. That's why I chose to get rid of s_umount during thawing. > > Holding the s_umount lock while checking if frozen and sleeping > is essentially an ABBA lock inversion bug that can bite in many more > places that just thawing the filesystem. Any where this is done should > be fixed, so I don't think just removing the s_umount lock from the thaw > path is sufficient to avoid problems. That's easily said but hard to do - any transaction start in ext3/4 may block on filesystem being frozen (this seems to be similar for XFS as I'm looking into the code) and transaction start traditionally nests inside s_umount (and basically there's no way around that since sync() calls your fs code with s_umount held). So I'm afraid we are not going to get rid of this ABBA dependency unless we declare that s_umount ranks above filesystem being frozen - but surely I'm open to suggestions. Another possibility is just to hide the problem e.g. by checking for frozen filesystem whenever we try to get s_umount. But that looks a bit ugly to me. Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html