On Tue, Mar 13, 2012 at 10:30:20PM +0100, Jan Kara wrote:
> On Tue 13-03-12 10:48:17, Dave Chinner wrote:
> > On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> > > On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > > I just had to change some properties of the shell to be usable for freezing as well. But the shell has to be maintained regardless of how we decide to do freezing.
> > >
> > > Also believe me that I initially tried to fix the freezing without the external shell. But it just isn't enough to prevent dirty inodes from being created (e.g. from directory operations) while the sync during freezing is running.
> >
> > Sure - the old mechanism was "freeze data", which only protected data modification paths, followed by "freeze trans", which protected metadata modification paths (like directory operations). You started by removing the metadata modification path protection - is it any wonder that those operations continued to dirty inodes until you expanded the "freeze data" shell to mean "freeze data and metadata"?
>
> That's just not true. Remember older versions of this patch set. The old model does:
>   freeze data
>   sync
>   freeze metadata (via freezing transactions)
> and it is broken mainly because while sync is running, new metadata are dirtied and thus you end up with dirty metadata on a frozen filesystem.

But that's exactly the issue that the filesystem .freeze_fs method is supposed to handle. It's supposed to finish the freeze process by completing all outstanding metadata operations and cleaning all remaining dirty metadata. It still has to guarantee this regardless of the model used prior to this.

> Forcing the journal commit in ->freeze_fs() pushes changes to disk but inodes still remain dirty and, actually, it can be too late for the flusher thread anyway because it can already be hanging trying to start a transaction.

If the changes are already in the journal, then there is no new transaction work needed to clean the VFS state. Indeed, .freeze_fs needs to write back all dirty metadata objects that it is tracking internally in the journal so that a snapshot of the frozen image is consistent and can be mounted read-only without running recovery...

> > It's like the hard shell/soft center security model - it protects well from external attackers, but once they get inside, there's no protection. Indeed, there is zero protection from inside jobs, and that's where multiple layers of defense are needed.
> >
> > Your freeze changes provide us with a hard outer shell, but it's got a really, really soft center. We can't use the outer shell defenses deep inside the shell because of the locking orders that have been defined (freeze protection must be outermost), so we need another layer....
>
> I think there are two different issues. One is your complaint that extending the shell to also cover internal sources of fs modification will be too fragile if not done implicitly on transaction start.

Most definitely. It means that we have to have freeze behaviour in mind when doing any async/background work. That's usually the furthest thing from people's minds when implementing stuff like background inode initialisation...

> Another is the question of whether we could reasonably fend off internal modification using the current two counters. The MRU thread or ->release can be easily handled with them (->release can be handled by the counter ranking below mmap_sem). For shrinkers, I'm not that certain. Certainly the outer counter won't work. The inner one with lock ranking just below mmap_sem might work - it depends on whether there are any XFS locks under mmap_sem from within which we'd do GFP_KERNEL allocations. But here I agree the issues might be nasty enough to warrant another counter.

Might be nasty? There's no question about it in my mind that it *would* be nasty....

> This is connected with another somewhat related issue: do we want shrinkers to block on a frozen filesystem? That could result in unrelated processes blocking on a frozen filesystem. I'd find it nicer if we just skipped the writes when the filesystem is frozen, as it seems to me xfs_free_eofblocks() is somewhat optional. Is that true?

Not really. It needs to be done on rw filesystems[*]. Besides, this isn't really an XFS-specific problem - the shrinkers need to operate on frozen filesystems. e.g. run a directory traversal of a frozen filesystem, and if you stop the shrinkers from running then the dcache/icache growth will run the system out of memory. Sure, blocking a shrinker can be considered antisocial, but if you can't free memory because it will dirty inodes on a frozen filesystem, then it is better to block the shrinker until the filesystem is unfrozen than it is to allow the shrinker to fail and eventually OOM the machine. Freezing filesystems is supposed to be a short-term, workload-invisible operation. Kicking the OOM killer because you can't free inodes is far more antisocial than having those applications pause while the fs is frozen.

[*] Right now we are discussing adding a new background/async worker that does speculative preallocation removal to make it occur faster than it currently does. So skipping it is not really an option, and this is exactly the sort of "new feature needs to consider how to deal with freeze" problem I'm talking about.
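To make "block rather than skip" concrete, here's a minimal sketch - not XFS code, and every helper in it is a hypothetical stand-in - of a reclaim path taking freeze protection before doing work that dirties metadata. sb_freeze_block()/sb_freeze_unblock() stand in for whatever innermost freeze-protection primitive we end up with, and free_eofblocks() stands in for work like xfs_free_eofblocks() that logs a transaction:

#include <linux/fs.h>

/* Hypothetical primitives - stand-ins, not a real kernel API. */
void sb_freeze_block(struct super_block *sb);   /* sleeps while sb is frozen */
void sb_freeze_unblock(struct super_block *sb);
void free_eofblocks(struct inode *inode);       /* dirties/logs metadata */

static long prune_one_inode(struct inode *inode)
{
        struct super_block *sb = inode->i_sb;

        /*
         * Sleep here while the filesystem is frozen instead of skipping
         * the work: the shrinker pauses until thaw rather than dropping
         * the reclaim and eventually pushing the machine towards OOM.
         */
        sb_freeze_block(sb);
        free_eofblocks(inode);
        sb_freeze_unblock(sb);

        return 1;                       /* one object reclaimed */
}

The only point here is the shape of it: the reclaim work still happens, it just waits for the thaw.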
> > > But in the end, if you guys really feel strongly about it and decide that XFS wants to keep its mechanism of freezing on transaction start, then I won't stop you. Although it would mean XFS would have to have three counters to protect freezing while other filesystems will need only two.
> >
> > There are two counters only to avoid lockdep problems, because the different entry points to the outer shell have different locking orders. I fail to see why adding a third counter to provide an innermost freeze lock that retains the current level of protection is a big deal, because all that it really needs to work is a different lock order.
>
> The two counters are needed to avoid deadlocks, not just because of lockdep... The third counter can be added, but per-cpu counters are not exactly cheap (in terms of memory) so I don't want to add it without a good reason. Lock ordering constraints in shrinkers might be this good reason.

An additional percpu counter is far, far cheaper than the cost of always needing to consider how not to break filesystem freezing for the foreseeable future.
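Concretely, here's a minimal sketch - not the code in this patch set, and all identifiers are illustrative - of three per-cpu freeze levels: an outer one for the write(2) paths, a middle one for page faults, and the extra innermost one taken around fs-internal transactions. The nested start anticipates the xfs_trans_dup() point at the end of this mail: it only bumps the count, because the caller already holds a reference at that level.

#include <linux/percpu_counter.h>
#include <linux/wait.h>

enum {
        FREEZE_NONE      = 0,   /* not frozen */
        FREEZE_WRITE     = 1,   /* outermost: write(2) and friends */
        FREEZE_PAGEFAULT = 2,   /* middle: page faults, nests inside mmap_sem */
        FREEZE_INTERNAL  = 3,   /* innermost: fs-internal transactions */
        FREEZE_LEVELS    = 3,
};

struct freeze_state {
        struct percpu_counter   writers[FREEZE_LEVELS]; /* indexed by level - 1 */
        int                     frozen;                 /* highest level frozen so far */
        wait_queue_head_t       wait;
};

/* Enter a modification path at @level; sleeps while that level is frozen. */
static void freeze_start(struct freeze_state *fs, int level)
{
retry:
        wait_event(fs->wait, fs->frozen < level);
        percpu_counter_inc(&fs->writers[level - 1]);
        smp_mb();       /* pairs with the barrier in freeze_all() below */
        if (fs->frozen >= level) {
                /* Lost a race with the freezer: back out and wait again. */
                percpu_counter_dec(&fs->writers[level - 1]);
                wake_up(&fs->wait);
                goto retry;
        }
}

/*
 * Nested start, e.g. for xfs_trans_dup(): the caller already holds a
 * reference at @level from the original transaction, so only bump the
 * count and never block on the frozen flag.
 */
static void freeze_start_nested(struct freeze_state *fs, int level)
{
        percpu_counter_inc(&fs->writers[level - 1]);
}

static void freeze_end(struct freeze_state *fs, int level)
{
        percpu_counter_dec(&fs->writers[level - 1]);
        wake_up(&fs->wait);
}

/* Freezer side: freeze outermost to innermost, draining each level in turn. */
static void freeze_all(struct freeze_state *fs)
{
        int level;

        for (level = FREEZE_WRITE; level <= FREEZE_INTERNAL; level++) {
                fs->frozen = level;
                smp_mb();       /* new entries at this level now block */
                wait_event(fs->wait,
                           percpu_counter_sum(&fs->writers[level - 1]) == 0);
                /* sync data / call ->freeze_fs at the appropriate points */
        }
}

Thawing (resetting ->frozen and waking the waiters) and error handling are omitted; the real series also has to pick the ordering of these levels against mmap_sem, which is exactly the ranking question above.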
> Also fitting the XFS transaction system to freezing was a bit ugly (because you need to log a dummy transaction on a frozen filesystem

Sure - that's to ensure the log is dirty so that open-but-unlinked inodes are handled correctly by recovery if the system crashes while the filesystem is frozen. But the last patch in your series would prevent freezing filesystems while there are open-but-unlinked files present, so all that grot could be removed. Indeed, when that patch goes to mainline, the XFS code for handling remount,ro and freeze can be made identical....

> and you have things like xfs_trans_dup(), which requires us to bump the counter even though the filesystem is already in the freezing process) so I'd be happier if we could avoid it...

The only additional infrastructure that you need to provide is a "start_nested" operation for xfs_trans_dup() that increments the percpu counter. I don't think that's a difficult problem to solve.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@xxxxxxxxxx