Re: [RFC] How to fix broken freezing?

Jan Kara <jack@xxxxxxx> · Thu, 12 Jan 2012 02:09:36 +0100



On Thu 12-01-12 09:36:31, Dave Chinner wrote:
> On Wed, Jan 11, 2012 at 03:53:24PM -0600, Eric Sandeen wrote:
> > On 1/6/12 8:09 AM, Jan Kara wrote:
> > >   Hello,
> > > 
> > >   I was looking at what causes filesystem to have dirty data after it is
> > > frozen. After some thought I realized freezing code is inherently racy and
> > > all filesystems (ext3, ext4, xfs) can have dirty data on frozen filesystem.
> > > 
> > > The race is basically following:
> > > 	Task 1					Task 2
> > > freeze_super()				__generic_file_aio_write()
> > >   ...					  vfs_check_frozen(sb, SB_FREEZE_WRITE)
> > >   sb->s_frozen = SB_FREEZE_WRITE;
> > >   sync_filesystem(sb);
> > > 					  do the write
> > > 					    /* Here we create dirty data
> > > 					     * which is left on frozen fs */
> > >   sb->s_frozen = SB_FREEZE_TRANS;
> > >   ...
> > >   ->freeze_fs()
> > > 
> > > The problem is that you can never make checking for frozen filesystem
> > > race-free with the current s_frozen scheme - the filesystem can always be
> > > frozen the instant after you check for it and you end up creating dirty
> > > data on frozen filesystem.
> > > 
> > > The question is what to do with this problem. I outline the possibilities
> > > that come to my mind below:
> > > 1) Ignore the problem - depending on the exact fs details this could lead to
> > >    fs snapshot being corrupted, also flusher thread can hang on the frozen
> > >    filesystem (e.g. because of sync(1)) creating all sorts of secondary
> > >    issues. So I don't think this is really an option.
> > > 2) Have a rwlock in the superblock that is held for writing while
> > >    filesystem freezing is in progress and held for reading by the filesystem
> > >    while a transaction is running except for transactions that are required
> > >    to do writeback. This is kind of ugly but at least for ext3/4 relatively
> > >    easy to implement.
> > 
> > This is as far as I had gotten while independently thinking about it ;)
> > 
> > But talking with dchinner, he had concerns about the scalability of any
> > rwlock, and I think we (ok, mostly Dave) came up with another idea.
> > 
> > What if we had 2 counters in the superblock, one for the equivalent of
> > SB_FREEZE_WRITE and one for SB_FREEZE_TRANS.  These would use similar
> > infrastructure to mnt_want_write et al.
> > 
> > Everywhere we currently vfs_check_frozen() we'd have a better-named function
> > which increments the counter, then checks the freeze level.  If we are
> > being frozen, we drop the counter & wait.  If not frozen, we continue;
> > like this pseudo-code:
> > 
> > void super_freeze_wait(sb, level) {
> > 	while (1) {
> > 		level_ref++;
> > 		if (!frozen(sb, level))
> > 			return;	/* not freezing */
> > 		level_ref--;
> > 		wait_unfrozen(sb, level);
> > 	}
> > }
> > 
> > There would also be new code to drop the counter when the dirtying is complete.
> > 
> > The freezing functions then just have to wait until the counters hit zero
> > before they can consider themselves done, and freezing is complete.  That way if
> > someone sneaks in while the freeze level is being set, they have already
> > notified their intent, and freeze can wait for it anyway before returning.
> 
> Just to clarify, freeze_super would do:
> 
> 	sb->s_frozen = SB_FREEZE_WRITE;
> 	smp_wmb();
> 
> 	while (sb->s_active_write_cnt > 0)
> 		wait;
> 
> 	/* no new or existing dirtying writers now, safe to sync */
> 	sync_filesystem(sb);
> 
> 	sb->s_frozen = SB_FREEZE_TRANS;
> 	smp_wmb();
> 
> 	while (sb->s_active_trans_cnt > 0)
> 		wait;
> 
> 	/* no new or existing transactions in progress now, so freeze */
> 	sb->s_op->freeze_fs(sb);
> 
> The counter implemetations will need to scale (e.g. per-cpu
> counters) and we could probably use a generic waitqueue, but I think
> this can all be implemented at the superblock level and we only need
> to call the inc/dec helper functions in the correct places to make
> it all work.
  Yeah, this is where my thoughts were going as well so I was already
working on patches in this direction. I'll send a first version of them in
a moment.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html