On Thu 12-01-12 09:36:31, Dave Chinner wrote: > On Wed, Jan 11, 2012 at 03:53:24PM -0600, Eric Sandeen wrote: > > On 1/6/12 8:09 AM, Jan Kara wrote: > > > Hello, > > > > > > I was looking at what causes filesystem to have dirty data after it is > > > frozen. After some thought I realized freezing code is inherently racy and > > > all filesystems (ext3, ext4, xfs) can have dirty data on frozen filesystem. > > > > > > The race is basically following: > > > Task 1 Task 2 > > > freeze_super() __generic_file_aio_write() > > > ... vfs_check_frozen(sb, SB_FREEZE_WRITE) > > > sb->s_frozen = SB_FREEZE_WRITE; > > > sync_filesystem(sb); > > > do the write > > > /* Here we create dirty data > > > * which is left on frozen fs */ > > > sb->s_frozen = SB_FREEZE_TRANS; > > > ... > > > ->freeze_fs() > > > > > > The problem is that you can never make checking for frozen filesystem > > > race-free with the current s_frozen scheme - the filesystem can always be > > > frozen the instant after you check for it and you end up creating dirty > > > data on frozen filesystem. > > > > > > The question is what to do with this problem. I outline the possibilities > > > that come to my mind below: > > > 1) Ignore the problem - depending on the exact fs details this could lead to > > > fs snapshot being corrupted, also flusher thread can hang on the frozen > > > filesystem (e.g. because of sync(1)) creating all sorts of secondary > > > issues. So I don't think this is really an option. > > > 2) Have a rwlock in the superblock that is held for writing while > > > filesystem freezing is in progress and held for reading by the filesystem > > > while a transaction is running except for transactions that are required > > > to do writeback. This is kind of ugly but at least for ext3/4 relatively > > > easy to implement. > > > > This is as far as I had gotten while independently thinking about it ;) > > > > But talking with dchinner, he had concerns about the scalability of any > > rwlock, and I think we (ok, mostly Dave) came up with another idea. > > > > What if we had 2 counters in the superblock, one for the equivalent of > > SB_FREEZE_WRITE and one for SB_FREEZE_TRANS. These would use similar > > infrastructure to mnt_want_write et al. > > > > Everywhere we currently vfs_check_frozen() we'd have a better-named function > > which increments the counter, then checks the freeze level. If we are > > being frozen, we drop the counter & wait. If not frozen, we continue; > > like this pseudo-code: > > > > void super_freeze_wait(sb, level) { > > while (1) { > > level_ref++; > > if (!frozen(sb, level)) > > return; /* not freezing */ > > level_ref--; > > wait_unfrozen(sb, level); > > } > > } > > > > There would also be new code to drop the counter when the dirtying is complete. > > > > The freezing functions then just have to wait until the counters hit zero > > before they can consider themselves done, and freezing is complete. That way if > > someone sneaks in while the freeze level is being set, they have already > > notified their intent, and freeze can wait for it anyway before returning. > > Just to clarify, freeze_super would do: > > sb->s_frozen = SB_FREEZE_WRITE; > smp_wmb(); > > while (sb->s_active_write_cnt > 0) > wait; > > /* no new or existing dirtying writers now, safe to sync */ > sync_filesystem(sb); > > sb->s_frozen = SB_FREEZE_TRANS; > smp_wmb(); > > while (sb->s_active_trans_cnt > 0) > wait; > > /* no new or existing transactions in progress now, so freeze */ > sb->s_op->freeze_fs(sb); > > The counter implemetations will need to scale (e.g. per-cpu > counters) and we could probably use a generic waitqueue, but I think > this can all be implemented at the superblock level and we only need > to call the inc/dec helper functions in the correct places to make > it all work. Yeah, this is where my thoughts were going as well so I was already working on patches in this direction. I'll send a first version of them in a moment. Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html