RE: [RFC PATCH] fs: file freeze support

Namjae Jeon <namjae.jeon@xxxxxxxxxxx> · Mon, 19 Jan 2015 22:07:01 +0900



> On Thu, Jan 15, 2015 at 08:36:55PM +0900, Namjae Jeon wrote:
> > We introduce per-file freeze feature for unifying defrag ext4 and xfs
> > as first ingredient. We get the idea courtesy of Dave Chinner
> > (https://lkml.org/lkml/2014/7/14/759)
Hi Dave,
> 
> /me pages the context back in
> 
> > per-file freeze will be used to avoid that file is not modified while
> > userspace is doing the defrag.
> >
> > This patch tries to implement file write freeze functionality.
> > Introduce new inode state, I_WRITE_FREEZED.
> 
> I_WRITE_FROZEN.
Okay.
> 
> > When this state is set, any process which tries to modify the file's address
> > space, either by pagefault mmap writes or using write(2), will block until
> > the this state is cleared. I_WRITE_FREEZED is set by calling FS_IOC_FWFREEZE
> > ioctl and clear by FS_IOC_FWTHAW ioctl.
> >
> > File write freeze functionality, when used in conjunction with
> > inode's immutable flag can be used for creating truly stable file snapshots
> > wherein write freeze will prevent any modification to the file from already
> > open file descriptors and immutable flag will prevent any new modification
> > to the file. One of the intended uses for stable file snapshots would be in
> > the defragmentation applications which defrags single file.
> 
> I don't quite understand why the full filesystem freeze is
> necessary? The thaw occurs immediately after I_WRITE_FREEZED is set,
We started by looking at fs freeze for file freeze implementation,
So got biased for using fs freeze or similar approach.
Thanks for suggesting a better way.

> which means there's nothing that prevent the file from being
> truncated or otherwise modified by fallocate, etc while it is
> frozen....
Right, So, After that, we had also thought of setting immutable
flag of inode. Immutable flag + I_WRITE_FROZEN => truly frozen file.

> 
> AFAICT, fsync will bring the file down to a consistent state and
> we've already got freeze hooks for all inode modification
> operations. We also have IO barriers for truncate operations so that
> we can wait for all outstanding IO to complete, so I would have
> thought this covers all bases for an inode freeze. i.e.:
Right.

> 
> i_mutex -> I_FROZEN -> fsync -> inode_dio_wait
> 
> Should give us a clean inode where there are not ongoing operations
> by the time that inode_dio_wait() completes. All new modification
> operations need to check I_FROZEN in addition to the superblock
> freeze checks...
I checked the routines where checks for I_FROZEN would be required.
Most of them are Ok but do_unlinkat() confuses me a little.
vfs_unlink is called under parent inode's i_mutex, so we cannot sleep
keeping parent's i_mutex held.
i.e while freezing file, all file in directory are blocked by parent
i_mutex. Is it ok to release parnets->mutex before checking for I_FROZEN
or there is some idea?

> 
> > For implementation purpose, initially we tried to keep percpu usage counters
> > inside struct inode just like there is struct sb_writers in super_block.
> > But considering that it will significantly bloat up struct inode when actually
> > the usage of file write freeze will be infrequent, we dropped this idea.
> > Instead we have tried to use already present filesystem freezing infrastructure.
> > Current approach makes it possible for implementing file write freeze without
> > bloating any of struct super_block/inode.
> > In FS_IOC_FWFREEZE, we wait for complete fs to be frozen, set I_WRITE_FREEZED to
> > inode's state and unfreeze the fs.
> >
> > TODO : prevent inode eviction when I_WRITE_FREEZED state is set.
> 
> The app needs to retain a reference to the inode (i.e. open
> file descriptor) for the length of time that the inode is to be
> frozen. That will prevent reclaim of the inode.
> 
> If the app fails to thaw the inode before it drops all references
> to it, then all bets are off - we do not want to give userspace a
> mechanism to pin arbitrarily large amounts of memory....
Okay.

> 
> > TODO : xfstests testcase for file freeze.
> >
> > Signed-off-by: Namjae Jeon <namjae.jeon@xxxxxxxxxxx>
> > Signed-off-by: Ashish Sangwan <a.sangwan@xxxxxxxxxxx>
> > ---
> >  fs/btrfs/inode.c        |  1 +
> >  fs/buffer.c             |  1 +
> >  fs/ext4/inode.c         |  1 +
> >  fs/f2fs/file.c          |  1 +
> >  fs/gfs2/file.c          |  1 +
> >  fs/inode.c              | 19 ++++++++++++++++++
> >  fs/ioctl.c              | 30 +++++++++++++++++++++++++++++
> >  fs/nilfs2/file.c        |  1 +
> >  fs/ocfs2/mmap.c         |  1 +
> >  fs/super.c              | 51 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/fs.h      | 13 +++++++++++++
> >  include/uapi/linux/fs.h |  2 ++
> >  mm/filemap.c            |  1 +
> >  13 files changed, 123 insertions(+)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index e687bb0..357c749 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -8262,6 +8262,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  	u64 page_start;
> >  	u64 page_end;
> >
> > +	inode_start_write(inode);
> >  	sb_start_pagefault(inode->i_sb);
> 
> Wouldn't it be better to have:
> 
> 	inode_start_pagefault(inode);
> 
> and then have it call sb_start_pagefault() internally?
Right.

> 
> > +void inode_start_write(struct inode *inode)
> > +{
> > +	struct super_block *sb = inode->i_sb;
> > +
> > +retry:
> > +	spin_lock(&inode->i_lock);
> > +	if (inode->i_state & I_WRITE_FREEZED) {
> > +		DEFINE_WAIT(wait);
> > +
> > +		prepare_to_wait(&sb->s_writers.wait_unfrozen, &wait,
> > +				TASK_UNINTERRUPTIBLE);
> > +		spin_unlock(&inode->i_lock);
> > +		schedule();
> > +		finish_wait(&sb->s_writers.wait_unfrozen, &wait);
> > +		goto retry;
> > +	}
> > +	spin_unlock(&inode->i_lock);
> 
> Please use a while() loop, not goto.
Okay.

> 
> 	spin_lock(&inode->i_lock);
> 	while(inode->i_state & I_WRITE_FROZEN) {
> 		DEFINE_WAIT(wait);
> 
> 		prepare_to_wait(&sb->s_writers.wait_unfrozen, &wait,
> 				TASK_UNINTERRUPTIBLE);
> 		spin_unlock(&inode->i_lock);
> 		schedule();
> 		finish_wait(&sb->s_writers.wait_unfrozen, &wait);
> 		spin_lock(&inode->i_lock);
> 	}
> 	spin_unlock(&inode->i_lock);
> 
> > +
> > +int file_write_freeze(struct inode *inode)
> > +{
> > +	struct super_block *sb = inode->i_sb;
> > +	int ret = 0;
> > +
> > +	if (!S_ISREG(inode->i_mode))
> > +		return -EINVAL;
> > +
> > +	spin_lock(&inode->i_lock);
> > +
> > +	if (inode->i_state & I_WRITE_FREEZED)
> > +		ret = -EBUSY;
> > +	spin_unlock(&inode->i_lock);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = freeze_super(sb);
> > +	if (ret)
> > +		return ret;
> > +
> > +	spin_lock(&inode->i_lock);
> > +	inode->i_state |= I_WRITE_FREEZED;
> > +	smp_wmb();
> > +	spin_unlock(&inode->i_lock);
> > +
> > +	return thaw_super(sb);
> > +}
> > +EXPORT_SYMBOL(file_write_freeze);
> > +
> > +int file_write_unfreeze(struct inode *inode)
> > +{
> > +	struct super_block *sb = inode->i_sb;
> > +
> > +	if (!S_ISREG(inode->i_mode))
> > +		return -EINVAL;
> > +
> > +	spin_lock(&inode->i_lock);
> > +
> > +	if (!(inode->i_state & I_WRITE_FREEZED)) {
> > +		spin_unlock(&inode->i_lock);
> > +		return -EINVAL;
> > +	}
> > +
> > +	inode->i_state &= ~I_WRITE_FREEZED;
> > +	smp_wmb();
> > +	wake_up(&sb->s_writers.wait_unfrozen);
> > +	spin_unlock(&inode->i_lock);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(file_write_unfreeze);
> 
> These look terrible racy. Freeze can race with other filesystem
> freeze and thaw operations, so if thaw_super() fails you could have
> frozen the inode but you still get -EINVAL for the ioctl. Or you
> could have multiple applications doing file freeze/thaw, and so an
> unfreeze occurs while a freeze if waiting on a thaw, and so the app
> waiting on a freeze might return with an unfrozen inode, even though
> there were no errors....
True.

> 
> 
> > @@ -2328,6 +2340,7 @@ static inline bool file_start_write_trylock(struct file *file)
> >  {
> >  	if (!S_ISREG(file_inode(file)->i_mode))
> >  		return true;
> > +	inode_start_write(file_inode(file));
> >  	return __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, false);
> >  }
> 
> This needs "trylock" semantics, not blocking semantics.
Okay.

Thanks for your review!!
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html