Hello, On Thu 15-01-15 20:36:55, Namjae Jeon wrote: > We introduce per-file freeze feature for unifying defrag ext4 and xfs > as first ingredient. We get the idea courtesy of Dave Chinner > (https://lkml.org/lkml/2014/7/14/759) > per-file freeze will be used to avoid that file is not modified while > userspace is doing the defrag. > > This patch tries to implement file write freeze functionality. > Introduce new inode state, I_WRITE_FREEZED. > When this state is set, any process which tries to modify the file's address > space, either by pagefault mmap writes or using write(2), will block until > the this state is cleared. I_WRITE_FREEZED is set by calling FS_IOC_FWFREEZE > ioctl and clear by FS_IOC_FWTHAW ioctl. > > File write freeze functionality, when used in conjunction with > inode's immutable flag can be used for creating truly stable file snapshots > wherein write freeze will prevent any modification to the file from already > open file descriptors and immutable flag will prevent any new modification > to the file. One of the intended uses for stable file snapshots would be in > the defragmentation applications which defrags single file. > > For implementation purpose, initially we tried to keep percpu usage counters > inside struct inode just like there is struct sb_writers in super_block. > But considering that it will significantly bloat up struct inode when actually > the usage of file write freeze will be infrequent, we dropped this idea. > Instead we have tried to use already present filesystem freezing infrastructure. > Current approach makes it possible for implementing file write freeze without > bloating any of struct super_block/inode. > In FS_IOC_FWFREEZE, we wait for complete fs to be frozen, set I_WRITE_FREEZED to > inode's state and unfreeze the fs. > > TODO : prevent inode eviction when I_WRITE_FREEZED state is set. > TODO : xfstests testcase for file freeze. > > Signed-off-by: Namjae Jeon <namjae.jeon@xxxxxxxxxxx> > Signed-off-by: Ashish Sangwan <a.sangwan@xxxxxxxxxxx> > --- > fs/btrfs/inode.c | 1 + > fs/buffer.c | 1 + > fs/ext4/inode.c | 1 + > fs/f2fs/file.c | 1 + > fs/gfs2/file.c | 1 + > fs/inode.c | 19 ++++++++++++++++++ > fs/ioctl.c | 30 +++++++++++++++++++++++++++++ > fs/nilfs2/file.c | 1 + > fs/ocfs2/mmap.c | 1 + > fs/super.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++ > include/linux/fs.h | 13 +++++++++++++ > include/uapi/linux/fs.h | 2 ++ > mm/filemap.c | 1 + > 13 files changed, 123 insertions(+) > ... > diff --git a/fs/inode.c b/fs/inode.c > index aa149e7..82a96d0 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -1929,3 +1929,22 @@ void inode_set_flags(struct inode *inode, unsigned int flags, > new_flags) != old_flags)); > } > EXPORT_SYMBOL(inode_set_flags); > + > +void inode_start_write(struct inode *inode) > +{ > + struct super_block *sb = inode->i_sb; > + > +retry: > + spin_lock(&inode->i_lock); > + if (inode->i_state & I_WRITE_FREEZED) { > + DEFINE_WAIT(wait); > + > + prepare_to_wait(&sb->s_writers.wait_unfrozen, &wait, > + TASK_UNINTERRUPTIBLE); > + spin_unlock(&inode->i_lock); > + schedule(); > + finish_wait(&sb->s_writers.wait_unfrozen, &wait); > + goto retry; > + } > + spin_unlock(&inode->i_lock); > +} ... > diff --git a/fs/super.c b/fs/super.c > index eae088f..5e44e42 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -1393,3 +1393,54 @@ out: > return 0; > } > EXPORT_SYMBOL(thaw_super); > + > +int file_write_freeze(struct inode *inode) > +{ > + struct super_block *sb = inode->i_sb; > + int ret = 0; > + > + if (!S_ISREG(inode->i_mode)) > + return -EINVAL; > + > + spin_lock(&inode->i_lock); > + > + if (inode->i_state & I_WRITE_FREEZED) > + ret = -EBUSY; > + spin_unlock(&inode->i_lock); > + if (ret) > + return ret; > + > + ret = freeze_super(sb); > + if (ret) > + return ret; > + > + spin_lock(&inode->i_lock); > + inode->i_state |= I_WRITE_FREEZED; > + smp_wmb(); > + spin_unlock(&inode->i_lock); > + > + return thaw_super(sb); > +} > +EXPORT_SYMBOL(file_write_freeze); > + > +int file_write_unfreeze(struct inode *inode) > +{ > + struct super_block *sb = inode->i_sb; > + > + if (!S_ISREG(inode->i_mode)) > + return -EINVAL; > + > + spin_lock(&inode->i_lock); > + > + if (!(inode->i_state & I_WRITE_FREEZED)) { > + spin_unlock(&inode->i_lock); > + return -EINVAL; > + } > + > + inode->i_state &= ~I_WRITE_FREEZED; > + smp_wmb(); > + wake_up(&sb->s_writers.wait_unfrozen); > + spin_unlock(&inode->i_lock); > + return 0; > +} > +EXPORT_SYMBOL(file_write_unfreeze); So I was looking at the implementation and I have a few comments: 1) The trick with freezing superblock looks nice but I'm somewhat worried that if we wanted to heavily use per-inode freezing to defrag the whole filesystem it may be too slow to freeze the whole fs, mark one inode as frozen and then unfreeze the fs. But I guess we'll see that once have some reasonably working implementation. 2) The tests you are currently doing are racy. If things happen as: CPU1 CPU2 inode_start_write() file_write_freeze() sb_start_pagefault() Do modifications. Then you have a CPU modifying a file while file_write_freeze() has succeeded so it should be frozen. If you swap inode_start_write() with sb_start_pagefault() the above race doesn't happen but userspace program has to be really careful not to hit a deadlock. E.g. if you tried to freeze two inodes the following could happen: CPU1 CPU2 file_write_freeze(inode1) fault on inode1: sb_start_pagefault() inode_start_write() -> blocks file_write_freeze(inode2) blocks in freeze_super() So I don't think this is a good scheme for inode freezing... Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html