> On Jan 9, 2018, at 9:29 PM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Tue, Jan 9, 2018 at 2:07 PM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>>>
>>> On Jan 8, 2018, at 8:10 PM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>>>
>>> On Mon, Jan 8, 2018 at 10:41 AM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>>>> Currently a syncfs(2) call on overlayfs simply calls sync_filesystem()
>>>> on upper_sb to synchronize all dirty inodes in the upper filesystem,
>>>> regardless of the overlay ownership of the inode. This has obvious
>>>> shortcomings, as below.
>>>>
>>>> (1) Performance
>>>> Synchronization is probably heavy in most cases, especially when the
>>>> upper filesystem is not dedicated to the target overlayfs.
>>>>
>>>> (2) Interference
>>>> Unplanned synchronization will probably impact the IO performance of
>>>> processes which use other overlayfs instances on the same upper
>>>> filesystem or use the upper filesystem directly.
>>>>
>>>> This patch iterates overlay inodes to sync only the target dirty inodes
>>>> in the upper filesystem and waits on a temporary waiting list to avoid
>>>> waiting on inodes that have started writeback afterwards. By doing this,
>>>> it is able to reduce the cost of synchronization and will not seriously
>>>> impact the IO performance of unrelated processes.
>>>
>>> This looks like a very good improvement!
>>>
>>>>
>>>> Signed-off-by: Chengguang Xu <cgxu519@xxxxxxxxxx>
>>>> ---
>>>>  fs/overlayfs/ovl_entry.h |   4 ++
>>>>  fs/overlayfs/super.c     | 101 +++++++++++++++++++++++++++++++++++++++++++++--
>>>>  2 files changed, 102 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
>>>> index 9d0bc03..e92c425 100644
>>>> --- a/fs/overlayfs/ovl_entry.h
>>>> +++ b/fs/overlayfs/ovl_entry.h
>>>> @@ -52,6 +52,8 @@ struct ovl_fs {
>>>>         /* Did we take the inuse lock? */
>>>>         bool upperdir_locked;
>>>>         bool workdir_locked;
>>>> +       /* ovl inode sync list lock */
>>>> +       spinlock_t ovl_sync_list_lock;
>>>>  };
>>>>
>>>>  /* private information held for every overlayfs dentry */
>>>> @@ -80,6 +82,8 @@ struct ovl_inode {
>>>>
>>>>         /* synchronize copy up and more */
>>>>         struct mutex lock;
>>>> +       /* ovl inode sync list */
>>>> +       struct list_head sync_list;
>>>>  };
>>>>
>>>>  static inline struct ovl_inode *OVL_I(struct inode *inode)
>>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>>>> index 76440fe..17f1406 100644
>>>> --- a/fs/overlayfs/super.c
>>>> +++ b/fs/overlayfs/super.c
>>>> @@ -17,6 +17,8 @@
>>>>  #include <linux/statfs.h>
>>>>  #include <linux/seq_file.h>
>>>>  #include <linux/posix_acl_xattr.h>
>>>> +#include <linux/writeback.h>
>>>> +#include <linux/blkdev.h>
>>>>  #include "overlayfs.h"
>>>>
>>>>  MODULE_AUTHOR("Miklos Szeredi <miklos@xxxxxxxxxx>");
>>>> @@ -195,6 +197,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
>>>>         oi->__upperdentry = NULL;
>>>>         oi->lower = NULL;
>>>>         mutex_init(&oi->lock);
>>>> +       INIT_LIST_HEAD(&oi->sync_list);
>>>>
>>>>         return &oi->vfs_inode;
>>>>  }
>>>> @@ -252,6 +255,96 @@ static void ovl_put_super(struct super_block *sb)
>>>>         ovl_free_fs(ofs);
>>>>  }
>>>>
>>>> +/**
>>>> + * ovl_sync_filesystem
>>>> + * @sb: The overlayfs super block
>>>> + *
>>>> + * Sync underlying dirty inodes in upper filesystem and wait for completion.
>>>> + * Use temporary wait list to avoid waiting on inodes that have started
>>>> + * writeback after this point.
>>>> + */
>>>> +static int ovl_sync_filesystem(struct super_block *sb)
>>>> +{
>>>> +       struct ovl_fs *ofs = sb->s_fs_info;
>>>> +       struct super_block *upper_sb = ofs->upper_mnt->mnt_sb;
>>>> +       struct ovl_inode *oi, *oi_next;
>>>> +       struct inode *inode, *i_next;
>>>> +       struct inode *upper_inode;
>>>> +       struct blk_plug plug;
>>>> +       LIST_HEAD(ovl_sync_list);
>>>> +
>>>> +       struct writeback_control wbc = {
>>>> +               .sync_mode      = WB_SYNC_ALL,
>>>
>>> Shouldn't this be WB_SYNC_NONE? We are explicitly waiting on the
>>> inodes on the list afterwards…
>>
>> IIUC, when wait == 1 we should use WB_SYNC_ALL mode.
>
> Well, yes, but for a different reason.
> According to the documentation of write_cache_pages(),
> WB_SYNC_ALL should be used for data integrity sync, which
> is the case of syncfs, to start write also on pages that are already
> under io, but WB_SYNC_ALL *also* implies waiting on all pages
> after starting io and that is not needed in this case, because you call
> filemap_fdatawait_keep_errors() later. Nevermind. WB_SYNC_ALL
> is the right value to use here AFAICT.

If you look carefully inside __writeback_single_inode(), there is a
condition that checks whether this writeback was requested with
"for_sync" or not; if it was, the wait is skipped, because sync has a
separate, external IO completion path and ->sync_fs for guaranteeing
that inode metadata is written back correctly.

>
>>
>>
>>>
>>>> +               .for_sync       = 1,
>>>> +               .range_start    = 0,
>>>> +               .range_end      = LLONG_MAX,
>>>> +               .nr_to_write    = LONG_MAX,
>>>> +       };
>>>> +
>>>> +       blk_start_plug(&plug);
>>>> +       spin_lock(&sb->s_inode_list_lock);
>>>> +       list_for_each_entry_safe(inode, i_next, &sb->s_inodes, i_sb_list) {
>>>> +               upper_inode = ovl_inode_upper(inode);
>>>> +               if (!upper_inode)
>>>> +                       continue;
>>>> +
>>>> +               spin_lock(&upper_inode->i_lock);
>>>> +               if (upper_inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE) ||
>>>> +                       list_empty(&upper_inode->i_io_list)) {
>>>> +                       spin_unlock(&upper_inode->i_lock);
>>>> +                       continue;
>>>> +               }
>>>> +               spin_unlock(&upper_inode->i_lock);
>>>> +
>>>> +               if (!igrab(inode))
>>>> +                       continue;
>>>> +
>>>> +               if (!igrab(upper_inode)) {
>>>> +                       iput(inode);
>>>> +                       continue;
>>>> +               }
>>>> +               spin_unlock(&sb->s_inode_list_lock);
>>>> +
>>>> +               sync_inode(upper_inode, &wbc);
>>>
>>> Maybe it is allowed to call sync_inode() under s_inode_list_lock
>>> if using WB_SYNC_NONE???
>>
>> I think if we hold s_inode_list_lock for a long time, it will probably
>> affect inode allocation/destruction, because every inode needs to be
>> added to/removed from the sb->s_inodes list.
>>
>>
>
> Considering that we must use WB_SYNC_ALL, my comment here
> is irrelevant.
>
>>> Do we not need to check the return value from sync_inode()?
>>
>> I have no idea what extra work we can do for the error case;
>> __sync_filesystem() itself does not check the return code seriously.
>>
>>
>>> I bet it could be EROFS if upper fs was remounted readonly
>>> and we did not check for that before starting to sync…
>>
>> I’m curious whether there is a real use case like this?
>> Anyway, let’s add the check for safety.
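A minimal sketch of what that return-value check might look like (ret and
err here are assumed locals of ovl_sync_filesystem(), with err returned at
the end; this is an illustration, not code from any posted revision):

        ret = sync_inode(upper_inode, &wbc);
        if (ret && !err)
                err = ret;      /* remember the first error */

Recording only the first error and continuing means one failing upper inode
does not prevent the remaining inodes from being written back.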
>>
>>>
>>>> +               spin_lock(&ofs->ovl_sync_list_lock);
>>>> +               if (list_empty(&OVL_I(inode)->sync_list))
>>>> +                       list_add(&OVL_I(inode)->sync_list, &ovl_sync_list);
>>>> +               else {
>>>> +                       iput(upper_inode);
>>>> +                       iput(inode);
>>>> +               }
>>>> +               spin_unlock(&ofs->ovl_sync_list_lock);
>>>> +
>>>> +               if (need_resched()) {
>>>> +                       blk_finish_plug(&plug);
>>>> +                       cond_resched();
>>>> +                       blk_start_plug(&plug);
>>>> +               }
>>>> +               spin_lock(&sb->s_inode_list_lock);
>>>> +       }
>>>> +       spin_unlock(&sb->s_inode_list_lock);
>>>> +       blk_finish_plug(&plug);
>>>> +
>>>> +       mutex_lock(&upper_sb->s_sync_lock);
>>>
>>> I suppose mutex_lock_interruptible() is in order.
>>
>> I can’t see the benefit of it, and considering interruption by signal,
>> I’ll replace the temporary waiting list with the sb->s_inodes list.
>>
>
> If you use mutex_lock_interruptible(), at least then the 2nd user calling
> syncfs can interrupt the wait while the 1st user is syncing.

Current syncfs for other filesystems does not behave like this, so I want
to keep consistency with them.

Thanks,
Chengguang.
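For reference, the interruptible locking Amir suggested would look roughly
like the sketch below; err and the out label are assumed to exist in
ovl_sync_filesystem() and are not part of the posted patch:

        err = mutex_lock_interruptible(&upper_sb->s_sync_lock);
        if (err) {
                /*
                 * A second syncfs() caller blocked here while the first
                 * one is still syncing can bail out with -EINTR on a
                 * signal instead of sleeping uninterruptibly.
                 */
                goto out;
        }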