On Wed, Jan 10, 2018 at 3:45 AM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote: > >> 在 2018年1月9日,下午9:29,Amir Goldstein <amir73il@xxxxxxxxx> 写道: >> >> On Tue, Jan 9, 2018 at 2:07 PM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote: >>>> >>>> 在 2018年1月8日,下午8:10,Amir Goldstein <amir73il@xxxxxxxxx> 写道: >>>> >>>> On Mon, Jan 8, 2018 at 10:41 AM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote: >>>>> Currently syncfs(2) call on overlayfs just simply call sync_filesystem() >>>>> on upper_sb to synchronize whole dirty inodes in upper filesystem >>>>> regardless of the overlay ownership of the inode. It has obvious >>>>> shortcomings as below. >>>>> >>>>> (1) Performance >>>>> Synchronization is probably heavy in most cases, especially when upper >>>>> filesystem is not dedicated to target overlayfs. >>>>> >>>>> (2) Interference >>>>> Unplanned synchronization will probably impact IO performance of the >>>>> processes which use other overlayfs on same upper filesystem or directly >>>>> use upper filesystem. >>>>> >>>>> This patch iterates overlay inodes to only sync target dirty inodes in >>>>> upper filesystem and wait on temporary waiting list to avoid waiting on >>>>> inodes that have started writeback afterwards. By doing this, >>>>> it is able to reduce cost of synchronization and will not seriously impact >>>>> IO performance of irrelative processes. >>>> >>>> This looks like a very good improvement! >>>> >>>>> >>>>> Signed-off-by: Chengguang Xu <cgxu519@xxxxxxxxxx> >>>>> --- >>>>> fs/overlayfs/ovl_entry.h | 4 ++ >>>>> fs/overlayfs/super.c | 101 +++++++++++++++++++++++++++++++++++++++++++++-- >>>>> 2 files changed, 102 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h >>>>> index 9d0bc03..e92c425 100644 >>>>> --- a/fs/overlayfs/ovl_entry.h >>>>> +++ b/fs/overlayfs/ovl_entry.h >>>>> @@ -52,6 +52,8 @@ struct ovl_fs { >>>>> /* Did we take the inuse lock? */ >>>>> bool upperdir_locked; >>>>> bool workdir_locked; >>>>> + /* ovl inode sync list lock */ >>>>> + spinlock_t ovl_sync_list_lock; >>>>> }; >>>>> >>>>> /* private information held for every overlayfs dentry */ >>>>> @@ -80,6 +82,8 @@ struct ovl_inode { >>>>> >>>>> /* synchronize copy up and more */ >>>>> struct mutex lock; >>>>> + /* ovl inode sync list */ >>>>> + struct list_head sync_list; >>>>> }; >>>>> >>>>> static inline struct ovl_inode *OVL_I(struct inode *inode) >>>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c >>>>> index 76440fe..17f1406 100644 >>>>> --- a/fs/overlayfs/super.c >>>>> +++ b/fs/overlayfs/super.c >>>>> @@ -17,6 +17,8 @@ >>>>> #include <linux/statfs.h> >>>>> #include <linux/seq_file.h> >>>>> #include <linux/posix_acl_xattr.h> >>>>> +#include <linux/writeback.h> >>>>> +#include <linux/blkdev.h> >>>>> #include "overlayfs.h" >>>>> >>>>> MODULE_AUTHOR("Miklos Szeredi <miklos@xxxxxxxxxx>"); >>>>> @@ -195,6 +197,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb) >>>>> oi->__upperdentry = NULL; >>>>> oi->lower = NULL; >>>>> mutex_init(&oi->lock); >>>>> + INIT_LIST_HEAD(&oi->sync_list); >>>>> >>>>> return &oi->vfs_inode; >>>>> } >>>>> @@ -252,6 +255,96 @@ static void ovl_put_super(struct super_block *sb) >>>>> ovl_free_fs(ofs); >>>>> } >>>>> >>>>> +/** >>>>> + * ovl_sync_filesystem >>>>> + * @sb: The overlayfs super block >>>>> + * >>>>> + * Sync underlying dirty inodes in upper filesystem and wait for completion. >>>>> + * Use temporary wait list to avoid waiting on inodes that have started >>>>> + * writeback after this point. >>>>> + */ >>>>> +static int ovl_sync_filesystem(struct super_block *sb) >>>>> +{ >>>>> + struct ovl_fs *ofs = sb->s_fs_info; >>>>> + struct super_block *upper_sb = ofs->upper_mnt->mnt_sb; >>>>> + struct ovl_inode *oi, *oi_next; >>>>> + struct inode *inode, *i_next; >>>>> + struct inode *upper_inode; >>>>> + struct blk_plug plug; >>>>> + LIST_HEAD(ovl_sync_list); >>>>> + >>>>> + struct writeback_control wbc = { >>>>> + .sync_mode = WB_SYNC_ALL, >>>> >>>> Shouldn't this be WB_SYNC_NONE? We are explicitly waiting >>>> inodes on the list afterwards… >>> >>> IIUC, when wait == 1 should use WB_SYNC_ALL mode. >> >> Well, yes, but for a different reason. >> According to documentation of write_cache_pages() >> WB_SYNC_ALL should be used for data integrity sync, which >> is the case of syncfs, to start write also on pages that are already >> under io, but WB_SYNC_ALL *also* implies waiting on all pages >> after starting io and that is not needed in this case, because you call >> filemap_fdatawait_keep_errors() later. Nevermind. WB_SYNC_ALL >> is the right value to use here AFAICT. > > If you look carefully inside __writeback_single_inode(), there is a condition > to check whether this writeback caused by reason “for_sync" or not, and if it is > then skip wait because it has a separate, external IO completion path and sync_fs > for guaranteeing inode metadata is written back correctly. > > I see. I was confused by this code in write_cache_pages(): if (PageWriteback(page)) { if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); I thought it meant that if sync_mode == WB_SYNC_ALL, then waiting on IO completion inside write_cache_pages(), but it means waiting for a previous IO submission to complete before re-submitting IO. >> >>> >>> >>>> >>>>> + .for_sync = 1, >>>>> + .range_start = 0, >>>>> + .range_end = LLONG_MAX, >>>>> + .nr_to_write = LONG_MAX, >>>>> + }; >>>>> + >>>>> + blk_start_plug(&plug); >>>>> + spin_lock(&sb->s_inode_list_lock); >>>>> + list_for_each_entry_safe(inode, i_next, &sb->s_inodes, i_sb_list) { >>>>> + upper_inode = ovl_inode_upper(inode); >>>>> + if (!upper_inode) >>>>> + continue; >>>>> + >>>>> + spin_lock(&upper_inode->i_lock); >>>>> + if (upper_inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE) || >>>>> + list_empty(&upper_inode->i_io_list)) { >>>>> + spin_unlock(&upper_inode->i_lock); >>>>> + continue; >>>>> + } >>>>> + spin_unlock(&upper_inode->i_lock); >>>>> + >>>>> + if (!igrab(inode)) >>>>> + continue; >>>>> + >>>>> + if (!igrab(upper_inode)) { >>>>> + iput(inode); >>>>> + continue; >>>>> + } >>>>> + spin_unlock(&sb->s_inode_list_lock); >>>>> + >>>>> + sync_inode(upper_inode, &wbc); [...] >>>>> + spin_lock(&ofs->ovl_sync_list_lock); >>>>> + if (list_empty(&OVL_I(inode)->sync_list)) >>>>> + list_add(&OVL_I(inode)->sync_list, &ovl_sync_list); >>>>> + else { >>>>> + iput(upper_inode); >>>>> + iput(inode); >>>>> + } >>>>> + spin_unlock(&ofs->ovl_sync_list_lock); >>>>> + >>>>> + if (need_resched()) { >>>>> + blk_finish_plug(&plug); >>>>> + cond_resched(); >>>>> + blk_start_plug(&plug); >>>>> + } >>>>> + spin_lock(&sb->s_inode_list_lock); >>>>> + } >>>>> + spin_unlock(&sb->s_inode_list_lock); >>>>> + blk_finish_plug(&plug); >>>>> + >>>>> + mutex_lock(&upper_sb->s_sync_lock); >>>> >>>> I suppose mutex_lock_interruptible() is in order. >>> >>> I can’t see benefit for it, and considering interruption by signal >>> I’ll replace temporary waiting list using sb->s_inodes list. >>> >> >> If you use mutex_lock_interruptible() at least then the 2nd user calling >> syncfs can interrupt the wait while 1st user is syncing. > > Current syncfs for other filesystems does not behave like this, > so I want to keep consistency with others. > Makes sense. Thanks, Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html