Hi Jan,

Thanks for your comments, I have some questions below.

> On 25 Jan 2018, at 10:32 PM, Jan Kara <jack@xxxxxxx> wrote:
>
> On Thu 25-01-18 17:35:55, Chengguang Xu wrote:
>> Currently a syncfs(2) call on overlayfs simply calls sync_filesystem()
>> on the upper_sb to synchronize all dirty inodes in the upper filesystem,
>> regardless of the overlay ownership of the inode. In the container use
>> case, when multiple containers use the same underlying upper filesystem,
>> this has the shortcomings below.
>>
>> (1) Performance
>> Synchronization is probably heavy because it actually syncs inodes that
>> are unnecessary for the target overlayfs.
>>
>> (2) Interference
>> Unplanned synchronization will probably impact the IO performance of
>> unrelated container processes on the other overlayfs instances.
>>
>> This patch iterates the upper inode list in overlayfs to sync only the
>> target dirty inodes and waits for completion. By doing this, it is able
>> to reduce the cost of synchronization without seriously impacting the IO
>> performance of unrelated processes. In the special case of very few
>> dirty inodes and a lot of clean upper inodes in the overlayfs, the
>> iteration may waste time so that synchronization is slower than before,
>> but we think the potential for performance improvement far surpasses the
>> potential for performance regression in most cases.
>>
>> Signed-off-by: Chengguang Xu <cgxu519@xxxxxxxxxx>
>
> This looks much better than the last version I've reviewed! Good work. Just
> some small comments below.
>
>> +void ovl_evict_inode(struct inode *inode)
>> +{
>> +        if (ovl_inode_upper(inode)) {
>> +                write_inode_now(ovl_inode_upper(inode), 1);
>> +                ovl_detach_upper_inodes_list(inode);
>> +
>> +                /*
>> +                 * ovl_sync_inodes() may wait until the
>> +                 * OVL_EVICT_PENDING flag is cleared.
>> +                 */
>> +                spin_lock(&inode->i_lock);
>> +                ovl_clear_flag(OVL_EVICT_PENDING, inode);
>> +                /* See also atomic_bitops.txt */
>> +                smp_mb__after_atomic();
>> +                wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
>> +                spin_unlock(&inode->i_lock);
>> +                clear_inode(inode);
>> +        }
>> +}
>> +
>> +void ovl_wait_evict_pending(struct inode *inode)
>> +{
>> +        struct ovl_fs *ofs = inode->i_sb->s_fs_info;
>> +        struct ovl_inode *oi = OVL_I(inode);
>> +        DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
>> +        wait_queue_head_t *wqh;
>> +        bool sleep;
>> +
>> +        if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
>
> AFAIU your code OVL_EVICT_PENDING should be always set when this function
> is called. Maybe just make a WARN_ON from this condition?

OK, I'll add it.

>
>> +                wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
>> +                prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
>> +                sleep = ovl_test_flag(OVL_EVICT_PENDING, inode);

I think we also don't have to check sleep again here, because
OVL_EVICT_PENDING will not be cleared before we unlock
ofs->upper_inodes_lock.

>> +                spin_unlock(&inode->i_lock);
>> +                spin_unlock(&ofs->upper_inodes_lock);
>> +                if (sleep)
>> +                        schedule();
>> +                finish_wait(wqh, &wait.wq_entry);
>> +        }
>> +}
>> +
>> +/**
>> + * ovl_sync_inodes
>> + * @sb: The overlayfs super block
>> + *
>> + * The upper_inodes list is used for organizing ovl inodes which have an
>> + * upper inode; we iterate the list looking for dirty upper inodes and
>> + * sync them.
>> + *
>> + * When starting to sync an inode, we add it to a wait list explicitly;
>> + * to save memory we reuse inode->i_wb_list and sb->s_inodes_wb, so those
>> + * fields have slightly different meanings in overlayfs.
>> + */
>> +static void ovl_sync_inodes(struct super_block *sb)
>> +{
>> +        struct ovl_fs *ofs = sb->s_fs_info;
>> +        struct ovl_inode *oi;
>> +        struct inode *inode;
>> +        struct inode *upper_inode;
>> +        struct blk_plug plug;
>> +        LIST_HEAD(sync_tmp_list);
>> +
>> +        struct writeback_control wbc_for_sync = {
>> +                .sync_mode   = WB_SYNC_ALL,
>> +                .for_sync    = 1,
>> +                .range_start = 0,
>> +                .range_end   = LLONG_MAX,
>> +                .nr_to_write = LONG_MAX,
>> +        };
>> +
>> +        blk_start_plug(&plug);
>> +        spin_lock(&ofs->upper_inodes_lock);
>> +        list_splice_init(&ofs->upper_inodes, &sync_tmp_list);
>> +
>> +        while (!list_empty(&sync_tmp_list)) {
>> +                oi = list_first_entry(&sync_tmp_list, struct ovl_inode,
>> +                                      upper_inodes_list);
>> +                inode = &oi->vfs_inode;
>> +                spin_lock(&inode->i_lock);
>> +
>> +                list_move_tail(&oi->upper_inodes_list, &ofs->upper_inodes);
>> +                if (inode->i_state & I_NEW) {
>> +                        spin_unlock(&inode->i_lock);
>> +                        continue;
>> +                }
>> +
>> +                /*
>> +                 * If the ovl inode state is I_FREEING, leave the sync
>> +                 * operation to ovl_evict_inode(), so wait here until
>> +                 * the OVL_EVICT_PENDING flag is cleared.
>> +                 */
>> +                if (inode->i_state & I_FREEING) {
>> +                        ovl_wait_evict_pending(inode);
>
> This is still somewhat racy. The problem is that a wakeup in
> ovl_wait_evict_pending() can come from all sorts of reasons. You are not
> guaranteed OVL_EVICT_PENDING is cleared when you are woken up. You cannot
> easily recheck the flag as the inode *may* really be free by this time.

I think DEFINE_WAIT_BIT can wake the sleeping thread only for the right
reason, and there is no other place that sets OVL_EVICT_PENDING again.
IIUC, we don't have to check OVL_EVICT_PENDING in a loop. Am I missing
something?

> What I'd do to fix this race is not to move inode with OVL_EVICT_PENDING to
> the tail of ofs->upper_inodes but just keep it in &sync_tmp_list.
> That way
> we are sure that either the inode gets freed (thus removed from
> &sync_tmp_list) and we'll never see it again, or we'll again see it in
> &sync_tmp_list, see again OVL_EVICT_PENDING, and wait again for it which is
> what we need.

If my analysis above is not correct, I'll modify it as you suggest here.

>
>> +                        goto next;
>> +                }
>> +
>> +                ihold(inode);
>> +                upper_inode = ovl_inode_upper(inode);
>> +                spin_unlock(&inode->i_lock);
>> +                spin_unlock(&ofs->upper_inodes_lock);
>> +
>> +                if (!(upper_inode->i_state & I_DIRTY_ALL)) {
>> +                        if (!mapping_tagged(upper_inode->i_mapping,
>> +                                            PAGECACHE_TAG_WRITEBACK)) {
>> +                                iput(inode);
>> +                                goto next;
>> +                        }
>> +                } else {
>> +                        sync_inode(upper_inode, &wbc_for_sync);
>> +                }
>> +
>> +                spin_lock(&sb->s_inode_wblist_lock);
>> +                list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
>> +                spin_unlock(&sb->s_inode_wblist_lock);
>> +
>> +next:
>> +                if (need_resched()) {
>> +                        blk_finish_plug(&plug);
>
> Plug is automatically unplugged when scheduling so no need for
> blk_finish_plug() here.
>
>> +                        cond_resched();
>> +                        blk_start_plug(&plug);
>> +                }
>
> Also
>
>         if (need_resched())
>                 cond_resched();
>
> is equivalent to just
>
>         cond_resched();
>
> so you can just do that here.

I will fix it.

>
>> +                spin_lock(&ofs->upper_inodes_lock);
>> +        }
>> +        spin_unlock(&ofs->upper_inodes_lock);
>> +        blk_finish_plug(&plug);
>> +}
>> +
>> +/**
>> + * ovl_wait_inodes
>> + * @sb: The overlayfs super block
>> + *
>> + * Wait for writeback of the inodes in the s_inodes_wb list; all the IO
>> + * that has been issued up to the time this function is entered is
>> + * guaranteed to be completed.
>> + */
>> +static void ovl_wait_inodes(struct super_block *sb)
>> +{
>> +        LIST_HEAD(sync_wait_list);
>> +        struct inode *inode;
>> +        struct inode *upper_inode;
>> +
>> +        /*
>> +         * An ovl inode in the sb->s_inodes_wb list has already taken an
>> +         * inode reference; this avoids silent inode dropping while
>> +         * waiting on it.
>> +         *
>> +         * Splice the sync wait list onto a temporary list to avoid
>> +         * waiting on inodes that have started writeback after this
>> +         * point.
>> +         */
>> +        spin_lock(&sb->s_inode_wblist_lock);
>> +        list_splice_init(&sb->s_inodes_wb, &sync_wait_list);
>> +
>> +        while (!list_empty(&sync_wait_list)) {
>> +                inode = list_first_entry(&sync_wait_list, struct inode,
>> +                                         i_wb_list);
>> +                list_del_init(&inode->i_wb_list);
>> +                upper_inode = ovl_inode_upper(inode);
>> +                spin_unlock(&sb->s_inode_wblist_lock);
>> +
>> +                if (!mapping_tagged(upper_inode->i_mapping,
>> +                                    PAGECACHE_TAG_WRITEBACK)) {
>> +                        goto next;
>> +                }
>> +
>> +                filemap_fdatawait_keep_errors(upper_inode->i_mapping);
>> +
>> +                if (need_resched())
>> +                        cond_resched();
>
> Again, just cond_resched() is enough here.

I will fix it.

Thanks,
Chengguang

--
To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html