On Mon, Jan 22, 2018 at 6:00 PM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>
>> On Jan 22, 2018, at 8:58 PM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>>
>> On Mon, Jan 22, 2018 at 2:05 PM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>>>>
>>>> On Jan 22, 2018, at 7:27 PM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>>>>
>>>> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>>>>> Currently a syncfs(2) call on overlayfs simply calls sync_filesystem()
>>>>> on the upper_sb to synchronize all dirty inodes in the upper
>>>>> filesystem, regardless of the overlay ownership of the inode. In the
>>>>> container use case, when multiple containers use the same underlying
>>>>> upper filesystem, this has the following shortcomings.
>>>>>
>>>>> (1) Performance
>>>>> Synchronization is probably heavy because it actually syncs
>>>>> unnecessary inodes for the target overlayfs.
>>>>>
>>>>> (2) Interference
>>>>> Unplanned synchronization will probably impact the IO performance of
>>>>> unrelated container processes on the other overlayfs instances.
>>>>>
>>>>> This patch iterates the upper inode list in overlayfs to sync only
>>>>> the target dirty inodes and waits for completion. By doing this, it
>>>>> reduces the cost of synchronization and does not seriously impact the
>>>>> IO performance of unrelated processes. In the special case of very
>>>>> few dirty inodes and a lot of clean upper inodes in the overlayfs,
>>>>> the iteration may waste time so that synchronization is slower than
>>>>> before, but we think the potential for performance improvement far
>>>>> surpasses the potential for performance regression in most cases.
>>>>>
>>>>> Signed-off-by: Chengguang Xu <cgxu519@xxxxxxxxxx>
>>>>> ---
>>>>> Changes since v5:
>>>>> - Move sync-related functions to the new file sync.c.
>>>>> - Introduce the OVL_EVICT_PENDING flag to avoid race conditions.
>>>>> - If the ovl inode state is I_WILL_FREE or I_FREEING, then syncfs
>>>>>   will wait until OVL_EVICT_PENDING is cleared.
>>>>> - Move the inode sync operation into evict_inode from drop_inode.
>>>>> - Call upper_sb->sync_fs with no-wait after syncing the dirty upper
>>>>>   inodes.
>>>>> - Hold s_inode_wblist_lock until the item is deleted from the list.
>>>>> - Remove the write_inode function because it is no longer used.
>>>>> [...]
>>>>
>>>>> +	/* We need to do this as 'inode' can be freed by now */
>>>>
>>>> Could oi_prev have been freed or removed from the upper_inodes list?
>>>> I don't see why not.
>>>> You could try to continue iteration from iput_inode if it is not NULL,
>>>> but what if it is NULL?
>>>
>>> When iterating the list we either get a reference or wait until the
>>> item gets deleted, so I think there is no chance that the previous
>>> inode becomes invalid.
>>>
>>
>> Maybe I am missing something.
>> When iterating the upper_inodes list you get a reference on the current
>> inode (oi) and maybe have a reference on the previous one via
>> iput_inode. The previous inode may have a reference from the wb_list,
>> but that reference can be lost while we are not holding the wblist
>> lock. No?
>
> The reference on the previous inode can be lost if the inode is not our
> target, but that can only happen after iput(iput_inode); before that
> point it is safe.

I was talking about the case where iput_inode is NULL.

> When deleting an item from the list we take the lock to protect against
> concurrent operations, so if we have already got the list lock, it is
> safe to touch items in the list.

But we have let go of the list lock.
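To illustrate, this is the pattern I have in mind - it is what e.g.
drop_pagecache_sb() in fs/drop_caches.c does (a simplified sketch, with
the per-inode work elided): the walk always holds a reference on the
inode it will continue from, and the previous inode is only put after
the new cursor is pinned, so it is safe to drop the list lock.

#include <linux/fs.h>

static void walk_sb_inodes(struct super_block *sb)
{
	struct inode *inode, *toput_inode = NULL;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);		/* pin the new cursor */
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		/* ... per-inode work that may sleep goes here ... */

		iput(toput_inode);	/* previous inode may be freed now */
		toput_inode = inode;	/* but the cursor stays pinned */

		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);
	iput(toput_inode);
}

The branch in your patch that sets iput_inode = NULL, because the
reference was donated to the wb list, is exactly where this invariant
is lost: the walk then continues from an inode it no longer pins.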
>>
>>>
>>>> Maybe instead of iput_inode = NULL when adding to the inodes_wb list,
>>>> get another reference to the inode and always store iput_inode = inode
>>>> to use as prev in this case?
>>>
>>> Maybe it's better than the current code, but I think it makes the
>>> behavior more complex to understand, so please let me wait for more
>>> comments on the approach.
>>>
>>
>> Sure, let's wait for other comments, but I can't say I understand what
>> is complex to understand about the change that I propose. If anything
>> it is simpler to understand, especially if you rename iput_inode to
>> prev_inode:
>>
>> +		/*
>> +		 * If the ovl inode is not on the sb->s_inodes_wb list, add
>> +		 * it to the list and increment the inode reference until IO
>> +		 * is completed on the inode.
>> +		 */
>> +		spin_lock(&sb->s_inode_wblist_lock);
>> +		if (list_empty(&inode->i_wb_list)) {
>> +			list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
>> +			ihold(inode);
>> +		}
>> +		iput_inode = inode;
>
> Actually, I forgot to compare whether the previous item is the list head
> or not, so I think there is a problem when deleting the very first item
> in the list; even if we use iput_inode here, we can't get rid of the
> list operation because of that. If you think the code should be more
> readable, how about modifying it like below?
>
> 1. take the sb->s_sync_lock mutex
> 2. splice to a local temp list
> 3. get the first entry every time to operate on
> 4. put it back onto sb->s_inodes_wb one by one
>
> I think there is not much concurrent syncfs running at the same time.

This could work. Need to see it (a sketch of what I understand you to
mean is at the end of this mail).

>>>> It would be good if you wrote the "life cycle" of ovl_inode in a
>>>> header comment to this file, because it has become non-trivial to
>>>> follow.
>>>> An explicit mention of the locking order is also very good to have
>>>> in the header comment.
>>>
>>> Let me add that in the next version.
>>>
>>> Talking about the life cycle of ovl_inode, currently I only mark
>>> OVL_EVICT_PENDING for an ovl_inode which has an upper inode; is that
>>> acceptable?
>>>
>>
>> Until your change, the ovl_inode life cycle was 'generic': evicted when
>> the refcount dropped to 0.
>> Now an inode can be in the OVL_EVICT_PENDING state, and it can be
>> waiting for sync on or off the upper_inodes list. I'm just saying it
>> would be nice to document all the possible states of an ovl_inode
>> during eviction and all the locking and refcount guarantees in one
>> place, to see that we did not miss anything.
>>
>> The volume of this documentation is entirely up to you.
>> I think it will be good for you as well to tell the whole story from
>> start to end and verify there are no holes in the plot.
>
> I did not mean only the documentation above; I was just asking for an
> opinion on the OVL_EVICT_PENDING flag.

If you meant the flag name, or using a flag in general, that seems good
to me.

> Currently the flag cycle of an ovl inode is split into different cycles
> depending on whether it has an upper inode or not.
>
> 1. having an upper inode
> (allocation) -> new -> none -> OVL_EVICT_PENDING|I_FREEING ->
> I_FREEING|I_CLEAR -> (destruction)
>

That's good. If there are other conditions that guarantee that the state
transitions are not racy, like the list locks, the concurrent sync lock
and the elevated refcount, you should mention them.

Cheers,
Amir.
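P.S. Here is roughly what I understand you to mean by steps 1-4, modeled
on wait_sb_inodes() in fs/fs-writeback.c (a sketch only: the function
name ovl_wait_inodes is made up, and the plain spin_lock matches your
snippet above, while the VFS version uses the _irq variants):

#include <linux/fs.h>
#include <linux/sched.h>

static void ovl_wait_inodes(struct super_block *sb)
{
	LIST_HEAD(sync_list);

	/* 1. serialize walkers on the per-sb sync mutex */
	mutex_lock(&sb->s_sync_lock);
	spin_lock(&sb->s_inode_wblist_lock);
	/*
	 * 2. splice the whole wb list to a local temp list, so that
	 * concurrent add/delete cannot move our cursor
	 */
	list_splice_init(&sb->s_inodes_wb, &sync_list);

	while (!list_empty(&sync_list)) {
		/* 3. always operate on the first entry of the temp list */
		struct inode *inode = list_first_entry(&sync_list,
					struct inode, i_wb_list);

		/* 4. put it back onto s_inodes_wb one by one */
		list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);

		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);	/* pin it across the unlocked wait */
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_wblist_lock);

		filemap_fdatawait_keep_errors(inode->i_mapping);
		cond_resched();
		iput(inode);

		spin_lock(&sb->s_inode_wblist_lock);
	}
	spin_unlock(&sb->s_inode_wblist_lock);
	mutex_unlock(&sb->s_sync_lock);
}

With the splice, no oi_prev cursor is needed at all: the head of the
local list is always the next inode to process, which is why I think
this could end up simpler than the deferred-iput walk.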