Re: [PATCH v6] ovl: Improving syncfs efficiency

On Tue, Jan 23, 2018 at 10:12 AM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>>
>> On Jan 23, 2018, at 2:39 PM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>>
>> On Tue, Jan 23, 2018 at 2:24 AM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>>>>
>>>> On Jan 23, 2018, at 2:45 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>>>>
>>>> On Mon, Jan 22, 2018 at 6:00 PM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On Jan 22, 2018, at 8:58 PM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On Mon, Jan 22, 2018 at 2:05 PM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On Jan 22, 2018, at 7:27 PM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>>>>>>>>> Currently a syncfs(2) call on overlayfs simply calls sync_filesystem()
>>>>>>>>> on upper_sb to synchronize all dirty inodes in the upper filesystem,
>>>>>>>>> regardless of which overlay owns each inode. In the container use case,
>>>>>>>>> when multiple containers share the same underlying upper filesystem,
>>>>>>>>> this has the following shortcomings.
>>>>>>>>>
>>>>>>>>> (1) Performance
>>>>>>>>> Synchronization may be heavy because it also syncs inodes that are
>>>>>>>>> unnecessary for the target overlayfs.
>>>>>>>>>
>>>>>>>>> (2) Interference
>>>>>>>>> Unplanned synchronization will likely impact the IO performance of
>>>>>>>>> unrelated container processes on other overlayfs instances.
>>>>>>>>>
>>>>>>>>> This patch iterates the upper inode list in overlayfs to sync only the
>>>>>>>>> target dirty inodes and wait for completion. This reduces the cost of
>>>>>>>>> synchronization and avoids seriously impacting the IO performance of
>>>>>>>>> unrelated processes. In the special case of very few dirty inodes and
>>>>>>>>> many clean upper inodes in the overlayfs, the iteration may waste time
>>>>>>>>> and make synchronization slower than before, but we think the potential
>>>>>>>>> for performance improvement far surpasses the potential for performance
>>>>>>>>> regression in most cases.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Chengguang Xu <cgxu519@xxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>> Changes since v5:
>>>>>>>>> - Move sync-related functions to new file sync.c.
>>>>>>>>> - Introduce OVL_EVICT_PENDING flag to avoid race conditions.
>>>>>>>>> - If ovl inode state is I_WILL_FREE or I_FREEING then syncfs
>>>>>>>>> will wait until OVL_EVICT_PENDING is cleared.
>>>>>>>>> - Move inode sync operation into evict_inode from drop_inode.
>>>>>>>>> - Call upper_sb->sync_fs with no-wait after syncing dirty upper
>>>>>>>>> inodes.
>>>>>>>>> - Hold s_inode_wblist_lock until deleting item from list.
>>>>>>>>> - Remove write_inode function because it is no longer used.
>>>>>>>>>
>>>> [...]
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> +                               /* We need to do this as 'inode' can be freed by now */
>>>>>>>>
>>>>>>>> Could oi_prev have been freed or removed from upper_inodes list?
>>>>>>>> I don't see why not.
>>>>>>>> You could try to continue iteration from iput_inode if it is not NULL,
>>>>>>>> but what if it is NULL?
>>>>>>>
>>>>>>> When iterating the list we either take a reference or wait until the item gets deleted,
>>>>>>> so I think there is no chance for the previous inode to become NULL.
>>>>>>>
>>>>>>
>>>>>> Maybe I am missing something.
>>>>>> When iterating the upper_inodes list you get a reference on current (oi)
>>>>>> and maybe have a reference on previous via iput_inode.
>>>>>> The previous inode may have a reference from the wb_list, but that reference
>>>>>> can be lost while we are not holding the wblist lock. No?
>>>>>
>>>>> The reference on the previous inode can be lost if the inode is not our target,
>>>>> but that can only happen after iput(iput_inode); before that point it is safe.
>>>>>
>>>>
>>>> I was talking about the case where iput_inode is NULL.
>>>
>>> If iput_inode is NULL, the reference on the previous inode is not put until the
>>> wait finishes, so this case is safer than when iput_inode has a value. Meanwhile,
>>> as long as the inode has a reference, it won't get deleted from the list.
>>>
>>
>> Everything you write may be true, but when you write "list" and "lock" without
>> specifying which "list" and which "lock", the explanation is not clear.
>>
>> If an inode has a reference it won't get deleted from the upper_inodes list,
>> but it can be deleted from the s_inodes_wb list once the wait for io is completed,
>> and then lose the last reference and get deleted from the upper_inodes list.
>>
>> As far as I can tell, ovl_sync_inodes() and ovl_wait_inodes() are not
>> synchronized by any exclusive lock, so when you let go of all the
>> ovl_wait_inodes() locks for finish_wait(), you have no guarantee on any
>> of the inodes on upper_inodes anymore, and specifically no guarantee
>> that oi_prev is alive.
>
> I'm guessing you are worried about the case where one syncfs(A) is waiting for IO
> to complete on inode 100, while at the same time another syncfs(B) is iterating over
> inode 200, which is next after inode 100 in the upper_inodes list. Right?

Not exactly. inode 100 can be put on the wb_list by syncfs(B) without holding
a reference in iput_inode. Then syncfs(C) can start, see that inode 100 is
already on the wb_list, and not see inode 200 on the upper list at all. syncfs(C)
will call ovl_wait_inodes(), which will drop the reference on inode 100, and then
eviction of inode 100 can start and complete while syncfs(B) is still waiting on
EVICT_PENDING of inode 200. By the time the wait for inode 200 is finished,
inode 100 is already gone.
This is a complex situation and I can't really say whether it is possible or whether
something will prevent it, but the best way to avoid it is to have clear
semantics, so we won't need to have this discussion at all.

>
> In this case, the second syncfs(B) will take another reference on inode 100 in
> ovl_sync_inodes(), so even after the refcount of inode 100 is dropped in
> ovl_wait_inodes() for syncfs(A), inode 100 is still safe to use in
> ovl_sync_inodes() for syncfs(B).
>
>
>>
>> As far as I can tell, your suggestion to synchronize both ovl_sync_inodes()
>> and ovl_wait_inodes() with sb->s_sync_lock will take care of that problem.
>> Since in the current implementation both functions are called with wait == 1
>> anyway, I don't see any problem with synchronizing the two functions. Do you?
>>
>
> I did some basic tests (heavy untar && concurrent syncfs) to verify my proposal,
> and so far there seems to be no problem. But I want to say my proposal is not
> mainly for addressing the synchronization problem between ovl_sync_inodes()
> and ovl_wait_inodes(); I just want the code to look simple and easy to understand.
>
>
>> In any case, if you do agree with my analysis and add sb->s_sync_lock
>> to ovl_wait_inodes(), please document the reasons that make the upper_inodes
>> iteration in ovl_wait_inodes() safe, because I don't think they are as trivial
>> as you make them sound.
>
> I think if we switch to my proposal, the code will be easy enough to understand,
> so there will not be much doubt about the safety of the list iteration.
>

I'm not sure if you are new to kernel development, but understandable code
is no less important than correct code. You are not really writing for the
compiler; you are writing for reviewers and future developers who may make
subtle changes to your code. If the semantics are not perfectly clear, then
even if your code is correct now, it could easily be broken by a future
developer.

So yes, your proposal sounds good in that regard.
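
Just to make the suggestion concrete, the shape I have in mind is roughly the
following. This is only a sketch, not the actual code: ovl_sync_inodes() and
ovl_wait_inodes() are the helpers from your patch (whose exact signatures I am
guessing), the ofs->upper_mnt plumbing is my assumption about how ovl_sync_fs()
is wired up, and error handling plus the upper sb locking details are omitted:

        static int ovl_sync_fs(struct super_block *sb, int wait)
        {
                struct ovl_fs *ofs = sb->s_fs_info;
                struct super_block *upper_sb;
                int ret = 0;

                /* Nothing to sync on a lower-only overlay */
                if (!ofs->upper_mnt)
                        return 0;
                upper_sb = ofs->upper_mnt->mnt_sb;

                /*
                 * In the current discussion both helpers only matter for the
                 * wait == 1 pass of syncfs(), so skip the async pass here.
                 */
                if (!wait)
                        return 0;

                /*
                 * Serialize concurrent syncfs() callers on this overlay sb so
                 * that ovl_wait_inodes() never overlaps another caller's
                 * ovl_sync_inodes() walk of the upper_inodes list.
                 */
                mutex_lock(&sb->s_sync_lock);
                ovl_sync_inodes(sb);    /* start writeback of our upper inodes */
                ovl_wait_inodes(sb);    /* wait for that writeback to complete */
                mutex_unlock(&sb->s_sync_lock);

                /* Let the upper filesystem flush its own metadata/journal */
                if (upper_sb->s_op->sync_fs)
                        ret = upper_sb->s_op->sync_fs(upper_sb, wait);

                return ret;
        }

The only point that matters here is that both phases run under the same
sb->s_sync_lock, much like wait_sb_inodes() serializes under s_sync_lock for a
plain sync; everything else above is illustrative.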

Cheers,
Amir.