On 3/18/24 23:35, Ryan Roberts wrote:
> On 18/03/2024 10:05, David Hildenbrand wrote:
>> On 18.03.24 11:00, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>>
>>>>> Hi Yin Fengwei,
>>>>>
>>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>>
>>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>>> unnecessarily.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>>>>>>> ---
>>>>>>>>>   mm/vmscan.c | 9 +++++----
>>>>>>>>>   1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>>>                           if (!can_split_folio(folio, NULL))
>>>>>>>>>                               goto activate_locked;
>>>>>>>>>                           /*
>>>>>>>>> -                        * Split folios without a PMD map right
>>>>>>>>> -                        * away. Chances are some or all of the
>>>>>>>>> -                        * tail pages can be freed without IO.
>>>>>>>>> +                        * Split partially mapped folios map
>>>>>>>>> +                        * right away. Chances are some or all
>>>>>>>>> +                        * of the tail pages can be freed
>>>>>>>>> +                        * without IO.
>>>>>>>>>                           */
>>>>>>>>> -                        if (!folio_entire_mapcount(folio) &&
>>>>>>>>> +                        if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>>                               split_folio_to_list(folio,
>>>>>>>>>                                       folio_list))
>>>>>>>>>                               goto activate_locked;
>>>>>>>>
>>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>>
>>>>>>> I asked that exact question to Matthew in another context but didn't get a
>>>>>>> response. There are examples of checking if the deferred list is empty with
>>>>>>> and without data_race() in the code base. But list_empty() is implemented
>>>>>>> like this:
>>>>>>>
>>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>>> {
>>>>>>>     return READ_ONCE(head->next) == head;
>>>>>>> }
>>>>>>>
>>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps
>>>>>>> not sufficient for KCSAN?
>>> I don't think READ_ONCE() can replace the lock.
>
> But it does ensure we get a consistent value and that the compiler orders the
> load correctly. There are lots of patterns in the kernel that use READ_ONCE()
> without a lock and they don't use data_race() - e.g. ptep_get_lockless().
They (ptep_get_lockless() and the deferred_list check) have different access
patterns (or race patterns) here. I don't think they are comparable.

>
> It sounds like none of us really understand what data_race() is for, so I guess
> I'll just do a KCSAN build and invoke the code path to see if it complains.
The READ_ONCE() in list_empty() will silence KCSAN here as well.

>
>
>>>
>>>>>>
>>>>>> Yeah, there is only one use of data_race with that list.
>>>>>>
>>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>>> deferred list").
>>>>>>
>>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>>> that it is not actually required.
>>>>>>
>>>>>> If not required, likely we should just cleanup the single user.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@xxxxxxxxx/
>>>>>
>>>>> Do you have any recollection of why you added the data_race() markup?
>>>>
>>>> Per my understanding, this is used to mark that the code accesses
>>>> folio->_deferred_list without lock intentionally, while
>>>> folio->_deferred_list may be changed in parallel. IIUC, this is what
>>>> data_race() is used for. Or, my understanding is wrong?
>>> Yes. This is my understanding also.
>>
>> Why don't we have a data_race() in deferred_split_folio() then, before taking
>> the lock?
>>
>> It's used a bit inconsistently here.
>>
>
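
Just for illustration (a sketch only, not a proposed change): if we did decide
the annotation is worthwhile, the new check in shrink_folio_list() would only
need data_race() wrapped around the list_empty() call, something like:

                        /*
                         * Sketch: the same lockless check as in the patch
                         * above, wrapped in data_race() to document that a
                         * concurrent writer may modify
                         * folio->_deferred_list and that whatever value we
                         * observe here is acceptable. As noted above,
                         * list_empty() already uses READ_ONCE() for the
                         * load, so KCSAN should not flag the unannotated
                         * form either; the wrapper is mainly documentation
                         * that the unlocked read is intentional.
                         */
                        if (data_race(!list_empty(&folio->_deferred_list)) &&
                            split_folio_to_list(folio, folio_list))
                                goto activate_locked;

Either way the generated code is the same, so this is purely about how we want
to document lockless reads of the deferred list.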