Hi Kirill,

On 4/29/2023 4:46 PM, Kirill A. Shutemov wrote:
> On Sat, Apr 29, 2023 at 04:32:34PM +0800, Yin, Fengwei wrote:
>> Hi Kirill,
>>
>> On 4/28/2023 10:02 PM, Kirill A. Shutemov wrote:
>>> On Fri, Apr 28, 2023 at 02:28:07PM +0800, Yin, Fengwei wrote:
>>>> Hi Kirill,
>>>>
>>>> On 4/25/2023 8:38 PM, Kirill A. Shutemov wrote:
>>>>> On Tue, Apr 25, 2023 at 04:46:26PM +0800, Yin Fengwei wrote:
>>>>>> free_transhuge_page() acquires split queue lock then check
>>>>>> whether the THP was added to deferred list or not.
>>>>>>
>>>>>> It's safe to check whether the THP is in deferred list or not.
>>>>>> When code hit free_transhuge_page(), there is no one tries
>>>>>> to update the folio's _deferred_list.
>>>>>>
>>>>>> If folio is not in deferred_list, it's safe to check without
>>>>>> acquiring lock.
>>>>>>
>>>>>> If folio is in deferred_list, the other node in deferred_list
>>>>>> adding/deleteing doesn't impact the return value of
>>>>>> list_epmty(@folio->_deferred_list).
>>>>>
>>>>> Typo.
>>>>>
>>>>>>
>>>>>> Running page_fault1 of will-it-scale + order 2 folio for anonymous
>>>>>> mapping with 96 processes on an Ice Lake 48C/96T test box, we could
>>>>>> see the 61% split_queue_lock contention:
>>>>>> -   71.28%     0.35%  page_fault1_pro  [kernel.kallsyms]  [k] release_pages
>>>>>>    - 70.93% release_pages
>>>>>>       - 61.42% free_transhuge_page
>>>>>>          + 60.77% _raw_spin_lock_irqsave
>>>>>>
>>>>>> With this patch applied, the split_queue_lock contention is less
>>>>>> than 1%.
>>>>>>
>>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@xxxxxxxxx>
>>>>>> Tested-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>>>> ---
>>>>>>  mm/huge_memory.c | 19 ++++++++++++++++---
>>>>>>  1 file changed, 16 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 032fb0ef9cd1..c620f1f12247 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page)
>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>  	unsigned long flags;
>>>>>>
>>>>>> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>>>> -	if (!list_empty(&folio->_deferred_list)) {
>>>>>> +	/*
>>>>>> +	 * At this point, there is no one trying to queue the folio
>>>>>> +	 * to deferred_list. folio->_deferred_list is not possible
>>>>>> +	 * being updated.
>>>>>> +	 *
>>>>>> +	 * If folio is already added to deferred_list, add/delete to/from
>>>>>> +	 * deferred_list will not impact list_empty(&folio->_deferred_list).
>>>>>> +	 * It's safe to check list_empty(&folio->_deferred_list) without
>>>>>> +	 * acquiring the lock.
>>>>>> +	 *
>>>>>> +	 * If folio is not in deferred_list, it's safe to check without
>>>>>> +	 * acquiring the lock.
>>>>>> +	 */
>>>>>> +	if (data_race(!list_empty(&folio->_deferred_list))) {
>>>>>> +		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>>>
>>>>> Recheck under lock?
>>>> In function deferred_split_scan(), there is following code block:
>>>>
>>>> 	if (folio_try_get(folio)) {
>>>> 		list_move(&folio->_deferred_list, &list);
>>>> 	} else {
>>>> 		/* We lost race with folio_put() */
>>>> 		list_del_init(&folio->_deferred_list);
>>>> 		ds_queue->split_queue_len--;
>>>> 	}
>>>>
>>>> I am wondering what kind of "lost race with folio_put()" can be.
>>>>
>>>> My understanding is that it's not necessary to handle this case here
>>>> because free_transhuge_page() will handle it once folio get zero ref.
>>>> But I must miss something here. Thanks.
>>>
>>> free_transhuge_page() got called when refcount is already zero. Both
>>> deferred_split_scan() and free_transhuge_page() can see the page with
>>> zero refcount. The check makes deferred_split_scan() leave the page to
>>> free_transhuge_page().
>>>
>> If deferred_split_scan() leaves the page to free_transhuge_page(), is it
>> necessary to do
>>
>> 	list_del_init(&folio->_deferred_list);
>> 	ds_queue->split_queue_len--;
>>
>> Can these two lines be left to free_transhuge_page() as well? Thanks.
>
> I *think* (my cache is cold on deferred split) we can. But since we
> already hold the lock, why not take care of it? It makes your change more
> efficient.
Thanks a lot for your confirmation. I just wanted to make sure I
understand the race here correctly (I didn't notice this part of the code
before Ying pointed it out).


Regards
Yin, Fengwei