Hi Kirill,

On 4/29/2023 4:46 PM, Kirill A. Shutemov wrote:
> On Sat, Apr 29, 2023 at 04:32:34PM +0800, Yin, Fengwei wrote:
>> Hi Kirill,
>>
>> On 4/28/2023 10:02 PM, Kirill A. Shutemov wrote:
>>> On Fri, Apr 28, 2023 at 02:28:07PM +0800, Yin, Fengwei wrote:
>>>> Hi Kirill,
>>>>
>>>> On 4/25/2023 8:38 PM, Kirill A. Shutemov wrote:
>>>>> On Tue, Apr 25, 2023 at 04:46:26PM +0800, Yin Fengwei wrote:
>>>>>> free_transhuge_page() acquires split queue lock then check
>>>>>> whether the THP was added to deferred list or not.
>>>>>>
>>>>>> It's safe to check whether the THP is in deferred list or not.
>>>>>> When code hit free_transhuge_page(), there is no one tries
>>>>>> to update the folio's _deferred_list.
>>>>>>
>>>>>> If folio is not in deferred_list, it's safe to check without
>>>>>> acquiring lock.
>>>>>>
>>>>>> If folio is in deferred_list, the other node in deferred_list
>>>>>> adding/deleteing doesn't impact the return value of
>>>>>> list_epmty(@folio->_deferred_list).
>>>>>
>>>>> Typo.
>>>>>
>>>>>>
>>>>>> Running page_fault1 of will-it-scale + order 2 folio for anonymous
>>>>>> mapping with 96 processes on an Ice Lake 48C/96T test box, we could
>>>>>> see the 61% split_queue_lock contention:
>>>>>> -   71.28%     0.35%  page_fault1_pro  [kernel.kallsyms]  [k] release_pages
>>>>>>    - 70.93% release_pages
>>>>>>       - 61.42% free_transhuge_page
>>>>>>          + 60.77% _raw_spin_lock_irqsave
>>>>>>
>>>>>> With this patch applied, the split_queue_lock contention is less
>>>>>> than 1%.
>>>>>>
>>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@xxxxxxxxx>
>>>>>> Tested-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>>>> ---
>>>>>>  mm/huge_memory.c | 19 ++++++++++++++++---
>>>>>>  1 file changed, 16 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 032fb0ef9cd1..c620f1f12247 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page)
>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>  	unsigned long flags;
>>>>>>
>>>>>> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>>>> -	if (!list_empty(&folio->_deferred_list)) {
>>>>>> +	/*
>>>>>> +	 * At this point, there is no one trying to queue the folio
>>>>>> +	 * to deferred_list. folio->_deferred_list is not possible
>>>>>> +	 * being updated.
>>>>>> +	 *
>>>>>> +	 * If folio is already added to deferred_list, add/delete to/from
>>>>>> +	 * deferred_list will not impact list_empty(&folio->_deferred_list).
>>>>>> +	 * It's safe to check list_empty(&folio->_deferred_list) without
>>>>>> +	 * acquiring the lock.
>>>>>> +	 *
>>>>>> +	 * If folio is not in deferred_list, it's safe to check without
>>>>>> +	 * acquiring the lock.
>>>>>> +	 */
>>>>>> +	if (data_race(!list_empty(&folio->_deferred_list))) {
>>>>>> +		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>>>
>>>>> Recheck under lock?
>>>> In function deferred_split_scan(), there is following code block:
>>>>
>>>> 	if (folio_try_get(folio)) {
>>>> 		list_move(&folio->_deferred_list, &list);
>>>> 	} else {
>>>> 		/* We lost race with folio_put() */
>>>> 		list_del_init(&folio->_deferred_list);
>>>> 		ds_queue->split_queue_len--;
>>>> 	}
>>>>
>>>> I am wondering what kind of "lost race with folio_put()" can be.
>>>>
>>>> My understanding is that it's not necessary to handle this case here
>>>> because free_transhuge_page() will handle it once folio get zero ref.
>>>> But I must miss something here. Thanks.
>>>
>>> free_transhuge_page() got called when refcount is already zero. Both
>>> deferred_split_scan() and free_transhuge_page() can see the page with
>>> zero refcount. The check makes deferred_split_scan() leave the page to
>>> free_transhuge_page().
>>>
>> If deferred_split_scan() leaves the page to free_transhuge_page(), is it
>> necessary to do
>>
>> 	list_del_init(&folio->_deferred_list);
>> 	ds_queue->split_queue_len--;
>>
>> Can these two lines be left to free_transhuge_page() as well? Thanks.
>
> I *think* (my cache is cold on deferred split) we can. But since we
> already hold the lock, why not take care of it? It makes your change more
> efficient.
Thanks a lot for your confirmation. I just wanted to make sure I
understand the race here correctly (I didn't notice this part of the code
before Ying pointed it out).


Regards
Yin, Fengwei