On 01/09/2024 08:48, Bang Li wrote: > hi, Usama > > On 2024/8/22 3:04, Usama Arif wrote: > >> >> On 20/08/2024 17:30, Barry Song wrote: >> >>> Hi Usama, >>> thanks! I can't judge if we need this partially_mapped flag. but if we >>> need, the code >>> looks correct to me. I'd like to leave this to David and other experts to ack. >>> >> Thanks for the reviews! >> >>> an alternative approach might be two lists? one for entirely_mapped, >>> the other one >>> for split_deferred. also seems ugly ? >>> >> That was my very first prototype! I shifted to using a bool which I sent in v1, and then a bit in _flags_1 as David suggested. I believe a bit in _flags_1 is the best way forward, as it leaves the most space in folio for future work. >> >>> On the other hand, when we want to extend your patchset to mTHP other than PMD- >>> order, will the only deferred_list create huge lock contention while >>> adding or removing >>> folios from it? >>> >> Yes, I would imagine so. the deferred_split_queue is per memcg/node, so that helps. >> >> Also, this work is tied to khugepaged. So would need some thought when doing it for mTHP. >> >> I would imagine doing underused shrinker for mTHP would be less beneficial compared to doing it for 2M THP. But probably needs experimentation. >> >> Thanks > > Below is the core code snippet to support "split underused mTHP". Can we extend the > khugepaged_max_ptes_none value to mthp and keep its semantics unchanged? With a small > modification, Only folios with page numbers greater than khugepaged_max_ptes_none - 1 > can be added to the deferred_split list and can be split. What do you think? > hmm, so I believe its not as simple as that. First mTHP support would need to be added to khugepaged. The entire khugepaged code needs to be audited for it and significantly tested. If you just look at all the instances of HPAGE_PMD_NR in khugepaged.c, you will see it will be a significant change and needs to be a series of its own. Also, different values of max_ptes_none can have different significance for mTHP sizes. max_ptes_none of 200 does not mean anything for 512K and below, it means 78% of a 1M mTHP and 39% of a 2M THP. We might want different max_ptes_none for each mTHP if we were to do this. The benefit of splitting underused mTHPs of lower order might not be worth the work needed to split. Someone needs to experiment with different workloads and see some improvement for these lower orders. So probably a lot more than the below diff is needed for mTHP support! > diff --git a/mm/memory.c b/mm/memory.c > index b95fce7d190f..ef503958d6a0 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4789,6 +4789,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > } > > folio_ref_add(folio, nr_pages - 1); > + if (nr_pages > 1 && nr_pages > khugepaged_max_ptes_none - 1) > + deferred_split_folio(folio, false); > add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); > count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC); > folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); > > shmem THP has the same memory expansion problem when the shmem_enabled configuration is > set to always. In my opinion, it is necessary to support "split underused shmem THP", > but I am not sure if there is any gap in the implementation? > > Bang > Thanks >