> On Sep 9, 2023, at 04:53, Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>
> On 09/07/23 11:54, Mike Kravetz wrote:
>> On 09/07/23 11:33, Muchun Song wrote:
>>>
>>>
>>>> On Sep 7, 2023, at 05:12, Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>>>>
>>>> On 09/06/23 16:07, Muchun Song wrote:
>>>>>> On Sep 6, 2023, at 15:33, Muchun Song <muchun.song@xxxxxxxxx> wrote:
>>>>>> On 2023/9/6 05:44, Mike Kravetz wrote:
>>>>>>> When removing hugetlb pages from the pool, we first create a list
>>>>>>> of removed pages and then free those pages back to low level allocators.
>>>>>>> Part of the 'freeing process' is to restore vmemmap for all base pages
>>>>>>> if necessary. Pass this list of pages to a new routine
>>>>>>> hugetlb_vmemmap_restore_folios() so that vmemmap restoration can be
>>>>>>> performed in bulk.
>>>>>>>
>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>>>>>>> ---
>>>>>>>  mm/hugetlb.c         |  3 +++
>>>>>>>  mm/hugetlb_vmemmap.c | 13 +++++++++++++
>>>>>>>  mm/hugetlb_vmemmap.h |  5 +++++
>>>>>>>  3 files changed, 21 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>> index 554be94b07bd..dd2dbc256172 100644
>>>>>>> --- a/mm/hugetlb.c
>>>>>>> +++ b/mm/hugetlb.c
>>>>>>> @@ -1838,6 +1838,9 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
>>>>>>>  {
>>>>>>>  	struct folio *folio, *t_folio;
>>>>>>> +	/* First restore vmemmap for all pages on list. */
>>>>>>> +	hugetlb_vmemmap_restore_folios(h, list);
>>>>>>> +
>>>>>>>  	list_for_each_entry_safe(folio, t_folio, list, lru) {
>>>>>>>  		update_and_free_hugetlb_folio(h, folio, false);
>>>>>>>  		cond_resched();
>>>>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>>>>> index ac5577d372fe..79de984919ef 100644
>>>>>>> --- a/mm/hugetlb_vmemmap.c
>>>>>>> +++ b/mm/hugetlb_vmemmap.c
>>>>>>> @@ -481,6 +481,19 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>>>>>>>  	return ret;
>>>>>>>  }
>>>>>>> +/*
>>>>>>> + * This function will attempt to restore vmemmap for a list of folios. There
>>>>>>> + * is no guarantee that restoration will be successful for all or any folios.
>>>>>>> + * This is used in bulk operations, and no feedback is given to the caller.
>>>>>>> + */
>>>>>>> +void hugetlb_vmemmap_restore_folios(const struct hstate *h, struct list_head *folio_list)
>>>>>>> +{
>>>>>>> +	struct folio *folio;
>>>>>>> +
>>>>>>> +	list_for_each_entry(folio, folio_list, lru)
>>>>>>> +		(void)hugetlb_vmemmap_restore(h, &folio->page);
>>>>>>
>>>>>> I am curious about the purpose of the "void" here; it seems it is not
>>>>>> necessary, right? We can see so many places where we do not add the void
>>>>>> cast when the caller does not care about the return value of the callee.
>>>>>
>>>>> Another question: should we stop restoring vmemmap pages when
>>>>> hugetlb_vmemmap_restore() fails? In that case, I suspect there is
>>>>> probably no memory available, so there is no need to continue, right?
>>>>
>>>> Recall that the list of hugetlb pages may be from multiple nodes. My first
>>>> thought was that we should continue because memory allocation may fail on one
>>>> node but succeed on another. However, with
>>>> https://lore.kernel.org/linux-mm/20230905031312.91929-1-yuancan@xxxxxxxxxx/
>>>> memory allocation should fall back to other nodes. So, yes, I do believe it
>>>> would make sense to stop when hugetlb_vmemmap_restore returns ENOMEM, as
>>>> we are unlikely to make forward progress.
>>>
>>> Agree.
>>>
>>>>
>>>> Today's behavior will try to restore vmemmap for all pages. No stopping
>>>> on error.
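
For illustration only, stopping on ENOMEM could look roughly like the sketch
below. This is not the posted patch; the only difference from the hunk quoted
above is the added check for -ENOMEM.

	/*
	 * Sketch, not the posted code: stop walking the list on the first
	 * -ENOMEM from hugetlb_vmemmap_restore(), since further allocations
	 * are unlikely to succeed.
	 */
	void hugetlb_vmemmap_restore_folios(const struct hstate *h,
					    struct list_head *folio_list)
	{
		struct folio *folio;

		list_for_each_entry(folio, folio_list, lru) {
			if (hugetlb_vmemmap_restore(h, &folio->page) == -ENOMEM)
				break;
		}
	}
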
>>>>
>>>> I have mixed thoughts on this. Quitting on error 'seems reasonable'.
>>>> However, if we continue we 'might' be able to allocate vmemmap for one
>>>> hugetlb page. And, if we free one hugetlb page that should provide
>>>> vmemmap for several more and we may be able to free most pages on the
>>>> list.
>>>
>>> Yes. A good point. But there would need to be a non-optimized huge page
>>> being freed somewhere in parallel; otherwise we still cannot allocate memory.
>>
>> It does not have to be another huge page being freed in parallel. It
>> could be that when allocating vmemmap for a 1G hugetlb page we were one
>> (4K) page short of what was required. If someone else frees a 4K page,
>> freeing the next 1G page may succeed.

Right. I missed that.

>> --
>> Mike Kravetz
>>
>>> However, the freeing operation happens after hugetlb_vmemmap_restore_folios.
>>> If we want to handle this, we should rework update_and_free_pages_bulk()
>>> to retry restoration once at least one huge page has been freed.
>
> This seemed familiar. Recall this patch which Muchun Reviewed and James Acked,
> https://lore.kernel.org/linux-mm/20230718004942.113174-3-mike.kravetz@xxxxxxxxxx/
>
> If we cannot restore vmemmap for a page, then it must be turned into a
> surplus huge page. In this patch (not the previous one referenced), we
> will try to restore vmemmap one more time in a later call to
> update_and_free_hugetlb_folio. Certainly, we do not want to try twice!
>
> My 'plan' is to include the previous patch as part of this series. With
> that patch in place, the list_for_each_entry calls to hugetlb_vmemmap_restore
> can be replaced with a call to hugetlb_vmemmap_restore_folios. We would
> change the behavior of hugetlb_vmemmap_restore_folios to return an error
> instead of being of type void. If an error is returned, then we will
> make another pass through the list looking for unoptimized pages and add
> them as surplus.
>
> I think it best if we try to restore vmemmap at least once before
> converting to a surplus page.

Makes sense.

Muchun

> --
> Mike Kravetz
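
A rough sketch of that plan, for illustration only and not the actual
follow-up series: it assumes hugetlb_vmemmap_restore_folios() is reworked to
return an error such as -ENOMEM when any folio is left unrestored, and it uses
the existing hugetlb internals add_hugetlb_folio() and
folio_test_hugetlb_vmemmap_optimized() to stand in for the "add them as
surplus" pass.

	static void update_and_free_pages_bulk(struct hstate *h,
					       struct list_head *list)
	{
		struct folio *folio, *t_folio;

		/*
		 * Try to restore vmemmap for all folios in one pass.  Assumes
		 * the reworked hugetlb_vmemmap_restore_folios() returns an
		 * error (e.g. -ENOMEM) if any folio could not be restored.
		 */
		if (hugetlb_vmemmap_restore_folios(h, list)) {
			/*
			 * Second pass: a folio whose vmemmap is still
			 * optimized cannot be freed safely, so add it back
			 * to the pool as a surplus huge page instead.
			 */
			spin_lock_irq(&hugetlb_lock);
			list_for_each_entry_safe(folio, t_folio, list, lru) {
				if (folio_test_hugetlb_vmemmap_optimized(folio)) {
					list_del(&folio->lru);
					add_hugetlb_folio(h, folio, true);
				}
			}
			spin_unlock_irq(&hugetlb_lock);
		}

		/* Free whatever is left on the list. */
		list_for_each_entry_safe(folio, t_folio, list, lru) {
			update_and_free_hugetlb_folio(h, folio, false);
			cond_resched();
		}
	}

The retry-once-a-huge-page-is-freed idea discussed above is deliberately left
out of the sketch; as noted in the thread, update_and_free_hugetlb_folio()
still makes a later attempt for any folio that reaches it unrestored, so
vmemmap is tried at least once before a page is converted to surplus.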