On 22 Mar 2022, at 22:31, Miaohe Lin wrote:

> On 2022/3/22 22:30, Zi Yan wrote:
>> On 21 Mar 2022, at 23:21, Miaohe Lin wrote:
>>
>>> On 2022/3/21 22:21, Zi Yan wrote:
>>>> From: Zi Yan <ziy@xxxxxxxxxx>
>>>>
>>>> To split a THP to any lower order pages, we need to reform THPs on
>>>> subpages at given order and add page refcount based on the new page
>>>> order. Also we need to reinitialize page_deferred_list after removing
>>>> the page from the split_queue, otherwise a subsequent split will see
>>>> list corruption when checking the page_deferred_list again.
>>>>
>>>> It has many uses, like minimizing the number of pages after
>>>> truncating a pagecache THP. For anonymous THPs, we can only split them
>>>> to order-0 like before until we add support for any size anonymous THPs.
>>>>
>>>> Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
>>>> ---
>>>>  include/linux/huge_mm.h |   8 +++
>>>>  mm/huge_memory.c        | 111 ++++++++++++++++++++++++++++++----------
>>>>  2 files changed, 91 insertions(+), 28 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 2999190adc22..c7153cd7e9e4 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -186,6 +186,8 @@ void free_transhuge_page(struct page *page);
>>>>
>>>>  bool can_split_folio(struct folio *folio, int *pextra_pins);
>>>>  int split_huge_page_to_list(struct page *page, struct list_head *list);
>>>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> +                unsigned int new_order);
>>>>  static inline int split_huge_page(struct page *page)
>>>>  {
>>>>          return split_huge_page_to_list(page, NULL);
>>>> @@ -355,6 +357,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list)
>>>>  {
>>>>          return 0;
>>>>  }
>>>> +static inline int
>>>> +split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> +                unsigned int new_order)
>>>> +{
>>>> +        return 0;
>>>> +}
>>>>  static inline int split_huge_page(struct page *page)
>>>>  {
>>>>          return 0;
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index fcfa46af6c4c..3617aa3ad0b1 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -2236,11 +2236,13 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
>>>>  static void unmap_page(struct page *page)
>>>>  {
>>>>          struct folio *folio = page_folio(page);
>>>> -        enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
>>>> -                TTU_SYNC;
>>>> +        enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SYNC;
>>>>
>>>>          VM_BUG_ON_PAGE(!PageHead(page), page);
>>>>
>>>> +        if (folio_order(folio) >= HPAGE_PMD_ORDER)
>>>> +                ttu_flags |= TTU_SPLIT_HUGE_PMD;
>>>> +
>>>>          /*
>>>>           * Anon pages need migration entries to preserve them, but file
>>>>           * pages can simply be left unmapped, then faulted back on demand.
>>>> @@ -2254,9 +2256,9 @@ static void unmap_page(struct page *page)
>>>>          VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
>>>>  }
>>>>
>>>> -static void remap_page(struct folio *folio, unsigned long nr)
>>>> +static void remap_page(struct folio *folio, unsigned short nr)
>>>>  {
>>>> -        int i = 0;
>>>> +        unsigned int i;
>>>>
>>>>          /* If unmap_page() uses try_to_migrate() on file, remove this check */
>>>>          if (!folio_test_anon(folio))
>>>> @@ -2274,7 +2276,6 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
>>>>                  struct lruvec *lruvec, struct list_head *list)
>>>>  {
>>>>          VM_BUG_ON_PAGE(!PageHead(head), head);
>>>> -        VM_BUG_ON_PAGE(PageCompound(tail), head);
>>>>          VM_BUG_ON_PAGE(PageLRU(tail), head);
>>>>          lockdep_assert_held(&lruvec->lru_lock);
>>>>
>>>> @@ -2295,9 +2296,10 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
>>>>  }
>>>>
>>>>  static void __split_huge_page_tail(struct page *head, int tail,
>>>> -                struct lruvec *lruvec, struct list_head *list)
>>>> +                struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
>>>>  {
>>>>          struct page *page_tail = head + tail;
>>>> +        unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
>>>>
>>>>          VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
>>>>
>>>> @@ -2321,6 +2323,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>>>  #ifdef CONFIG_64BIT
>>>>                           (1L << PG_arch_2) |
>>>>  #endif
>>>> +                         compound_head_flag |
>>>>                           (1L << PG_dirty)));
>>>>
>>>>          /* ->mapping in first tail page is compound_mapcount */
>>>> @@ -2329,7 +2332,10 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>>>          page_tail->mapping = head->mapping;
>>>>          page_tail->index = head->index + tail;
>>>>
>>>> -        /* Page flags must be visible before we make the page non-compound. */
>>>> +        /*
>>>> +         * Page flags must be visible before we make the page non-compound or
>>>> +         * a compound page in new_order.
>>>> +         */
>>>>          smp_wmb();
>>>>
>>>>          /*
>>>> @@ -2339,10 +2345,15 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>>>           * which needs correct compound_head().
>>>>           */
>>>>          clear_compound_head(page_tail);
>>>> +        if (new_order) {
>>>> +                prep_compound_page(page_tail, new_order);
>>>> +                prep_transhuge_page(page_tail);
>>>> +        }
>>>
>>> Many thanks for your series. It looks really good. One question:
>>> IIUC, it seems there is an assumption that LRU compound pages should
>>> be PageTransHuge. So PageTransHuge just checks PageHead:
>>>
>>> static inline int PageTransHuge(struct page *page)
>>> {
>>>         VM_BUG_ON_PAGE(PageTail(page), page);
>>>         return PageHead(page);
>>> }
>>>
>>> So LRU pages with any order (> 0) might be wrongly treated as THPs which
>>> have order = HPAGE_PMD_ORDER. We should ensure thp_nr_pages is used instead
>>> of hard-coded HPAGE_PMD_ORDER.
>>>
>>> Look at the below code snippet:
>>> mm/mempolicy.c:
>>> static struct page *new_page(struct page *page, unsigned long start)
>>> {
>>> ...
>>>         } else if (PageTransHuge(page)) {
>>>                 struct page *thp;
>>>
>>>                 thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
>>>                                          HPAGE_PMD_ORDER);
>>>                                          ^^^^^^^^^^^^^^^^
>>>                 if (!thp)
>>>                         return NULL;
>>>                 prep_transhuge_page(thp);
>>>                 return thp;
>>>         }
>>> ...
>>> }
>>>
>>> HPAGE_PMD_ORDER is used instead of thp_nr_pages. So the lower order pages
>>> might be used as if their order were HPAGE_PMD_ORDER. All such usages might
>>> need to be fixed. Or am I missing something?
>>>
>>> Thanks again for your work. :)
>>
>> THP will still only have HPAGE_PMD_ORDER and will not be split into any order
>> other than 0. This series only allows splitting a huge page cache folio (added by
>> Matthew) into any lower order. I have an explicit VM_BUG_ON() to ensure new_order
>> is only 0 when a non page cache page is the input. There is still a non-trivial
>> amount of work needed to add any order THP support in the kernel; IIRC, Yu Zhao
>> (cc'd) was planning to work on that.
>>
>
> Many thanks for clarifying. I'm sorry but I haven't followed Matthew's patches.
> I am wondering: could a huge page cache folio be treated as a THP? If so, how do
> we ensure the correctness of the huge page cache?

You are right. All these HPAGE_PMD_ORDER locations should be replaced by
thp_nr_pages(). I will look into it.

Thanks a lot.

--
Best Regards,
Yan, Zi
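
For illustration, the kind of change discussed above for new_page() in mm/mempolicy.c
could look roughly like the sketch below. This is only a sketch, not part of the posted
patch; it uses thp_order() (the order counterpart of thp_nr_pages()) because
alloc_hugepage_vma() takes an order rather than a page count:

        } else if (PageTransHuge(page)) {
                struct page *thp;
                /*
                 * Allocate at the order of the page being migrated instead of a
                 * hard-coded HPAGE_PMD_ORDER, so a lower-order huge page cache
                 * folio gets a same-sized replacement.
                 */
                unsigned int order = thp_order(page);

                thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address, order);
                if (!thp)
                        return NULL;
                prep_transhuge_page(thp);
                return thp;
        }

Any other place that hard-codes HPAGE_PMD_ORDER or HPAGE_PMD_NR for a page that may
now be a lower-order compound page would need a similar audit.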
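
For context on the new interface itself, a hypothetical caller might look like the
sketch below (the helper name and the order-2 choice are made up for illustration;
as with split_huge_page_to_list(), the page must be locked and have a reference held):

/*
 * Split a page cache THP into order-2 pieces rather than all the way down to
 * base pages; anonymous THPs can still only be split to order 0.
 * Returns 0 on success, like split_huge_page_to_list().
 */
static int split_thp_to_order2(struct page *page)
{
        if (PageAnon(page))
                return split_huge_page_to_list_to_order(page, NULL, 0);

        return split_huge_page_to_list_to_order(page, NULL, 2);
}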