On 22 Mar 2023, at 3:55, Ryan Roberts wrote:

> Hi,
>
> I'm working to enable large, variable-order folios for anonymous memory (see
> RFC, replete with bugs, at [1]). This patch set is going to be very useful to
> me. But I have a few questions below that I hope you can answer; I wonder if
> they might relate to the bugs I'm seeing at [1].
>
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@xxxxxxx/
>
>
> On 21/03/2023 00:48, Zi Yan wrote:
>> From: Zi Yan <ziy@xxxxxxxxxx>
>>
>> To split a THP to pages of any lower order, we need to re-form the THPs
>> on the subpages at the given order and add page refcounts based on the
>> new page order. We also need to reinitialize page_deferred_list after
>> removing the page from the split_queue; otherwise a subsequent split
>> will see list corruption when checking page_deferred_list again.
>>
>> This has many uses, like minimizing the number of pages after
>> truncating a huge pagecache page. For anonymous THPs, we can only split
>> them to order-0 like before, until we add support for any-size
>> anonymous THPs.
>>
>> Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
>> ---
>>  include/linux/huge_mm.h |  10 ++--
>>  mm/huge_memory.c        | 103 +++++++++++++++++++++++++++++-----------
>>  2 files changed, 82 insertions(+), 31 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 20284387b841..32c91e1b59cd 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -147,10 +147,11 @@ void prep_transhuge_page(struct page *page);
>>  void free_transhuge_page(struct page *page);
>>
>>  bool can_split_folio(struct folio *folio, int *pextra_pins);
>> -int split_huge_page_to_list(struct page *page, struct list_head *list);
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order);
>>  static inline int split_huge_page(struct page *page)
>>  {
>> -	return split_huge_page_to_list(page, NULL);
>> +	return split_huge_page_to_list_to_order(page, NULL, 0);
>>  }
>>  void deferred_split_folio(struct folio *folio);
>>
>> @@ -297,7 +298,8 @@ can_split_folio(struct folio *folio, int *pextra_pins)
>>  	return false;
>>  }
>>  static inline int
>> -split_huge_page_to_list(struct page *page, struct list_head *list)
>> +split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order)
>>  {
>>  	return 0;
>>  }
>> @@ -397,7 +399,7 @@ static inline bool thp_migration_supported(void)
>>  static inline int split_folio_to_list(struct folio *folio,
>>  		struct list_head *list)
>>  {
>> -	return split_huge_page_to_list(&folio->page, list);
>> +	return split_huge_page_to_list_to_order(&folio->page, list, 0);
>>  }
>>
>>  static inline int split_folio(struct folio *folio)
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 710189885402..f119b9be33f2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2359,11 +2359,13 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
>>
>>  static void unmap_folio(struct folio *folio)
>>  {
>> -	enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
>> -		TTU_SYNC;
>> +	enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SYNC;
>>
>>  	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>
>> +	if (folio_order(folio) >= HPAGE_PMD_ORDER)
>> +		ttu_flags |= TTU_SPLIT_HUGE_PMD;
>> +
>
> Why have you changed the code so that this flag is added conditionally on the
> folio being large enough? I've previously looked at this in the context of my
> bug, and concluded that the consumer would ignore the flag if the folio wasn't
> PMD mapped. Did I conclude incorrectly?

If the folio order is smaller than the PMD order, there is no way the folio
can be mapped by a PMD, so TTU_SPLIT_HUGE_PMD does not make sense for it. You
are right that the consumer will not split any PMD, but it will still take
page table locks and do mmu notifier work, which is unnecessary overhead. I
think I had better change the condition to folio_test_pmd_mappable().
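Something like this is what I have in mind (an untested sketch;
folio_test_pmd_mappable() is just a folio_order(folio) >= HPAGE_PMD_ORDER
check, so the behavior is the same as the hunk above, only more
self-describing):

static void unmap_folio(struct folio *folio)
{
        enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SYNC;

        VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);

        /*
         * Only a PMD-mappable folio can be PMD-mapped, so only such a
         * folio needs TTU_SPLIT_HUGE_PMD; for anything smaller the flag
         * would just add page table lock and mmu notifier overhead.
         */
        if (folio_test_pmd_mappable(folio))
                ttu_flags |= TTU_SPLIT_HUGE_PMD;

        /* ... rest of the function as before ... */
}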
>
>>  	/*
>>  	 * Anon pages need migration entries to preserve them, but file
>>  	 * pages can simply be left unmapped, then faulted back on demand.
>> @@ -2395,7 +2397,6 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
>>  		struct lruvec *lruvec, struct list_head *list)
>>  {
>>  	VM_BUG_ON_PAGE(!PageHead(head), head);
>> -	VM_BUG_ON_PAGE(PageCompound(tail), head);
>>  	VM_BUG_ON_PAGE(PageLRU(tail), head);
>>  	lockdep_assert_held(&lruvec->lru_lock);
>>
>> @@ -2416,9 +2417,10 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
>>  	}
>>
>
> [...]
>
>> -int split_huge_page_to_list(struct page *page, struct list_head *list)
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order)
>>  {
>>  	struct folio *folio = page_folio(page);
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> -	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> +	/* reset xarray order to new order after split */
>> +	XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
>>  	struct anon_vma *anon_vma = NULL;
>>  	struct address_space *mapping = NULL;
>>  	int extra_pins, ret;
>> @@ -2649,6 +2676,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>  	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>
>> +	/* Cannot split THP to order-1 (no order-1 THPs) */
>> +	if (new_order == 1) {
>> +		VM_WARN_ONCE(1, "Cannot split to order-1 folio");
>> +		return -EINVAL;
>> +	}
>
> Why can't you split to order-1? I vaguely understand that some data is kept in
> the first 3 struct pages, but I would naively expect the allocator to fail to
> allocate compound pages of order-1 if it was a problem? My large anon folios
> patch is currently allocating order-1 in some circumstances. Perhaps it's
> related to my bug?

Yes, some data is kept in the first 3 struct pages, so an order-1 THP is not
possible. The page allocator does not know about this restriction and will
still allocate an order-1 page, so that might be related to your bug. You can
have order-1 compound pages, but that does not mean you can use them for
THPs. AFAIK, slab uses order-1 compound pages, but it does not store slab
information in the 3rd struct page. Basically, the page allocator can
allocate an order-N page, and it can be:

1. 2^N consecutive physical pages (not a compound page),
2. an order-N compound page,
3. an order-N THP (also an order-N compound page),
4. an order-N hugetlb page (also an order-N compound page).

For a THP and a hugetlb page, prep_transhuge_page() and
prep_new_hugetlb_folio() are called, respectively, after the page is
allocated. That makes them kind of subclasses of a compound page.
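To make the distinction concrete, a rough sketch (from memory, not copied
from the tree, so the details may differ slightly from the current code):

        struct page *page;

        /* Case 2 above: just an order-N compound page. */
        page = alloc_pages(GFP_KERNEL | __GFP_COMP, order);

        /* Case 3 above: the same kind of allocation turned into a THP. */
        page = alloc_pages(GFP_TRANSHUGE, HPAGE_PMD_ORDER);
        if (page)
                /*
                 * Sets the THP destructor and initializes
                 * folio->_deferred_list, which lives in the third struct
                 * page. An order-1 page only has two struct pages, which
                 * is why an order-1 THP cannot exist.
                 */
                prep_transhuge_page(page);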
>
>> +
>> +	/* Split anonymous folio to non-zero order not supported */
>> +	if (folio_test_anon(folio) && new_order) {
>> +		VM_WARN_ONCE(1, "Split anon folio to non-0 order not supported");
>> +		return -EINVAL;
>> +	}
>
> Why don't you support this? What is special about anon folios that means this
> code doesn't work for them?

Nothing is special about the folios themselves: the split_huge_page() code
here can split an anon folio to a non-0 order, but the rest of the mm code
might not have proper support for the resulting folios yet. That is why we
need your patchset. :)
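Just to illustrate the new interface (hypothetical usage, nothing in-tree
calls it like this yet): once the anon support exists, a caller could ask
for order-2 pieces instead of being limited to order-0, e.g.:

        /* Hypothetical: split a large anon folio into order-2 folios. */
        int err = split_huge_page_to_list_to_order(&folio->page, NULL, 2);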
>
>> +
>>  	is_hzp = is_huge_zero_page(&folio->page);
>>  	VM_WARN_ON_ONCE_FOLIO(is_hzp, folio);
>>  	if (is_hzp)
>> @@ -2744,7 +2783,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  	if (folio_ref_freeze(folio, 1 + extra_pins)) {
>>  		if (!list_empty(&folio->_deferred_list)) {
>>  			ds_queue->split_queue_len--;
>> -			list_del(&folio->_deferred_list);
>> +			/*
>> +			 * Reinitialize page_deferred_list after removing the
>> +			 * page from the split_queue, otherwise a subsequent
>> +			 * split will see list corruption when checking the
>> +			 * page_deferred_list.
>> +			 */
>> +			list_del_init(&folio->_deferred_list);
>>  		}
>>  		spin_unlock(&ds_queue->split_queue_lock);
>>  		if (mapping) {
>> @@ -2754,14 +2799,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  			if (folio_test_swapbacked(folio)) {
>>  				__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS,
>>  							-nr);
>> -			} else {
>> +			} else if (!new_order) {
>> +				/*
>> +				 * Decrease THP stats only if split to normal
>> +				 * pages
>> +				 */
>>  				__lruvec_stat_mod_folio(folio, NR_FILE_THPS,
>>  							-nr);
>>  				filemap_nr_thps_dec(mapping);
>>  			}
>>  		}
>>
>> -		__split_huge_page(page, list, end);
>> +		__split_huge_page(page, list, end, new_order);
>>  		ret = 0;
>>  	} else {
>>  		spin_unlock(&ds_queue->split_queue_lock);

--
Best Regards,
Yan, Zi