On 09/08/2023 09:32, David Hildenbrand wrote:
> Let's track the total mapcount for all large folios in the first subpage.
> 
> The total mapcount is what we actually want to know in folio_mapcount()
> and it is also sufficient for implementing folio_mapped(). This also
> gets rid of any "raciness" concerns as expressed in
> folio_total_mapcount().
> 
> With sub-PMD THP becoming more important and things looking promising
> that we will soon get support for such anon THP, we want to avoid looping
> over all pages of a folio just to calculate the total mapcount. Further,
> we may soon want to use the total mapcount in other contexts more
> frequently, so prepare for reading it efficiently and atomically.
> 
> Make room for the total mapcount in page[1] of the folio by moving
> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb
> -- and with the total mapcount in place probably also not desirable even
> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we
> can overlay the hugetlb fields.
> 
> Note that we currently don't expect any order-1 compound pages / THP in
> rmap code, and that such support is not planned. If ever desired, we could
> move the compound mapcount to another page, because it only applies to
> PMD-mappable folios that are definitely larger. Let's avoid consuming
> more space elsewhere for now -- we might need more space for a different
> purpose in some subpages soon.
> 
> Maintain the total mapcount also for hugetlb pages. Use the total mapcount
> to implement folio_mapcount(), total_mapcount(), folio_mapped() and
> page_mapped().
> 
> We can now get rid of folio_total_mapcount() and
> folio_large_is_mapped(), by just inlining reading of the total mapcount.
> 
> _nr_pages_mapped is now only used in rmap code, so it cannot accidentally
> be used externally, where it might end up on arbitrary order-1 pages. The
> remaining usage is:
> 
> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED
>     -> If we were to account the total folio as mapped when mapping a
>        page (based on the total mapcount), we could remove that usage.
> 
> (2) Detect when to add a folio to the deferred split queue
>     -> If we were to apply a different heuristic, or scan using the rmap
>        on the memory reclaim path for partially mapped anon folios to
>        split them, we could remove that usage as well.
> 
> So maybe we can simplify things in the future and remove
> _nr_pages_mapped. For now, leave these things as they are; they need more
> thought. Hugh really did a nice job implementing that precise tracking
> after all.
> 
> Note: Not adding order-1 sanity checks to the file_rmap functions for
> now. For anon pages, they are certainly not required right now.
> 
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Jonathan Corbet <corbet@xxxxxxx>
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx>
> Cc: Ryan Roberts <ryan.roberts@xxxxxxx>
> Cc: Yin Fengwei <fengwei.yin@xxxxxxxxx>
> Cc: Yang Shi <shy828301@xxxxxxxxx>
> Cc: Zi Yan <ziy@xxxxxxxxxx>
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>

Other than the nits and query on zeroing _total_mapcount below, LGTM.
If zeroing is correct:

Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>

> ---
>  Documentation/mm/transhuge.rst | 12 +++++-----
>  include/linux/mm.h             | 31 ++++++------------------
>  include/linux/mm_types.h       |  6 +++--
>  include/linux/rmap.h           |  7 ++++--
>  mm/debug.c                     |  4 ++--
>  mm/huge_memory.c               |  2 ++
>  mm/hugetlb.c                   |  4 ++--
>  mm/internal.h                  | 17 ++++---------
>  mm/page_alloc.c                |  4 ++--
>  mm/rmap.c                      | 44 ++++++++++++++--------------------
>  10 files changed, 52 insertions(+), 79 deletions(-)
>
> diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
> index 9a607059ea11..b0d3b1d3e8ea 100644
> --- a/Documentation/mm/transhuge.rst
> +++ b/Documentation/mm/transhuge.rst
> @@ -116,14 +116,14 @@ pages:
>      succeeds on tail pages.
>
>    - map/unmap of a PMD entry for the whole THP increment/decrement
> -    folio->_entire_mapcount and also increment/decrement
> -    folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount
> -    goes from -1 to 0 or 0 to -1.
> +    folio->_entire_mapcount, increment/decrement folio->_total_mapcount
> +    and also increment/decrement folio->_nr_pages_mapped by COMPOUND_MAPPED
> +    when _entire_mapcount goes from -1 to 0 or 0 to -1.
>
>    - map/unmap of individual pages with PTE entry increment/decrement
> -    page->_mapcount and also increment/decrement folio->_nr_pages_mapped
> -    when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts
> -    the number of pages mapped by PTE.
> +    page->_mapcount, increment/decrement folio->_total_mapcount and also
> +    increment/decrement folio->_nr_pages_mapped when page->_mapcount goes
> +    from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE.
>
>  split_huge_page internally has to distribute the refcounts in the head
>  page to the tail pages before clearing all PG_head/tail bits from the page
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6a95dfed4957..30ac004fa0ef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1187,41 +1187,24 @@ static inline int page_mapcount(struct page *page)
>  	return mapcount;
>  }
>
> -int folio_total_mapcount(struct folio *folio);
> -
>  /**
> - * folio_mapcount() - Calculate the number of mappings of this folio.
> + * folio_mapcount() - Return the total number of mappings of this folio.
>   * @folio: The folio.
>   *
> - * A large folio tracks both how many times the entire folio is mapped,
> - * and how many times each individual page in the folio is mapped.
> - * This function calculates the total number of times the folio is
> - * mapped.
> - *
> - * Return: The number of times this folio is mapped.
> + * Return: The total number of times this folio is mapped.
>   */
>  static inline int folio_mapcount(struct folio *folio)
>  {
>  	if (likely(!folio_test_large(folio)))
>  		return atomic_read(&folio->_mapcount) + 1;
> -	return folio_total_mapcount(folio);
> +	return atomic_read(&folio->_total_mapcount) + 1;
>  }
>
>  static inline int total_mapcount(struct page *page)

nit: couldn't total_mapcount() just be implemented as a wrapper around
folio_mapcount()? What's the benefit of the PageCompound() check over just
getting the folio and checking if it's large?
i.e.:

static inline int total_mapcount(struct page *page)
{
	return folio_mapcount(page_folio(page));
}

>  {
>  	if (likely(!PageCompound(page)))
>  		return atomic_read(&page->_mapcount) + 1;
> -	return folio_total_mapcount(page_folio(page));
> -}
> -
> -static inline bool folio_large_is_mapped(struct folio *folio)
> -{
> -	/*
> -	 * Reading _entire_mapcount below could be omitted if hugetlb
> -	 * participated in incrementing nr_pages_mapped when compound mapped.
> -	 */
> -	return atomic_read(&folio->_nr_pages_mapped) > 0 ||
> -	       atomic_read(&folio->_entire_mapcount) >= 0;
> +	return atomic_read(&page_folio(page)->_total_mapcount) + 1;
>  }
>
>  /**
> @@ -1234,7 +1217,7 @@ static inline bool folio_mapped(struct folio *folio)
>  {
>  	if (likely(!folio_test_large(folio)))
>  		return atomic_read(&folio->_mapcount) >= 0;
> -	return folio_large_is_mapped(folio);
> +	return atomic_read(&folio->_total_mapcount) >= 0;
>  }
>
>  /*
> @@ -1246,7 +1229,7 @@ static inline bool page_mapped(struct page *page)
>  {
>  	if (likely(!PageCompound(page)))
>  		return atomic_read(&page->_mapcount) >= 0;
> -	return folio_large_is_mapped(page_folio(page));
> +	return atomic_read(&page_folio(page)->_total_mapcount) >= 0;

nit: same here; couldn't this be a wrapper around folio_mapped()? (Sketch
further down, after the next few quoted hunks.)

>  }
>
>  static inline struct page *virt_to_head_page(const void *x)
> @@ -2148,7 +2131,7 @@ static inline size_t folio_size(struct folio *folio)
>   * looking at the precise mapcount of the first subpage in the folio, and
>   * assuming the other subpages are the same. This may not be true for large
>   * folios. If you want exact mapcounts for exact calculations, look at
> - * page_mapcount() or folio_total_mapcount().
> + * page_mapcount() or folio_mapcount().
>   *
>   * Return: The estimated number of processes sharing a folio.
>   */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 291c05cacd48..16e66b3332c6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -267,7 +267,8 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page)
>   * @_folio_dtor: Which destructor to use for this folio.
>   * @_folio_order: Do not use directly, call folio_order().
>   * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
> - * @_nr_pages_mapped: Do not use directly, call folio_mapcount().
> + * @_total_mapcount: Do not use directly, call folio_mapcount().
> + * @_nr_pages_mapped: Do not use outside of rmap code (and not for hugetlb).
>   * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
>   * @_folio_nr_pages: Do not use directly, call folio_nr_pages().
>   * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
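
For the page_mapped() nit above: all I have in mind is the same one-liner
pattern as the total_mapcount() suggestion -- an untested sketch, not a
request for this exact shape:

static inline bool page_mapped(struct page *page)
{
	/* folio_mapped() already has the small-folio fast path */
	return folio_mapped(page_folio(page));
}

Since folio_mapped() already does the folio_test_large() check, I don't
think the PageCompound() fast path is needed here either.
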
> @@ -321,7 +322,7 @@ struct folio {
>  			unsigned char _folio_dtor;
>  			unsigned char _folio_order;
>  			atomic_t _entire_mapcount;
> -			atomic_t _nr_pages_mapped;
> +			atomic_t _total_mapcount;
>  			atomic_t _pincount;
>  #ifdef CONFIG_64BIT
>  			unsigned int _folio_nr_pages;
> @@ -346,6 +347,7 @@ struct folio {
>  			unsigned long _head_2a;
>  	/* public: */
>  			struct list_head _deferred_list;
> +			atomic_t _nr_pages_mapped;
>  	/* private: the union with struct page is transitional */
>  		};
>  		struct page __page_2;
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index a3825ce81102..a7b1c005e0c9 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -210,14 +210,17 @@ void hugepage_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>
>  static inline void __page_dup_rmap(struct page *page, bool compound)
>  {
> -	if (compound) {
> -		struct folio *folio = (struct folio *)page;
> +	struct folio *folio = page_folio(page);
>
> +	if (compound) {
>  		VM_BUG_ON_PAGE(compound && !PageHead(page), page);
>  		atomic_inc(&folio->_entire_mapcount);
>  	} else {
>  		atomic_inc(&page->_mapcount);
>  	}
> +
> +	if (folio_test_large(folio))
> +		atomic_inc(&folio->_total_mapcount);
>  }
>
>  static inline void page_dup_file_rmap(struct page *page, bool compound)
> diff --git a/mm/debug.c b/mm/debug.c
> index ee533a5ceb79..97f6f6b32ae7 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -99,10 +99,10 @@ static void __dump_page(struct page *page)
>  			page, page_ref_count(head), mapcount, mapping,
>  			page_to_pgoff(page), page_to_pfn(page));
>  	if (compound) {
> -		pr_warn("head:%p order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n",
> +		pr_warn("head:%p order:%u entire_mapcount:%d total_mapcount:%d pincount:%d\n",
>  				head, compound_order(head),
>  				folio_entire_mapcount(folio),
> -				folio_nr_pages_mapped(folio),
> +				folio_mapcount(folio),
>  				atomic_read(&folio->_pincount));
>  	}
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 154c210892a1..43a2150e41e3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -583,6 +583,7 @@ void prep_transhuge_page(struct page *page)
>
>  	VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
>  	INIT_LIST_HEAD(&folio->_deferred_list);
> +	atomic_set(&folio->_nr_pages_mapped, 0);
>  	folio_set_compound_dtor(folio, TRANSHUGE_PAGE_DTOR);
>  }
>
> @@ -2795,6 +2796,7 @@ void free_transhuge_page(struct page *page)
>  		}
>  		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  	}
> +	VM_WARN_ON_ONCE(atomic_read(&folio->_nr_pages_mapped));
>  	free_compound_page(page);
>  }
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5f498e8025cc..6a614c559ccf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
>  	struct page *p;
>
>  	atomic_set(&folio->_entire_mapcount, 0);
> -	atomic_set(&folio->_nr_pages_mapped, 0);
> +	atomic_set(&folio->_total_mapcount, 0);

Just checking this is definitely what you intended? _total_mapcount is -1
when it means "no pages mapped", so 0 means 1 page mapped?
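
For comparison, __prep_compound_gigantic_folio() and prep_compound_head()
below initialise the field to -1, so if "no pages mapped" is what is meant
here I would have expected something like the following (just my guess at
the intent, not a claim that the 0 is wrong):

	atomic_set(&folio->_total_mapcount, -1);	/* no pages mapped */
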
>  	atomic_set(&folio->_pincount, 0);
>
>  	for (i = 1; i < nr_pages; i++) {
> @@ -2027,7 +2027,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>  	/* we rely on prep_new_hugetlb_folio to set the destructor */
>  	folio_set_order(folio, order);
>  	atomic_set(&folio->_entire_mapcount, -1);
> -	atomic_set(&folio->_nr_pages_mapped, 0);
> +	atomic_set(&folio->_total_mapcount, -1);
>  	atomic_set(&folio->_pincount, 0);
>  	return true;
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 02a6fd36f46e..e4646fed44a5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -53,10 +53,10 @@ struct folio_batch;
>  void page_writeback_init(void);
>
>  /*
> - * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
> + * If a 16GB folio were mapped by PTEs of all of its 4kB pages,
>   * its nr_pages_mapped would be 0x400000: choose the COMPOUND_MAPPED bit
> - * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE). Hugetlb currently
> - * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
> + * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE). Not applicable to
> + * hugetlb.
>   */
>  #define COMPOUND_MAPPED		0x800000
>  #define FOLIO_PAGES_MAPPED	(COMPOUND_MAPPED - 1)
> @@ -67,15 +67,6 @@ void page_writeback_init(void);
>   */
>  #define SHOW_MEM_FILTER_NODES	(0x0001u)	/* disallowed nodes */
>
> -/*
> - * How many individual pages have an elevated _mapcount. Excludes
> - * the folio's entire_mapcount.
> - */
> -static inline int folio_nr_pages_mapped(struct folio *folio)
> -{
> -	return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
> -}
> -
>  static inline void *folio_raw_mapping(struct folio *folio)
>  {
>  	unsigned long mapping = (unsigned long)folio->mapping;
> @@ -420,7 +411,7 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>  	folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
>  	folio_set_order(folio, order);
>  	atomic_set(&folio->_entire_mapcount, -1);
> -	atomic_set(&folio->_nr_pages_mapped, 0);
> +	atomic_set(&folio->_total_mapcount, -1);
>  	atomic_set(&folio->_pincount, 0);
>  }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 96b7c1a7d1f2..5bbd5782b543 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1009,8 +1009,8 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>  		bad_page(page, "nonzero entire_mapcount");
>  		goto out;
>  	}
> -	if (unlikely(atomic_read(&folio->_nr_pages_mapped))) {
> -		bad_page(page, "nonzero nr_pages_mapped");
> +	if (unlikely(atomic_read(&folio->_total_mapcount) + 1)) {
> +		bad_page(page, "nonzero total_mapcount");
>  		goto out;
>  	}
>  	if (unlikely(atomic_read(&folio->_pincount))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1f04debdc87a..a7dcae48245b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1070,29 +1070,6 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
>  	return page_vma_mkclean_one(&pvmw);
>  }
>
> -int folio_total_mapcount(struct folio *folio)
> -{
> -	int mapcount = folio_entire_mapcount(folio);
> -	int nr_pages;
> -	int i;
> -
> -	/* In the common case, avoid the loop when no pages mapped by PTE */
> -	if (folio_nr_pages_mapped(folio) == 0)
> -		return mapcount;
> -	/*
> -	 * Add all the PTE mappings of those pages mapped by PTE.
> -	 * Limit the loop to folio_nr_pages_mapped()?
> -	 * Perhaps: given all the raciness, that may be a good or a bad idea.
> -	 */
> -	nr_pages = folio_nr_pages(folio);
> -	for (i = 0; i < nr_pages; i++)
> -		mapcount += atomic_read(&folio_page(folio, i)->_mapcount);
> -
> -	/* But each of those _mapcounts was based on -1 */
> -	mapcount += nr_pages;
> -	return mapcount;
> -}
> -
>  /**
>   * page_move_anon_rmap - move a page to our anon_vma
>   * @page:	the page to move to our anon_vma
> @@ -1236,6 +1213,9 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>  		}
>  	}
>
> +	if (folio_test_large(folio))
> +		atomic_inc(&folio->_total_mapcount);
> +
>  	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
>  	VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
>
> @@ -1289,6 +1269,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>  		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>  	}
>
> +	if (folio_test_large(folio))
> +		/* increment count (starts at -1) */
> +		atomic_set(&folio->_total_mapcount, 0);
> +
>  	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>  	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>  }
> @@ -1310,14 +1294,14 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page,
>  		bool compound)
>  {
>  	atomic_t *mapped = &folio->_nr_pages_mapped;
> -	unsigned int nr_pmdmapped = 0, first;
> +	unsigned int nr_pmdmapped = 0, i, first;
>  	int nr = 0;
>
>  	VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
>
>  	/* Is page being mapped by PTE? Is this its first map to be added? */
>  	if (likely(!compound)) {
> -		do {
> +		for (i = 0; i < nr_pages; i++, page++) {
>  			first = atomic_inc_and_test(&page->_mapcount);
>  			if (first && folio_test_large(folio)) {
>  				first = atomic_inc_return_relaxed(mapped);
> @@ -1326,7 +1310,7 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page,
>
>  			if (first)
>  				nr++;
> -		} while (page++, --nr_pages > 0);
> +		}
>  	} else if (folio_test_pmd_mappable(folio)) {
>  		/* That test is redundant: it's for safety or to optimize out */
>
> @@ -1346,6 +1330,9 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page,
>  		}
>  	}
>
> +	if (folio_test_large(folio))
> +		atomic_add(compound ? 1 : nr_pages, &folio->_total_mapcount);
> +
>  	if (nr_pmdmapped)
>  		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
>  			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
> @@ -1398,6 +1385,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>
>  	VM_BUG_ON_PAGE(compound && !PageHead(page), page);
>
> +	if (folio_test_large(folio))
> +		atomic_dec(&folio->_total_mapcount);
> +
>  	/* Hugetlb pages are not counted in NR_*MAPPED */
>  	if (unlikely(folio_test_hugetlb(folio))) {
>  		/* hugetlb pages are always mapped with pmds */
> @@ -2538,6 +2528,7 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>  	BUG_ON(!anon_vma);
>  	/* address might be in next vma when migration races vma_merge */
>  	first = atomic_inc_and_test(&folio->_entire_mapcount);
> +	atomic_inc(&folio->_total_mapcount);
>  	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
>  	VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
>  	if (first)
> @@ -2551,6 +2542,7 @@ void hugepage_add_new_anon_rmap(struct folio *folio,
>  	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
>  	/* increment count (starts at -1) */
>  	atomic_set(&folio->_entire_mapcount, 0);
> +	atomic_set(&folio->_total_mapcount, 0);
>  	folio_clear_hugetlb_restore_reserve(folio);
>  	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>  }