On 09/08/2023 09:32, David Hildenbrand wrote:
> Let's track the total mapcount for all large folios in the first subpage.
> 
> The total mapcount is what we actually want to know in folio_mapcount()
> and it is also sufficient for implementing folio_mapped(). This also
> gets rid of any "raciness" concerns as expressed in
> folio_total_mapcount().
> 
> With sub-PMD THP becoming more important and things looking promising
> that we will soon get support for such anon THP, we want to avoid looping
> over all pages of a folio just to calculate the total mapcount. Further,
> we may soon want to use the total mapcount in other contexts more
> frequently, so prepare for reading it efficiently and atomically.
> 
> Make room for the total mapcount in page[1] of the folio by moving
> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb
> -- and with the total mapcount in place probably also not desirable even
> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we
> can overlay the hugetlb fields.
> 
> Note that we currently don't expect any order-1 compound pages / THP in
> rmap code, and that such support is not planned. If ever desired, we could
> move the compound mapcount to another page, because it only applies to
> PMD-mappable folios that are definitely larger. Let's avoid consuming
> more space elsewhere for now -- we might need more space for a different
> purpose in some subpages soon.
> 
> Maintain the total mapcount also for hugetlb pages. Use the total mapcount
> to implement folio_mapcount(), total_mapcount(), folio_mapped() and
> page_mapped().
> 
> We can now get rid of folio_total_mapcount() and
> folio_large_is_mapped(), by just inlining reading of the total mapcount.
> 
> _nr_pages_mapped is now only used in rmap code, so it cannot accidentally
> be used externally, where it might end up on arbitrary order-1 pages. The
> remaining usage is:
> 
> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED
>     -> If we were to account the total folio as mapped when mapping a
>        page (based on the total mapcount), we could remove that usage.
> 
> (2) Detect when to add a folio to the deferred split queue
>     -> If we were to apply a different heuristic, or scan using the rmap
>        on the memory reclaim path for partially mapped anon folios to
>        split them, we could remove that usage as well.
> 
> So maybe we can simplify things in the future and remove
> _nr_pages_mapped. For now, leave these things as they are; they need more
> thought. Hugh really did a nice job implementing that precise tracking
> after all.
> 
> Note: Not adding order-1 sanity checks to the file_rmap functions for
> now. For anon pages, they are certainly not required right now.
> 
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Jonathan Corbet <corbet@xxxxxxx>
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx>
> Cc: Ryan Roberts <ryan.roberts@xxxxxxx>
> Cc: Yin Fengwei <fengwei.yin@xxxxxxxxx>
> Cc: Yang Shi <shy828301@xxxxxxxxx>
> Cc: Zi Yan <ziy@xxxxxxxxxx>
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>

Other than the nits and query on zeroing _total_mapcount below, LGTM.
If zeroing is correct:

Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>

> ---
>  Documentation/mm/transhuge.rst | 12 +++++-----
>  include/linux/mm.h             | 31 ++++++------------------
>  include/linux/mm_types.h       |  6 +++--
>  include/linux/rmap.h           |  7 ++++--
>  mm/debug.c                     |  4 ++--
>  mm/huge_memory.c               |  2 ++
>  mm/hugetlb.c                   |  4 ++--
>  mm/internal.h                  | 17 ++++---------
>  mm/page_alloc.c                |  4 ++--
>  mm/rmap.c                      | 44 ++++++++++++++--------------------
>  10 files changed, 52 insertions(+), 79 deletions(-)
>
> diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
> index 9a607059ea11..b0d3b1d3e8ea 100644
> --- a/Documentation/mm/transhuge.rst
> +++ b/Documentation/mm/transhuge.rst
> @@ -116,14 +116,14 @@ pages:
>      succeeds on tail pages.
>
>    - map/unmap of a PMD entry for the whole THP increment/decrement
> -    folio->_entire_mapcount and also increment/decrement
> -    folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount
> -    goes from -1 to 0 or 0 to -1.
> +    folio->_entire_mapcount, increment/decrement folio->_total_mapcount
> +    and also increment/decrement folio->_nr_pages_mapped by COMPOUND_MAPPED
> +    when _entire_mapcount goes from -1 to 0 or 0 to -1.
>
>    - map/unmap of individual pages with PTE entry increment/decrement
> -    page->_mapcount and also increment/decrement folio->_nr_pages_mapped
> -    when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts
> -    the number of pages mapped by PTE.
> +    page->_mapcount, increment/decrement folio->_total_mapcount and also
> +    increment/decrement folio->_nr_pages_mapped when page->_mapcount goes
> +    from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE.
>
>  split_huge_page internally has to distribute the refcounts in the head
>  page to the tail pages before clearing all PG_head/tail bits from the page
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6a95dfed4957..30ac004fa0ef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1187,41 +1187,24 @@ static inline int page_mapcount(struct page *page)
>  	return mapcount;
>  }
>
> -int folio_total_mapcount(struct folio *folio);
> -
>  /**
> - * folio_mapcount() - Calculate the number of mappings of this folio.
> + * folio_mapcount() - Return the total number of mappings of this folio.
>   * @folio: The folio.
>   *
> - * A large folio tracks both how many times the entire folio is mapped,
> - * and how many times each individual page in the folio is mapped.
> - * This function calculates the total number of times the folio is
> - * mapped.
> - *
> - * Return: The number of times this folio is mapped.
> + * Return: The total number of times this folio is mapped.
>   */
>  static inline int folio_mapcount(struct folio *folio)
>  {
>  	if (likely(!folio_test_large(folio)))
>  		return atomic_read(&folio->_mapcount) + 1;
> -	return folio_total_mapcount(folio);
> +	return atomic_read(&folio->_total_mapcount) + 1;
>  }
>
>  static inline int total_mapcount(struct page *page)

nit: couldn't total_mapcount() just be implemented as a wrapper around
folio_mapcount()? What's the benefit of the PageCompound() check over just
getting the folio and checking if it's large?
i.e.:

static inline int total_mapcount(struct page *page)
{
	return folio_mapcount(page_folio(page));
}

>  {
>  	if (likely(!PageCompound(page)))
>  		return atomic_read(&page->_mapcount) + 1;
> -	return folio_total_mapcount(page_folio(page));
> -}
> -
> -static inline bool folio_large_is_mapped(struct folio *folio)
> -{
> -	/*
> -	 * Reading _entire_mapcount below could be omitted if hugetlb
> -	 * participated in incrementing nr_pages_mapped when compound mapped.
> -	 */
> -	return atomic_read(&folio->_nr_pages_mapped) > 0 ||
> -	       atomic_read(&folio->_entire_mapcount) >= 0;
> +	return atomic_read(&page_folio(page)->_total_mapcount) + 1;
>  }
>
>  /**
> @@ -1234,7 +1217,7 @@ static inline bool folio_mapped(struct folio *folio)
>  {
>  	if (likely(!folio_test_large(folio)))
>  		return atomic_read(&folio->_mapcount) >= 0;
> -	return folio_large_is_mapped(folio);
> +	return atomic_read(&folio->_total_mapcount) >= 0;
>  }
>
>  /*
> @@ -1246,7 +1229,7 @@ static inline bool page_mapped(struct page *page)
>  {
>  	if (likely(!PageCompound(page)))
>  		return atomic_read(&page->_mapcount) >= 0;
> -	return folio_large_is_mapped(page_folio(page));
> +	return atomic_read(&page_folio(page)->_total_mapcount) >= 0;

nit: same here; couldn't this be a wrapper around folio_mapped()? (Sketch
further down, after the next few quoted hunks.)

>  }
>
>  static inline struct page *virt_to_head_page(const void *x)
> @@ -2148,7 +2131,7 @@ static inline size_t folio_size(struct folio *folio)
>   * looking at the precise mapcount of the first subpage in the folio, and
>   * assuming the other subpages are the same. This may not be true for large
>   * folios. If you want exact mapcounts for exact calculations, look at
> - * page_mapcount() or folio_total_mapcount().
> + * page_mapcount() or folio_mapcount().
>   *
>   * Return: The estimated number of processes sharing a folio.
>   */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 291c05cacd48..16e66b3332c6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -267,7 +267,8 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page)
>   * @_folio_dtor: Which destructor to use for this folio.
>   * @_folio_order: Do not use directly, call folio_order().
>   * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
> - * @_nr_pages_mapped: Do not use directly, call folio_mapcount().
> + * @_total_mapcount: Do not use directly, call folio_mapcount().
> + * @_nr_pages_mapped: Do not use outside of rmap code (and not for hugetlb).
>   * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
>   * @_folio_nr_pages: Do not use directly, call folio_nr_pages().
>   * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
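
For the page_mapped() nit above: all I have in mind is the same one-liner
pattern as the total_mapcount() suggestion -- an untested sketch, not a
request for this exact shape:

static inline bool page_mapped(struct page *page)
{
	/* folio_mapped() already has the small-folio fast path */
	return folio_mapped(page_folio(page));
}

Since folio_mapped() already does the folio_test_large() check, I don't
think the PageCompound() fast path is needed here either.
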
> @@ -321,7 +322,7 @@ struct folio {
>  			unsigned char _folio_dtor;
>  			unsigned char _folio_order;
>  			atomic_t _entire_mapcount;
> -			atomic_t _nr_pages_mapped;
> +			atomic_t _total_mapcount;
>  			atomic_t _pincount;
>  #ifdef CONFIG_64BIT
>  			unsigned int _folio_nr_pages;
> @@ -346,6 +347,7 @@ struct folio {
>  			unsigned long _head_2a;
>  	/* public: */
>  			struct list_head _deferred_list;
> +			atomic_t _nr_pages_mapped;
>  	/* private: the union with struct page is transitional */
>  		};
>  		struct page __page_2;
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index a3825ce81102..a7b1c005e0c9 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -210,14 +210,17 @@ void hugepage_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>
>  static inline void __page_dup_rmap(struct page *page, bool compound)
>  {
> -	if (compound) {
> -		struct folio *folio = (struct folio *)page;
> +	struct folio *folio = page_folio(page);
>
> +	if (compound) {
>  		VM_BUG_ON_PAGE(compound && !PageHead(page), page);
>  		atomic_inc(&folio->_entire_mapcount);
>  	} else {
>  		atomic_inc(&page->_mapcount);
>  	}
> +
> +	if (folio_test_large(folio))
> +		atomic_inc(&folio->_total_mapcount);
>  }
>
>  static inline void page_dup_file_rmap(struct page *page, bool compound)
> diff --git a/mm/debug.c b/mm/debug.c
> index ee533a5ceb79..97f6f6b32ae7 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -99,10 +99,10 @@ static void __dump_page(struct page *page)
>  			page, page_ref_count(head), mapcount, mapping,
>  			page_to_pgoff(page), page_to_pfn(page));
>  	if (compound) {
> -		pr_warn("head:%p order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n",
> +		pr_warn("head:%p order:%u entire_mapcount:%d total_mapcount:%d pincount:%d\n",
>  				head, compound_order(head),
>  				folio_entire_mapcount(folio),
> -				folio_nr_pages_mapped(folio),
> +				folio_mapcount(folio),
>  				atomic_read(&folio->_pincount));
>  	}
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 154c210892a1..43a2150e41e3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -583,6 +583,7 @@ void prep_transhuge_page(struct page *page)
>
>  	VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
>  	INIT_LIST_HEAD(&folio->_deferred_list);
> +	atomic_set(&folio->_nr_pages_mapped, 0);
>  	folio_set_compound_dtor(folio, TRANSHUGE_PAGE_DTOR);
>  }
>
> @@ -2795,6 +2796,7 @@ void free_transhuge_page(struct page *page)
>  		}
>  		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  	}
> +	VM_WARN_ON_ONCE(atomic_read(&folio->_nr_pages_mapped));
>  	free_compound_page(page);
>  }
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5f498e8025cc..6a614c559ccf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
>  	struct page *p;
>
>  	atomic_set(&folio->_entire_mapcount, 0);
> -	atomic_set(&folio->_nr_pages_mapped, 0);
> +	atomic_set(&folio->_total_mapcount, 0);

Just checking this is definitely what you intended? _total_mapcount is -1
when it means "no pages mapped", so 0 means 1 page mapped?
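
For comparison, __prep_compound_gigantic_folio() and prep_compound_head()
below initialise the field to -1, so if "no pages mapped" is what is meant
here I would have expected something like the following (just my guess at
the intent, not a claim that the 0 is wrong):

	atomic_set(&folio->_total_mapcount, -1);	/* no pages mapped */
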
>  	atomic_set(&folio->_pincount, 0);
>
>  	for (i = 1; i < nr_pages; i++) {
> @@ -2027,7 +2027,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>  	/* we rely on prep_new_hugetlb_folio to set the destructor */
>  	folio_set_order(folio, order);
>  	atomic_set(&folio->_entire_mapcount, -1);
> -	atomic_set(&folio->_nr_pages_mapped, 0);
> +	atomic_set(&folio->_total_mapcount, -1);
>  	atomic_set(&folio->_pincount, 0);
>  	return true;
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 02a6fd36f46e..e4646fed44a5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -53,10 +53,10 @@ struct folio_batch;
>  void page_writeback_init(void);
>
>  /*
> - * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
> + * If a 16GB folio were mapped by PTEs of all of its 4kB pages,
>   * its nr_pages_mapped would be 0x400000: choose the COMPOUND_MAPPED bit
> - * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE). Hugetlb currently
> - * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
> + * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE). Not applicable to
> + * hugetlb.
>   */
>  #define COMPOUND_MAPPED		0x800000
>  #define FOLIO_PAGES_MAPPED	(COMPOUND_MAPPED - 1)
> @@ -67,15 +67,6 @@ void page_writeback_init(void);
>   */
>  #define SHOW_MEM_FILTER_NODES	(0x0001u)	/* disallowed nodes */
>
> -/*
> - * How many individual pages have an elevated _mapcount. Excludes
> - * the folio's entire_mapcount.
> - */
> -static inline int folio_nr_pages_mapped(struct folio *folio)
> -{
> -	return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
> -}
> -
>  static inline void *folio_raw_mapping(struct folio *folio)
>  {
>  	unsigned long mapping = (unsigned long)folio->mapping;
> @@ -420,7 +411,7 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>  	folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
>  	folio_set_order(folio, order);
>  	atomic_set(&folio->_entire_mapcount, -1);
> -	atomic_set(&folio->_nr_pages_mapped, 0);
> +	atomic_set(&folio->_total_mapcount, -1);
>  	atomic_set(&folio->_pincount, 0);
>  }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 96b7c1a7d1f2..5bbd5782b543 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1009,8 +1009,8 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>  		bad_page(page, "nonzero entire_mapcount");
>  		goto out;
>  	}
> -	if (unlikely(atomic_read(&folio->_nr_pages_mapped))) {
> -		bad_page(page, "nonzero nr_pages_mapped");
> +	if (unlikely(atomic_read(&folio->_total_mapcount) + 1)) {
> +		bad_page(page, "nonzero total_mapcount");
>  		goto out;
>  	}
>  	if (unlikely(atomic_read(&folio->_pincount))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1f04debdc87a..a7dcae48245b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1070,29 +1070,6 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
>  	return page_vma_mkclean_one(&pvmw);
>  }
>
> -int folio_total_mapcount(struct folio *folio)
> -{
> -	int mapcount = folio_entire_mapcount(folio);
> -	int nr_pages;
> -	int i;
> -
> -	/* In the common case, avoid the loop when no pages mapped by PTE */
> -	if (folio_nr_pages_mapped(folio) == 0)
> -		return mapcount;
> -	/*
> -	 * Add all the PTE mappings of those pages mapped by PTE.
> -	 * Limit the loop to folio_nr_pages_mapped()?
> -	 * Perhaps: given all the raciness, that may be a good or a bad idea.
> -	 */
> -	nr_pages = folio_nr_pages(folio);
> -	for (i = 0; i < nr_pages; i++)
> -		mapcount += atomic_read(&folio_page(folio, i)->_mapcount);
> -
> -	/* But each of those _mapcounts was based on -1 */
> -	mapcount += nr_pages;
> -	return mapcount;
> -}
> -
>  /**
>   * page_move_anon_rmap - move a page to our anon_vma
>   * @page:	the page to move to our anon_vma
> @@ -1236,6 +1213,9 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>  		}
>  	}
>
> +	if (folio_test_large(folio))
> +		atomic_inc(&folio->_total_mapcount);
> +
>  	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
>  	VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
>
> @@ -1289,6 +1269,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>  		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>  	}
>
> +	if (folio_test_large(folio))
> +		/* increment count (starts at -1) */
> +		atomic_set(&folio->_total_mapcount, 0);
> +
>  	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>  	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>  }
> @@ -1310,14 +1294,14 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page,
>  		bool compound)
>  {
>  	atomic_t *mapped = &folio->_nr_pages_mapped;
> -	unsigned int nr_pmdmapped = 0, first;
> +	unsigned int nr_pmdmapped = 0, i, first;
>  	int nr = 0;
>
>  	VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
>
>  	/* Is page being mapped by PTE? Is this its first map to be added? */
>  	if (likely(!compound)) {
> -		do {
> +		for (i = 0; i < nr_pages; i++, page++) {
>  			first = atomic_inc_and_test(&page->_mapcount);
>  			if (first && folio_test_large(folio)) {
>  				first = atomic_inc_return_relaxed(mapped);
> @@ -1326,7 +1310,7 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page,
>
>  			if (first)
>  				nr++;
> -		} while (page++, --nr_pages > 0);
> +		}
>  	} else if (folio_test_pmd_mappable(folio)) {
>  		/* That test is redundant: it's for safety or to optimize out */
>
> @@ -1346,6 +1330,9 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page,
>  		}
>  	}
>
> +	if (folio_test_large(folio))
> +		atomic_add(compound ? 1 : nr_pages, &folio->_total_mapcount);
> +
>  	if (nr_pmdmapped)
>  		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
>  			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
> @@ -1398,6 +1385,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>
>  	VM_BUG_ON_PAGE(compound && !PageHead(page), page);
>
> +	if (folio_test_large(folio))
> +		atomic_dec(&folio->_total_mapcount);
> +
>  	/* Hugetlb pages are not counted in NR_*MAPPED */
>  	if (unlikely(folio_test_hugetlb(folio))) {
>  		/* hugetlb pages are always mapped with pmds */
> @@ -2538,6 +2528,7 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>  	BUG_ON(!anon_vma);
>  	/* address might be in next vma when migration races vma_merge */
>  	first = atomic_inc_and_test(&folio->_entire_mapcount);
> +	atomic_inc(&folio->_total_mapcount);
>  	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
>  	VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
>  	if (first)
> @@ -2551,6 +2542,7 @@ void hugepage_add_new_anon_rmap(struct folio *folio,
>  	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
>  	/* increment count (starts at -1) */
>  	atomic_set(&folio->_entire_mapcount, 0);
> +	atomic_set(&folio->_total_mapcount, 0);
>  	folio_clear_hugetlb_restore_reserve(folio);
>  	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>  }