Re: [PATCH v2 16/20] fs/proc/page: remove per-page mapcount dependency for /proc/kpagecount (CONFIG_NO_PAGE_MAPCOUNT)

David Hildenbrand <david@xxxxxxxxxx> · Mon, 24 Feb 2025 22:15:40 +0100

On 24.02.25 22:10, Zi Yan wrote:
On 24 Feb 2025, at 16:02, David Hildenbrand wrote:

On 24.02.25 21:40, Zi Yan wrote:
On Mon Feb 24, 2025 at 11:55 AM EST, David Hildenbrand wrote:
Let's implement an alternative when per-page mapcounts in large folios
are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.

For large folios, we'll return the per-page average mapcount within the
folio, except when the average is 0 but the folio is mapped: then we
return 1.

For hugetlb folios and for large folios that are fully mapped
into all address spaces, there is no change.

As an alternative, we could simply return 0 for non-hugetlb large folios,
or disable this legacy interface with CONFIG_NO_PAGE_MAPCOUNT.

But the information exposed by this interface can still be valuable, and
frequently we deal with fully-mapped large folios where the average
corresponds to the actual page mapcount. So we'll leave it like this for
now and document the new behavior.

Note: this interface is likely not very relevant for performance. If
ever required, we could try doing a rather expensive rmap walk to collect
precisely how often this folio page is mapped.

Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---
   Documentation/admin-guide/mm/pagemap.rst |  7 +++++-
   fs/proc/internal.h                       | 31 ++++++++++++++++++++++++
   fs/proc/page.c                           | 19 ++++++++++++---
   3 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index caba0f52dd36c..49590306c61a0 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -42,7 +42,12 @@ There are four components to pagemap:
      skip over unmapped regions.
     * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
-   times each page is mapped, indexed by PFN.
+   times each page is mapped, indexed by PFN. Some kernel configurations do
+   not track the precise number of times a page part of a larger allocation
+   (e.g., THP) is mapped. In these configurations, the average number of
+   mappings per page in this larger allocation is returned instead. However,
+   if any page of the large allocation is mapped, the returned value will
+   be at least 1.
    The page-types tool in the tools/mm directory can be used to query the
   number of times a page is mapped.
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 1695509370b88..16aa1fd260771 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -174,6 +174,37 @@ static inline int folio_precise_page_mapcount(struct folio *folio,
   	return mapcount;
   }
  +/**
+ * folio_average_page_mapcount() - Average number of mappings per page in this
+ *				   folio
+ * @folio: The folio.
+ *
+ * The average number of present user page table entries that reference each
+ * page in this folio as tracked via the RMAP: either referenced directly
+ * (PTE) or as part of a larger area that covers this page (e.g., PMD).
+ *
+ * Returns: The average number of mappings per page in this folio. 0 for
+ * folios that are not mapped to user space or are not tracked via the RMAP
+ * (e.g., shared zeropage).
+ */
+static inline int folio_average_page_mapcount(struct folio *folio)
+{
+	int mapcount, entire_mapcount;
+	unsigned int adjust;
+
+	if (!folio_test_large(folio))
+		return atomic_read(&folio->_mapcount) + 1;
+
+	mapcount = folio_large_mapcount(folio);
+	entire_mapcount = folio_entire_mapcount(folio);
+	if (mapcount <= entire_mapcount)
+		return entire_mapcount;
+	mapcount -= entire_mapcount;
+
+	adjust = folio_large_nr_pages(folio) / 2;

Thanks for the review!


Is there any reason for choosing this adjust number? A comment might be
helpful in case people want to change it later, either with some reasoning
or just saying it is chosen empirically.

We're dividing by folio_large_nr_pages(folio) (shifting by folio_large_order(folio)), so this is not a magic number at all.

So this should be "ordinary" rounding.

I thought the rounding would be (mapcount + 511) / 512.

Yes, that's "rounding up".

But
that means if one subpage is mapped, the average will be 1.
Your rounding means if at least half of the subpages is mapped,
the average will be 1. Others might think 1/3 is mapped,
the average will be 1. That is why I think adjust looks like
a magic number.

I think all callers could tolerate (or benefit) from 
folio_average_page_mapcount() returning at least 1 in case any page is 
mapped.

There was a reason why I decided to round to the nearest integer instead.

Let me think about this once more, I went back and forth a couple of 
times on this.

--
Cheers,

David / dhildenb