[PATCH v1 12/17] mm: remove per-page mapcount dependency in folio_likely_mapped_shared() (CONFIG_NO_PAGE_MAPCOUNT)

David Hildenbrand <david@xxxxxxxxxx> · Thu, 29 Aug 2024 18:56:15 +0200

Let's remove the dependency on the mapcount of the first folio page in
large folios and consequently any "false negatives" from
folio_likely_mapped_shared().
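
To make the "false negative" concrete, here is a minimal userspace sketch of the first-page heuristic that this patch removes for CONFIG_NO_PAGE_MAPCOUNT. The struct toy_folio type and the toy_* helpers are hypothetical stand-ins, not kernel API. Mapcounts are stored unbiased here (0 means unmapped), so the kernel's "_mapcount > 0" check becomes ">= 2". Only the large-folio tail of the function is modeled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a large folio of up to 8 pages. */
struct toy_folio {
	int nr_pages;		/* pages in the folio */
	int entire_mapcount;	/* number of entire (PMD) mappings */
	int page_mapcount[8];	/* per-page PTE mappings, 0 == unmapped */
};

/* Total number of page table mappings of the folio. */
int toy_mapcount(const struct toy_folio *f)
{
	int total = f->entire_mapcount;

	for (int i = 0; i < f->nr_pages; i++)
		total += f->page_mapcount[i];
	return total;
}

/* Models the CONFIG_PAGE_MAPCOUNT branch of the function. */
bool toy_old_likely_mapped_shared(const struct toy_folio *f)
{
	int mapcount = toy_mapcount(f);

	if (mapcount <= 1)
		return false;
	/* Some page must be mapped more than once. */
	if (f->entire_mapcount || mapcount > f->nr_pages)
		return true;
	/* Guess based on the first page only. */
	return f->page_mapcount[0] >= 2;
}
```

With a 4-page folio where two MMs each map only pages 2 and 3, page_mapcount is {0, 0, 2, 2}. The total of 4 does not exceed nr_pages and the first page is unmapped, so the sketch reports "mapped exclusively" although two MMs map the folio.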

In theory, we could implement this change only with CONFIG_MM_ID,
without gluing it to another config option. But we'll be a bit
careful for the time being, because folio_likely_mapped_shared() can now
return "false positives" more frequently. Glue it to
CONFIG_NO_PAGE_MAPCOUNT, which expresses the "EXPERIMENTAL" character for
now.

Let's reuse our new MM ownership tracking infrastructure for large folios.
Thoroughly document the changed semantics. We might now detect a
folio as "mapped shared" although it no longer is -- this can only happen
if more than two MMs mapped a folio at the same time, and neither of the
first two is the last one mapping the folio.
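
The false-positive scenario can be sketched with a simplified userspace model. This is an assumption-laden illustration, not the actual MM ownership tracking code: it assumes two tracking slots per folio, and struct toy_tracked_folio plus all toy_* helpers are hypothetical names. A folio counts as exclusive only if one tracked MM accounts for the whole mapcount.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_MM 0	/* hypothetical "slot is free" marker */

/* Hypothetical model: total mapcount plus two tracked MM slots. */
struct toy_tracked_folio {
	int mapcount;		/* all page table mappings of the folio */
	int slot_mm[2];		/* MM id per slot, TOY_NO_MM if free */
	int slot_mapcount[2];	/* mappings accounted to that slot's MM */
};

void toy_map(struct toy_tracked_folio *f, int mm)
{
	assert(mm != TOY_NO_MM);
	f->mapcount++;
	for (int i = 0; i < 2; i++) {	/* MM already tracked? */
		if (f->slot_mm[i] == mm) {
			f->slot_mapcount[i]++;
			return;
		}
	}
	for (int i = 0; i < 2; i++) {	/* Otherwise claim a free slot. */
		if (f->slot_mm[i] == TOY_NO_MM) {
			f->slot_mm[i] = mm;
			f->slot_mapcount[i] = 1;
			return;
		}
	}
	/* Both slots are taken: this MM stays untracked. */
}

void toy_unmap(struct toy_tracked_folio *f, int mm)
{
	assert(f->mapcount > 0);
	f->mapcount--;
	for (int i = 0; i < 2; i++) {
		if (f->slot_mm[i] == mm) {
			if (--f->slot_mapcount[i] == 0)
				f->slot_mm[i] = TOY_NO_MM;
			return;
		}
	}
}

bool toy_likely_mapped_shared(const struct toy_tracked_folio *f)
{
	if (f->mapcount <= 1)
		return false;
	/* Exclusive only if one tracked MM owns every mapping. */
	for (int i = 0; i < 2; i++)
		if (f->slot_mm[i] != TOY_NO_MM &&
		    f->slot_mapcount[i] == f->mapcount)
			return false;
	return true;
}
```

If MMs 1 and 2 each map one page, MM 3 maps two pages, and then MMs 1 and 2 unmap, the mapcount is 2 but no slot accounts for it, so the model still reports "mapped shared". If instead MM 1 is the last one mapping, its slot matches the mapcount and the folio is detected as exclusive again.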

"false positives" in this context are certainly better than "false
negatives" when it comes to enforcing policies (e.g., is process 1
allowed to migrate a folio that might also be used by another process?),
but in an ideal world we wouldn't have these "false positives" either.

It's worth noting that there will not be a change for small folios and
hugetlb folios. In general, for PMD-mapped THP we don't expect a change,
only for PTE-mapped THP.

This will affect various users of folio_likely_mapped_shared():

(1) khugepaged counts PTEs that target shared folios towards the
    max_ptes_shared. With false positives we might collapse too little,
    with false negatives too much.

(2) NUMA hinting: PROT_NONE NUMA protection will be skipped for shared
    folios in COW mappings. With false positives we skip too many, with
    false negatives we don't skip some we should be skipping.

    During NUMA hinting faults, we will set TNF_SHARED with shared folios
    in shared mappings. With false positives we set it too often, with
    false negatives not often enough.

    During NUMA hinting faults, we will refuse to migrate shared folios in
    mappings with execute permissions (expectation: shared libraries).
    With false positives we refuse to migrate some we could have migrated,
    with false negatives we migrate too many.

(3) MADV_COLD / MADV_PAGEOUT / MADV_FREE will not try splitting PTE-mapped
    THPs that are considered shared but not fully covered by the
    requested range, consequently not processing them. With false
    positives we will not split+process some we could have processed, with
    false negatives we split some folios we probably shouldn't have split.

(4) mbind() / migrate_pages() / move_pages() will refuse to migrate shared
    folios unless MPOL_MF_MOVE_ALL is effective (requires CAP_SYS_NICE).
    With false positives we refuse to migrate some folios that could be
    migrated, with false negatives we migrate some folios that shouldn't
    have been migrated.

(5) folio_referenced_one() will skip exclusive swapbacked folios in
    dying processes. Shared folios will not be skipped. With false
    positives we might skip this optimization, with false negatives we
    might apply this optimization wrongly.
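
As an illustration of item (4), the policy gate can be sketched in userspace as follows. TOY_MF_MOVE_ALL and toy_may_migrate() are hypothetical names standing in for MPOL_MF_MOVE_ALL and the kernel's check; the flag value is arbitrary.

```c
#include <stdbool.h>

#define TOY_MF_MOVE_ALL 0x1u	/* hypothetical stand-in for MPOL_MF_MOVE_ALL */

/*
 * A folio may be migrated if the caller asked to move all folios, or if
 * the folio is not considered shared.
 */
bool toy_may_migrate(bool likely_mapped_shared, unsigned int flags)
{
	return (flags & TOY_MF_MOVE_ALL) || !likely_mapped_shared;
}
```

A false positive turns the second operand false for a folio that is in fact exclusive, so the migration is refused without TOY_MF_MOVE_ALL. A false negative lets an unprivileged caller migrate a folio that another process also uses.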

Likely (3) and (4) are not really used a lot on folios that are heavily
shared among processes -- rather on anonymous memory (mostly from a
single parent process) or almost-exclusively mmap'ed files.

Similarly (1) is not expected to matter much in practice, and if so,
only for long-running child processes after fork(). But even here, it's
unlikely that it matters in practice.

(5) is not expected to matter much at all; it's a new optimization
either way.

(2) is interesting: the expectation here is that for anon folios it
might not make a big difference. For file-backed pages it might;
we'll have to learn about that.

Long story short: this paves the way for a complete
CONFIG_NO_PAGE_MAPCOUNT implementation, but maybe we'll have to
switch to another MM ownership tracking later.

Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---
 include/linux/mm.h | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98411e53da916..b37f20b26776d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2142,9 +2142,9 @@ static inline size_t folio_size(const struct folio *folio)
  * are independent.
  *
  * As precise information is not easily available for all folios, this function
- * estimates the number of MMs ("sharers") that are currently mapping a folio
- * using the number of times the first page of the folio is currently mapped
- * into page tables.
+ * must sometimes estimate the number of MMs ("sharers") that are currently
+ * mapping a folio using the number of times the first page of the folio is
+ * currently mapped into page tables.
  *
  * For small anonymous folios and anonymous hugetlb folios, the return
  * value will be exactly correct: non-KSM folios can only be mapped at most once
@@ -2152,13 +2152,21 @@ static inline size_t folio_size(const struct folio *folio)
  * considered shared even if mapped multiple times into the same MM.
  *
  * For other folios, the result can be fuzzy:
- *    #. For partially-mappable large folios (THP), the return value can wrongly
- *       indicate "mapped exclusively" (false negative) when the folio is
- *       only partially mapped into at least one MM.
+ *    #. With CONFIG_PAGE_MAPCOUNT: For partially-mappable large folios (THP),
+ *       the return value can wrongly indicate "mapped exclusively" (false
+ *       negative) when the folio is only partially mapped into at least one MM.
+ *    #. With CONFIG_NO_PAGE_MAPCOUNT: For partially-mappable large folios
+ *       (THP), the return value can wrongly indicate "mapped shared" (false
+ *       positive) in some scenarios. This can only happen if two MMs are
+ *       already mapping a folio and one more MM starts mapping the folio. We
+ *       would still detect the folio as "mapped shared" after the first
+ *       two MMs no longer map the folio.
  *    #. For pagecache folios (including hugetlb), the return value can wrongly
  *       indicate "mapped shared" (false positive) when two VMAs in the same MM
  *       cover the same file range.
  *
+ * With CONFIG_NO_PAGE_MAPCOUNT, this function will never return "false negatives".
+ *
  * Further, this function only considers current page table mappings that
  * are tracked using the folio mapcount(s).
  *
@@ -2183,12 +2191,16 @@ static inline bool folio_likely_mapped_shared(struct folio *folio)
 	if (mapcount <= 1)
 		return false;
 
+#ifdef CONFIG_PAGE_MAPCOUNT
 	/* If any page is mapped more than once we treat it "mapped shared". */
 	if (folio_entire_mapcount(folio) || mapcount > folio_large_nr_pages(folio))
 		return true;
 
 	/* Let's guess based on the first subpage. */
 	return atomic_read(&folio->_mapcount) > 0;
+#else /* !CONFIG_PAGE_MAPCOUNT */
+	return !folio_test_large_mapped_exclusively(folio);
+#endif /* !CONFIG_PAGE_MAPCOUNT */
 }
 
 #ifndef HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE
-- 
2.46.0