The Use Case
============

We have a scenario where multiple services (cgroups) may share the same
file cache, as illustrated below:

        download-proxy        application
                  \            /
           /shared_path/shared_files

When the application needs specific types of files, it sends an RPC
request to the download-proxy. The download-proxy then downloads the
files to the shared path, after which the application reads these
shared files. All disk I/O is buffered I/O; direct I/O is not used
because the download-proxy itself may also read these shared files,
since it doubles as a peer-to-peer (P2P) service:

        download-proxy of server1   <- P2P ->   download-proxy of server2
        /shared_path/shared_files               /shared_path/shared_files

The Problem
===========

Applications reading these shared files may use mlock to pin them in
memory for performance reasons. However, the shared file cache is
charged to the memory cgroup of the download-proxy during the download
or P2P process. Consequently, the page cache pages of the shared files
may end up mlocked while charged to the download-proxy's memcg, as
shown:

        download-proxy      application
              |                /
          (charged)       (mlocked)
              |              /
             pagecache pages
                   \
                    \
          /shared_path/shared_files

As a result, the memory usage of the download-proxy's memcg frequently
reaches its limit, potentially triggering OOM events. This behavior is
undesirable.

The Solution
============

To address this, we propose introducing a new cgroup file,
memory.nomlock, which, when set to 1, prevents page cache pages from
being mlocked in that memcg.

Implementation Options
----------------------

- Solution A: Allow file caches on the unevictable list to become
  reclaimable. This approach would require significant refactoring of
  the page reclaim logic.

- Solution B: Prevent file caches from being moved to the unevictable
  list during mlock and ignore the VM_LOCKED flag during page reclaim.
  This is the more straightforward solution and is the one we have
  chosen.

If the file caches are reclaimed from the download-proxy's memcg and
subsequently accessed by tasks in the application's memcg, a filemap
fault will occur. A new file cache will be faulted in, charged to the
application's memcg, and mlocked there.

Current Limitations
===================

This solution is in its early stages and has the following limitations:

- Timing dependency: memory.nomlock must be set before file caches are
  moved to the unevictable list. Otherwise, the file caches cannot be
  reclaimed.

- Metrics inaccuracy: The "unevictable" counter in memory.stat and the
  "Mlocked" counter in /proc/meminfo may not be reliable. However,
  these metrics are already affected by the use of large folios.

If this solution is deemed acceptable, I will proceed with refining the
implementation.
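For illustration, below is a minimal userspace sketch of how the
proposed knob would be used. It assumes cgroup v2 and a made-up cgroup
path for the download-proxy; per the timing limitation above, the bit
has to be set before the download-proxy starts faulting in the shared
files.

/*
 * Hypothetical usage sketch, not part of this patch: set the proposed
 * memory.nomlock knob for the download-proxy's cgroup before it starts
 * populating the shared files. The cgroup path is only an example.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *knob = "/sys/fs/cgroup/download-proxy/memory.nomlock";
	int fd = open(knob, O_WRONLY);

	if (fd < 0) {
		perror("open memory.nomlock");
		return 1;
	}

	/* "1": don't allow page cache charged to this memcg to be mlocked. */
	if (write(fd, "1", 1) != 1) {
		perror("write memory.nomlock");
		close(fd);
		return 1;
	}

	close(fd);
	return 0;
}

With the bit set, mlock_folio() skips the shared file cache charged to
the download-proxy, so it stays reclaimable; when the application later
touches and mlocks the data, it is re-faulted and charged to the
application's memcg as described above.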
Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
---
 mm/mlock.c  | 9 +++++++++
 mm/rmap.c   | 8 +++++++-
 mm/vmscan.c | 5 +++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index cde076fa7d5e..9cebcf13929f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -186,6 +186,7 @@ static inline struct folio *mlock_new(struct folio *folio)
 static void mlock_folio_batch(struct folio_batch *fbatch)
 {
 	struct lruvec *lruvec = NULL;
+	struct mem_cgroup *memcg;
 	unsigned long mlock;
 	struct folio *folio;
 	int i;
@@ -196,6 +197,10 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
 		folio = (struct folio *)((unsigned long)folio - mlock);
 		fbatch->folios[i] = folio;
 
+		memcg = folio_memcg(folio);
+		if (memcg && memcg->nomlock && mlock)
+			continue;
+
 		if (mlock & LRU_FOLIO)
 			lruvec = __mlock_folio(folio, lruvec);
 		else if (mlock & NEW_FOLIO)
@@ -241,8 +246,12 @@ bool need_mlock_drain(int cpu)
  */
 void mlock_folio(struct folio *folio)
 {
+	struct mem_cgroup *memcg = folio_memcg(folio);
 	struct folio_batch *fbatch;
 
+	if (memcg && memcg->nomlock)
+		return;
+
 	local_lock(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7..6f16f86f9274 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -853,11 +853,17 @@ static bool folio_referenced_one(struct folio *folio,
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	int referenced = 0;
 	unsigned long start = address, ptes = 0;
+	bool ignore_mlock = false;
+	struct mem_cgroup *memcg;
+
+	memcg = folio_memcg(folio);
+	if (memcg && memcg->nomlock)
+		ignore_mlock = true;
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
 
-		if (vma->vm_flags & VM_LOCKED) {
+		if (!ignore_mlock && vma->vm_flags & VM_LOCKED) {
 			if (!folio_test_large(folio) || !pvmw.pte) {
 				/* Restore the mlock which got missed */
 				mlock_vma_folio(folio, vma);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fd55c3ec0054..defd36be28e9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1283,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (folio_mapped(folio)) {
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = folio_test_swapbacked(folio);
+			struct mem_cgroup *memcg;
 
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
@@ -1301,6 +1302,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			if (folio_test_large(folio))
 				flags |= TTU_SYNC;
 
+			memcg = folio_memcg(folio);
+			if (memcg && memcg->nomlock)
+				flags |= TTU_IGNORE_MLOCK;
+
 			try_to_unmap(folio, flags);
 			if (folio_mapped(folio)) {
 				stat->nr_unmap_fail += nr_pages;
-- 
2.43.5
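The hunks above test memcg->nomlock, but this RFC does not yet add the
field or the memory.nomlock file itself. Purely as a sketch of the
assumed memcontrol side (names and placement are illustrative, not part
of this patch), it could look roughly like this:

/*
 * Sketch only -- not part of the posted diff. It assumes a new
 * "bool nomlock" field in struct mem_cgroup (include/linux/memcontrol.h)
 * toggled via memory.nomlock, plus the handlers below in
 * mm/memcontrol.c, hooked into the memory controller's cgroup v2
 * cftype table.
 */
static u64 memory_nomlock_read(struct cgroup_subsys_state *css,
			       struct cftype *cft)
{
	return mem_cgroup_from_css(css)->nomlock;
}

static int memory_nomlock_write(struct cgroup_subsys_state *css,
				struct cftype *cft, u64 val)
{
	if (val > 1)
		return -EINVAL;

	/*
	 * Only affects future mlock attempts; folios already on the
	 * unevictable list are not moved back (the timing limitation
	 * noted in the changelog).
	 */
	mem_cgroup_from_css(css)->nomlock = val;
	return 0;
}

/* Entries to be merged into the memory controller's cftype array. */
static struct cftype memory_nomlock_files[] = {
	{
		.name = "nomlock",
		.flags = CFTYPE_NOT_ON_ROOT,
		.read_u64 = memory_nomlock_read,
		.write_u64 = memory_nomlock_write,
	},
	{ }	/* terminate */
};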