[merged mm-hotfixes-stable] mm-swap-fix-allocation-and-scanning-race-with-swapoff.patch removed from -mm tree

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Thu, 14 Nov 2024 15:25:44 -0800

The quilt patch titled
     Subject: mm, swap: fix allocation and scanning race with swapoff
has been removed from the -mm tree.  Its filename was
     mm-swap-fix-allocation-and-scanning-race-with-swapoff.patch

This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Kairui Song <kasong@xxxxxxxxxxx>
Subject: mm, swap: fix allocation and scanning race with swapoff
Date: Tue, 12 Nov 2024 16:34:14 +0800

There are two flags used to synchronize allocation and scanning with
swapoff: SWP_WRITEOK and SWP_SCANNING.

SWP_WRITEOK: Swapoff first clears this flag.  From that point on, any
further swap allocation or scanning on this device should abort, so that no
new entries reference the device.  Swapoff then unuses all existing swap
entries.

SWP_SCANNING: This flag is set while the device is being scanned.  Swapoff
waits for all scanners to stop before the final release of the swap device
structures, to avoid a use-after-free (UAF).  Note that this flag is the
highest used bit of si->flags, so it can be added up arithmetically when
there are multiple scanners.
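
The counting trick described above can be sketched in userspace C.  This is
an illustration only, not kernel code: the DEMO_* constants and demo_*
helpers are hypothetical names, and the bit positions are assumed values
chosen so that DEMO_SCANNING is the highest bit in use.

```c
/*
 * Userspace sketch: a flag in the highest used bit doubles as a counter.
 * All names and bit positions here are assumptions for illustration.
 */
#define DEMO_WRITEOK   (1u << 1)
#define DEMO_SCANNING  (1u << 14)	/* assumed highest used bit */

/* A scanner enters: add the flag value instead of OR-ing it in. */
static void demo_scan_begin(unsigned int *flags)
{
	*flags += DEMO_SCANNING;
}

/* A scanner leaves: subtract the same value. */
static void demo_scan_end(unsigned int *flags)
{
	*flags -= DEMO_SCANNING;
}

/* Number of active scanners: everything at or above the flag bit. */
static unsigned int demo_scanner_count(unsigned int flags)
{
	return flags / DEMO_SCANNING;
}

/* The ordinary flags below the counter are left untouched. */
static unsigned int demo_low_flags(unsigned int flags)
{
	return flags & (DEMO_SCANNING - 1);
}
```

Because no flag sits above DEMO_SCANNING, repeated additions only carry
into unused higher bits, so the lower flags stay intact while the upper bits
count the scanners.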

Commit 5f843a9a3a1e ("mm: swap: separate SSD allocation from
scan_swap_map_slots()") ignored the SWP_SCANNING and SWP_WRITEOK flags
while separating the cluster allocation path from the old allocation path.
Add the flags back to fix the swapoff race.  The race is hard to trigger
because si->lock prevents most parallel operations, but si->lock can be
dropped for reclaim or discard.  This issue was found during code review.

This commit fixes the problem.  For SWP_SCANNING, just like before, set
the flag before the scan and remove it afterwards.

For SWP_WRITEOK, there are several places where si->lock could be dropped.
Covering these places one by one would be error-prone and make the code hard
to follow, so just do one check before the real allocation, which is also
very similar to the old behavior.  With the new cluster allocator this may
waste a bit of time iterating the clusters, but it won't take long, and
swapoff is not performance sensitive.
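
The "one check before the real allocation" idea can be modelled with a
small userspace sketch.  It is not the kernel implementation: struct
demo_dev, DEMO_WRITEOK and the demo_* functions are hypothetical, and
locking is left out entirely.  The point is only that the commit step
refuses to modify any state once the flag has been cleared, and the caller
stops on the first refusal.

```c
#include <stdbool.h>

/* Hypothetical flag value, for illustration only. */
#define DEMO_WRITEOK (1u << 1)

/* Hypothetical stand-in for a swap device. */
struct demo_dev {
	unsigned int flags;
	unsigned int used;	/* pages handed out so far */
};

/*
 * Commit step: check the flag once, right before touching any state.
 * Returns false without side effects if the device is going away.
 */
static bool demo_alloc_range(struct demo_dev *dev, unsigned int nr_pages)
{
	if (!(dev->flags & DEMO_WRITEOK))
		return false;

	dev->used += nr_pages;
	return true;
}

/* Caller: keep allocating single pages until done or refused. */
static unsigned int demo_alloc_many(struct demo_dev *dev, unsigned int want)
{
	unsigned int got = 0;

	while (got < want) {
		if (!demo_alloc_range(dev, 1))
			break;	/* device is being turned off, abort */
		got++;
	}
	return got;
}
```

Scanning done before the check may be wasted when the flag is already
clear, which matches the trade-off described above: a little extra
iteration in exchange for a single, easy-to-audit check.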

Link: https://lkml.kernel.org/r/20241112083414.78174-1-ryncsn@xxxxxxxxx
Fixes: 5f843a9a3a1e ("mm: swap: separate SSD allocation from scan_swap_map_slots()")
Reported-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
Closes: https://lore.kernel.org/linux-mm/87a5es3f1f.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
Cc: Barry Song <v-songbaohua@xxxxxxxx>
Cc: Chris Li <chrisl@xxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Kalesh Singh <kaleshsingh@xxxxxxxxxx>
Cc: Ryan Roberts <ryan.roberts@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/swapfile.c |   22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

--- a/mm/swapfile.c~mm-swap-fix-allocation-and-scanning-race-with-swapoff
+++ a/mm/swapfile.c
@@ -664,12 +664,15 @@ static bool cluster_scan_range(struct sw
 	return true;
 }
 
-static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
+static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
 				unsigned int start, unsigned char usage,
 				unsigned int order)
 {
 	unsigned int nr_pages = 1 << order;
 
+	if (!(si->flags & SWP_WRITEOK))
+		return false;
+
 	if (cluster_is_free(ci)) {
 		if (nr_pages < SWAPFILE_CLUSTER) {
 			list_move_tail(&ci->list, &si->nonfull_clusters[order]);
@@ -690,6 +693,8 @@ static void cluster_alloc_range(struct s
 		list_move_tail(&ci->list, &si->full_clusters);
 		ci->flags = CLUSTER_FLAG_FULL;
 	}
+
+	return true;
 }
 
 static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
@@ -713,7 +718,10 @@ static unsigned int alloc_swap_scan_clus
 
 	while (offset <= end) {
 		if (cluster_scan_range(si, ci, offset, nr_pages)) {
-			cluster_alloc_range(si, ci, offset, usage, order);
+			if (!cluster_alloc_range(si, ci, offset, usage, order)) {
+				offset = SWAP_NEXT_INVALID;
+				goto done;
+			}
 			*foundp = offset;
 			if (ci->count == SWAPFILE_CLUSTER) {
 				offset = SWAP_NEXT_INVALID;
@@ -805,7 +813,11 @@ new_cluster:
 	if (!list_empty(&si->free_clusters)) {
 		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
 		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
-		VM_BUG_ON(!found);
+		/*
+		 * Either we didn't touch the cluster due to swapoff,
+		 * or the allocation must success.
+		 */
+		VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
 		goto done;
 	}
 
@@ -1041,6 +1053,8 @@ static int cluster_alloc_swap(struct swa
 
 	VM_BUG_ON(!si->cluster_info);
 
+	si->flags += SWP_SCANNING;
+
 	while (n_ret < nr) {
 		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
 
@@ -1049,6 +1063,8 @@ static int cluster_alloc_swap(struct swa
 		slots[n_ret++] = swp_entry(si->type, offset);
 	}
 
+	si->flags -= SWP_SCANNING;
+
 	return n_ret;
 }
 
_

Patches currently in -mm which might be from kasong@xxxxxxxxxxx are