Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Tue, 6 Sep 2022 20:30:53 -0700

On 09/07/22 11:07, Miaohe Lin wrote:
> On 2022/9/7 10:37, Mike Kravetz wrote:
> > On 09/07/22 10:11, Miaohe Lin wrote:
> >> On 2022/9/7 7:08, Mike Kravetz wrote:
> >>> On 09/06/22 11:05, Mike Kravetz wrote:
> >>>> On 09/06/22 09:48, Mike Kravetz wrote:
> >>>>> On 09/06/22 15:57, Sven Schnelle wrote:
> >>>>>>
> >>>>>> With linux next starting from next-20220831 i see hangs with this
> >>>>>> patch applied while running the glibc test suite. The patch doesn't
> >>>>>> revert cleanly on top, so i checked out one commit before that one and
> >>>>>> with that revision everything works.
> >>>>>>
> >>>>>> It looks like the malloc test suite in glibc triggers this. I cannot
> >>>>>> identify a single test causing it, but instead the combination of
> >>>>>> multiple tests. Running the test suite on a single CPU works. Given the
> >>>>>> subject of the patch that's likely not a surprise.
> >>>>>>
> >>>>>> This is on s390, and the warning i get from RCU is:
> >>>>>>
> >>>>>> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> >>>>>> [ 1951.907009] rcu:     60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> >>>>>> [ 1951.907018]  (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> >>>>>> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> >>>>>> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> >>>>>> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> >>>>>> [ 1951.907044]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> >>>>>> [ 1951.907095] Call Trace:
> >>>>>> [ 1951.907098]  [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> >>>>>> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> >>>>>> [ 1951.907107]  [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> >>>>>> [ 1951.907109]  [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> >>>>>> [ 1951.907111]  [<000000000044fe96>] evict+0xe6/0x1c0
> >>>>>> [ 1951.907116]  [<000000000044a608>] __dentry_kill+0x108/0x1e0
> >>>>>> [ 1951.907119]  [<000000000044ac64>] dentry_kill+0x6c/0x290
> >>>>>> [ 1951.907121]  [<000000000044afec>] dput+0x164/0x1c0
> >>>>>> [ 1951.907123]  [<000000000042a4d6>] __fput+0xee/0x290
> >>>>>> [ 1951.907127]  [<00000000001794a8>] task_work_run+0x88/0xe0
> >>>>>> [ 1951.907133]  [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> >>>>>> [ 1951.907137]  [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> >>>>>> [ 1951.907142]  [<0000000000d1d392>] system_call+0x82/0xb0
> >>>>>> [ 1951.907145] Last Breaking-Event-Address:
> >>>>>> [ 1951.907146]  [<0000038001d839c0>] 0x38001d839c0
> >>>>>>
> >>>>>> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
> >>>>>>
> >>>>>> Any thoughts?
> >>>>>
> >>>>> Thanks for the report, I will take a look.
> >>>>>
> >>>>> My first thought is that this fix may not be applied,
> >>>>> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
> >>>>> However, I see that that is in next-20220831.
> >>>>>
> >>>>> Hopefully, this will recreate on x86.
> >>>>
> >>>> One additional thought ...
> >>>>
> >>>> With this patch, we will take the hugetlb fault mutex for EVERY index in the
> >>>> range being truncated or hole punched.  In the case of a very large file, that
> >>>> is no different than code today where we take the mutex when removing pages
> >>>> from the file.  What is different is taking the mutex for indices that are
> >>>> part of holes in the file.  Consider a very large file with only one page at
> >>>> the very large offset.  We would then take the mutex for each index in that
> >>>> very large hole.  Depending on the size of the hole, this could appear as a
> >>>> hang.
> >>>>
> >>>> For the above locking scheme to work, we need to take the mutex for indices
> >>>> in holes in case there would happen to be a racing page fault.  However, there
> >>>> are only a limited number of fault mutexes (it is a table).  So, we only really
> >>>> need to take at a maximum num_fault_mutexes mutexes.  We could keep track of
> >>>> these with a bitmap.
> >>>>
> >>>> I am not sure this is the issue you are seeing, but a test named
> >>>> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
> >>>>
> >>>> In any case, I think this issue needs to be addressed before this series can
> >>>> move forward.
> >>>
> >>> Well, even if we address the issue of taking the same mutex multiple times,
> >>
> >> Can we change to take all the hugetlb fault mutex at the same time to ensure every possible
> >> future hugetlb page fault will see a truncated i_size? Then we could just drop all the hugetlb
> >> fault mutex before doing any heavy stuff? It seems hugetlb fault mutex could be dropped when
> >> new i_size is guaranteed to be visible for any future hugetlb page fault users?
> >> But I might miss something...
> > 
> > Yes, that is the general direction and would work well for truncation.  However,
> > the same routine remove_inode_hugepages is used for hole punch, and I am pretty
> > sure we want to take the fault mutex there as it can race with page faults.
> 
> Oh, sorry. I missed that case.
> 
> > 
> >>
> >>> this new synchronization scheme requires a folio lookup for EVERY index in
> >>> the truncated or hole punched range.  This can easily 'stall' a CPU if there
> >>
> >> If above thought holds, we could do batch folio lookup instead. Hopes my thought will help. ;)
> >>
> > 
> > Yes, I have some promising POC code with two batch lookups in case of holes.
> > Hope to send something soon.
> 
> That will be really nice. ;)
> 

Hi Sven,

Would you be willing to try the patch below in your environment?
It addresses the stall I can create with a file that has a VERY large hole.
In addition, it passes libhugetlbfs tests and has run for a while in my
truncate/page fault race stress test.  However, it is very early code.
It would be nice to see if it addresses the issue in your environment.


>From 10c58195db9ed8aeff84c63ea2baf6c007651e42 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Date: Tue, 6 Sep 2022 19:59:47 -0700
Subject: [PATCH] hugetlb: redo remove_inode_hugepages algorithm for racing
 page faults

Previously, remove_inode_hugepages would take the fault mutes for EVERY
index in a file and look for a folio.  This included holes in the file.
For very large sparse files, this could result in minutes(or more) or
kernel time.  Taking the fault mutex for every index is a requirement if
we want to use it for synchronization with page faults.

This patch adjusts the algorithm slightly to take large holes into
account.  It tracks which fault mutexes have been taken so that it will
not needlessly take the same mutex more than once.  Also, we skip
looking for folios in holes.  Instead, we make a second scan of the file
with a bulk lookup.

Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
---
 fs/hugetlbfs/inode.c    | 70 ++++++++++++++++++++++-------------------
 include/linux/hugetlb.h | 16 ++++++++++
 mm/hugetlb.c            | 48 ++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f1d6da1bafb..ce1eb6202179 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -578,37 +578,27 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
 
 /*
  * Take hugetlb fault mutex for a set of inode indicies.
- * Check for and remove any found folios.  Return the number of
- * any removed folios.
- *
  */
-static long fault_lock_inode_indicies(struct hstate *h,
+static void fault_lock_inode_indicies(struct hstate *h,
 					struct inode *inode,
 					struct address_space *mapping,
 					pgoff_t start, pgoff_t end,
-					bool truncate_op)
+					bool truncate_op,
+					struct hugetlb_fault_mutex_track *hfmt)
 {
-	struct folio *folio;
-	long freed = 0;
+	long holes = 0;
 	pgoff_t index;
 	u32 hash;
 
-	for (index = start; index < end; index++) {
+	for (index = start; index < end &&
+			    !hugetlb_fault_mutex_track_all_set(hfmt);
+			    index++) {
 		hash = hugetlb_fault_mutex_hash(mapping, index);
-		mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
-		folio = filemap_get_folio(mapping, index);
-		if (folio) {
-			if (remove_inode_single_folio(h, inode, mapping, folio,
-							index, truncate_op))
-				freed++;
-			folio_put(folio);
-		}
-
-		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		hugetlb_fault_mutex_track_lock(hfmt, hash, false);
+		hugetlb_fault_mutex_track_unlock(hfmt, hash, false);
+		holes++;
+		cond_resched();
 	}
-
-	return freed;
 }
 
 /*
@@ -656,7 +646,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 	long freed = 0;
 	u32 hash;
 	bool truncate_op = (lend == LLONG_MAX);
+	struct hugetlb_fault_mutex_track *hfmt;
+	bool rescan = true;
+	unsigned long holes;
 
+	hfmt = hugetlb_fault_mutex_track_alloc();
+
+rescan:
+	holes = 0;
 	folio_batch_init(&fbatch);
 	next = m_start = start;
 	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
@@ -665,36 +662,45 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 
 			index = folio->index;
 			/*
-			 * Take fault mutex for missing folios before index,
-			 * while checking folios that might have been added
-			 * due to a race with fault code.
+			 * Take fault mutex for missing folios before index
 			 */
-			freed += fault_lock_inode_indicies(h, inode, mapping,
-						m_start, index, truncate_op);
+			holes += (index - m_start);
+			if (rescan)	/* no need on second pass */
+				fault_lock_inode_indicies(h, inode, mapping,
+					m_start, index, truncate_op, hfmt);
 
 			/*
 			 * Remove folio that was part of folio_batch.
+			 * Force taking fault mutex.
 			 */
 			hash = hugetlb_fault_mutex_hash(mapping, index);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			hugetlb_fault_mutex_track_lock(hfmt, hash, true);
 			if (remove_inode_single_folio(h, inode, mapping, folio,
 							index, truncate_op))
 				freed++;
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			hugetlb_fault_mutex_track_unlock(hfmt, hash, true);
 		}
 		folio_batch_release(&fbatch);
 		cond_resched();
 	}
 
 	/*
-	 * Take fault mutex for missing folios at end of range while checking
-	 * for folios that might have been added due to a race with fault code.
+	 * Take fault mutex for missing folios at end of range
 	 */
-	freed += fault_lock_inode_indicies(h, inode, mapping, m_start, m_end,
-						truncate_op);
+	holes += (m_end - m_start);
+	if (rescan)
+		fault_lock_inode_indicies(h, inode, mapping, m_start, m_end,
+						truncate_op, hfmt);
+
+	if (holes && rescan) {
+		rescan = false;
+		goto rescan;	/* can happen at most once */
+	}
 
 	if (truncate_op)
 		(void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
+
+	hugetlb_fault_mutex_track_free(hfmt);
 }
 
 static void hugetlbfs_evict_inode(struct inode *inode)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 852f911d676e..dc532d2e42d0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -180,8 +180,24 @@ void putback_active_hugepage(struct page *page);
 void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
 void free_huge_page(struct page *page);
 void hugetlb_fix_reserve_counts(struct inode *inode);
+
 extern struct mutex *hugetlb_fault_mutex_table;
 u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
+struct hugetlb_fault_mutex_track {
+	bool all_set;
+	unsigned long *track_bits;
+};
+struct hugetlb_fault_mutex_track *hugetlb_fault_mutex_track_alloc(void);
+void hugetlb_fault_mutex_track_lock(struct hugetlb_fault_mutex_track *hfmt,
+				u32 hash, bool force);
+void hugetlb_fault_mutex_track_unlock(struct hugetlb_fault_mutex_track *hfmt,
+				u32 hash, bool force);
+static inline bool hugetlb_fault_mutex_track_all_set(
+				struct hugetlb_fault_mutex_track *hfmt)
+{
+	return hfmt->all_set;
+}
+void hugetlb_fault_mutex_track_free(struct hugetlb_fault_mutex_track *hfmt);
 
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d0617d64d718..d9dfc4736928 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5788,6 +5788,54 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
 }
 #endif
 
+struct hugetlb_fault_mutex_track *hugetlb_fault_mutex_track_alloc(void)
+{
+	struct hugetlb_fault_mutex_track *hfmt;
+
+	hfmt = kmalloc(ALIGN(sizeof(struct hugetlb_fault_mutex_track),
+					sizeof(unsigned long)) +
+			sizeof(unsigned long) *
+					BITS_TO_LONGS(num_fault_mutexes),
+			GFP_KERNEL);
+	if (hfmt) {
+		hfmt->track_bits = (void *)hfmt +
+			ALIGN(sizeof(struct hugetlb_fault_mutex_track),
+					sizeof(unsigned long));
+
+		hfmt->all_set = false;
+		bitmap_zero(hfmt->track_bits, num_fault_mutexes);
+	}
+
+	return hfmt;
+}
+
+void hugetlb_fault_mutex_track_lock(struct hugetlb_fault_mutex_track *hfmt,
+				u32 hash, bool force)
+{
+	if (!hfmt || !test_bit((int)hash, hfmt->track_bits) || force) {
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+		/* set bit when unlocking */
+	}
+}
+
+void hugetlb_fault_mutex_track_unlock(struct hugetlb_fault_mutex_track *hfmt,
+				u32 hash, bool force)
+{
+	if (!hfmt || !test_bit((int)hash, hfmt->track_bits) || force) {
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		if (hfmt && !hfmt->all_set) {
+			set_bit((int)hash, hfmt->track_bits);
+			if (bitmap_full(hfmt->track_bits, num_fault_mutexes))
+				hfmt->all_set = true;
+		}
+	}
+}
+
+void hugetlb_fault_mutex_track_free(struct hugetlb_fault_mutex_track *hfmt)
+{
+	kfree(hfmt);
+}
+
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
-- 
2.37.2