Re: xfs/folio splat with v6.14-rc1

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Mon, 10 Feb 2025 19:20:58 +1030

On 2025/2/10 18:48, Qi Zheng wrote:
Hi all,

On 2025/2/10 12:02, Qi Zheng wrote:
Hi Zi,

On 2025/2/10 11:35, Zi Yan wrote:
On 7 Feb 2025, at 17:17, Matthew Wilcox wrote:

On Fri, Feb 07, 2025 at 04:29:36PM +0100, Christian Brauner wrote:
while true; do ./xfs.run.sh "generic/437"; done

allows me to reproduce this fairly quickly.

on holiday, back monday

git bisect points to commit
4817f70c25b6 ("x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64").
Qi is cc'd.

After deselecting PT_RECLAIM on v6.14-rc1, the issue is gone.
At least, no splat after running for more than 300s,
whereas the splat is usually triggered after ~20s with
PT_RECLAIM set.

The PT_RECLAIM mainly made the following two changes:

1) try to reclaim page table pages during madvise(MADV_DONTNEED)
2) Unconditionally select MMU_GATHER_RCU_TABLE_FREE

Will ./xfs.run.sh "generic/437" perform the madvise(MADV_DONTNEED)?

Anyway, I will try to reproduce it locally and troubleshoot it.

I reproduced it locally and it was indeed caused by PT_RECLAIM.

The root cause is that zap_pte_range() may release the pte lock midway
and then retry. While the lock is dropped, pte entries that were
originally none may be refilled with physical pages.

So we should recheck all pte entries in this case:

diff --git a/mm/memory.c b/mm/memory.c
index a8196ae72e9ae..ca1b133a288b5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1721,7 +1721,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
         pmd_t pmdval;
         unsigned long start = addr;
         bool can_reclaim_pt = reclaim_pt_is_enabled(start, end, details);
-       bool direct_reclaim = false;
+       bool direct_reclaim = true;
         int nr;

  retry:
@@ -1736,8 +1736,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
         do {
                 bool any_skipped = false;

-               if (need_resched())
+               if (need_resched()) {
+                       direct_reclaim = false;
                         break;
+               }

                 nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
                                       &force_flush, &force_break, &any_skipped);
@@ -1745,11 +1747,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
                         can_reclaim_pt = false;
                 if (unlikely(force_break)) {
                         addr += nr * PAGE_SIZE;
+                       direct_reclaim = false;
                         break;
                 }
         } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);

-       if (can_reclaim_pt && addr == end)
+       if (can_reclaim_pt && direct_reclaim && addr == end)
                 direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);

         add_mm_rss_vec(mm, rss);

I tested the above code and no bugs were reported for a while. Does it
work for you?

Tested 128 generic/437 runs with CONFIG_PT_RECLAIM on btrfs.
No more crashes. I will do a longer run, but it looks like the bug is fixed.

Until the fix is merged, I'll deselect PT_RECLAIM as a workaround for my
runs on the btrfs/for-next branch.

Thanks,
Qu


Thanks,
Qi


Thanks!


--
Best Regards,
Yan, Zi