On Tue, Nov 1, 2022 at 3:17 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote:
> > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > > This is reading and writing the same amount of file data at the
> > > > > application level, but once the data has been written and kicked out
> > > > > of the page cache it seems to require an awful lot more read IO to
> > > > > get it back to the application. i.e. this looks like mmap() is
> > > > > readahead thrashing severely, and eventually it livelocks with this
> > > > > sort of report:
> > > > >
> > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > > [175901.995614] Call Trace:
> > > > > [175901.996090] <TASK>
> > > > > [175901.996594] ? __schedule+0x301/0xa30
> > > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > > [175902.000714] ? xas_start+0x53/0xc0
> > > > > [175902.001484] ? xas_load+0x24/0xa0
> > > > > [175902.002208] ? xas_load+0x5/0xa0
> > > > > [175902.002878] ? __filemap_get_folio+0x87/0x340
> > > > > [175902.003823] ? filemap_fault+0x139/0x8d0
> > > > > [175902.004693] ? __do_fault+0x31/0x1d0
> > > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0
> > > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0
> > > > > [175902.006998] ? exc_page_fault+0x1d9/0x810
> > > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30
> > > > > [175902.008613] </TASK>
> > > > >
> > > > > Given that filemap_fault on XFS is probably trying to map large
> > > > > folios, I do wonder if this is a result of some kind of race with
> > > > > teardown of a large folio...
> > > >
> > > > It doesn't matter whether we're trying to map a large folio; it
> > > > matters whether a large folio was previously created in the cache.
> > > > Through the magic of readahead, it may well have been. I suspect
> > > > it's not teardown of a large folio, but splitting. Removing a
> > > > page from the page cache stores to the pointer in the XArray
> > > > first (either NULL or a shadow entry), then decrements the refcount.
> > > >
> > > > We must be observing a frozen folio. There are a number of places
> > > > in the MM which freeze a folio, but the obvious one is splitting.
> > > > That looks like this:
> > > >
> > > >         local_irq_disable();
> > > >         if (mapping) {
> > > >                 xas_lock(&xas);
> > > >                 (...)
> > > >                 if (folio_ref_freeze(folio, 1 + extra_pins)) {
> > >
> > > But the lookup is not doing anything to prevent the split on the
> > > frozen page from making progress, right? It's not holding any folio
> > > references, and it's not holding the mapping tree lock, either. So
> > > how does the lookup in progress prevent the page split from making
> > > progress?
> >
> > My thinking was that it keeps hammering the ->refcount field in
> > struct folio. That might prevent a thread on a different socket
> > from making forward progress.
> > In contrast, spinlocks are designed
> > to be fair under contention, so by spinning on an actual lock, we'd
> > remove contention on the folio.
> >
> > But I think the tests you've done refute that theory. I'm all out of
> > ideas at the moment. Either we have a frozen folio from somebody who
> > doesn't hold the lock, or we have someone who's left a frozen folio in
> > the page cache. I'm leaning towards that explanation at the moment,
> > but I don't have a good suggestion for debugging.
>
> It's something else. I got gdb attached to qemu and single stepped
> the looping lookup. The context I caught this time is truncate after
> unlink:
>
> (gdb) bt
> #0 find_get_entry (mark=<optimized out>, max=<optimized out>, xas=<optimized out>) at mm/filemap.c:2014
> #1 find_lock_entries (mapping=mapping@entry=0xffff8882445e2118, start=start@entry=25089, end=end@entry=18446744073709551614,
>     fbatch=fbatch@entry=0xffffc900082a7dd8, indices=indices@entry=0xffffc900082a7d60) at mm/filemap.c:2095
> #2 0xffffffff8128f024 in truncate_inode_pages_range (mapping=mapping@entry=0xffff8882445e2118, lstart=lstart@entry=0, lend=lend@entry=-1)
>     at mm/truncate.c:364
> #3 0xffffffff8128f452 in truncate_inode_pages (lstart=0, mapping=0xffff8882445e2118) at mm/truncate.c:452
> #4 0xffffffff8136335d in evict (inode=inode@entry=0xffff8882445e1f78) at fs/inode.c:666
> #5 0xffffffff813636cc in iput_final (inode=0xffff8882445e1f78) at fs/inode.c:1747
> #6 0xffffffff81355b8b in do_unlinkat (dfd=dfd@entry=10, name=0xffff88834170e000) at fs/namei.c:4326
> #7 0xffffffff81355cc3 in __do_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4362
> #8 __se_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4355
> #9 __x64_sys_unlinkat (regs=<optimized out>) at fs/namei.c:4355
> #10 0xffffffff81e92e35 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900082a7f58) at arch/x86/entry/common.c:50
> #11 do_syscall_64 (regs=0xffffc900082a7f58, nr=<optimized out>) at arch/x86/entry/common.c:80
> #12 0xffffffff82000087 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
> #13 0x0000000000000000 in ?? ()
>
> The find_lock_entries() call is being asked to start at index
> 25089, and we are spinning on a folio we find because
> folio_try_get_rcu(folio) is failing - the folio ref count is zero.
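(Side note for anyone following along: folio_try_get_rcu() boils down
to an add-unless-zero on folio->_refcount, so it can never succeed
while a split holds the count frozen at 0. A minimal sketch,
paraphrased from include/linux/page_ref.h of roughly this vintage,
with the CONFIG_TINY_RCU and debug paths trimmed, so not a verbatim
copy:

	static inline bool folio_try_get_rcu(struct folio *folio)
	{
		/*
		 * folio_ref_add_unless() is an atomic add-unless on
		 * _refcount: it does nothing and returns false while
		 * _refcount == 0 -- exactly the state folio_ref_freeze()
		 * leaves the folio in for the duration of a split.
		 */
		return folio_ref_add_unless(folio, 1, 0);
	}

So the lookup has nothing to block on; it just retries until someone
unfreezes the refcount.)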
>
> The xas state on lookup is:
>
> (gdb) p *xas
> $6 = {xa = 0xffff8882445e2120, xa_index = 25092, xa_shift = 0 '\000', xa_sibs = 0 '\000', xa_offset = 4 '\004', xa_pad = 0 '\000',
>   xa_node = 0xffff888144c15918, xa_alloc = 0x0 <fixed_percpu_data>, xa_update = 0x0 <fixed_percpu_data>, xa_lru = 0x0 <fixed_percpu_data>}
>
> indicating that we are trying to look up index 25092 (3 pages
> further in than the start of the batch), and the folio that this
> keeps returning is this:
>
> (gdb) p *folio
> $7 = {{{flags = 24769796876795904, {lru = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008,
>   mlock_count = 1073569616}}, mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 0x8 <fixed_percpu_data+8>, _mapcount = {
>   counter = -129}, _refcount = {counter = 0}, memcg_data = 0}, page = {flags = 24769796876795904, {{{lru = {next = 0xffffea0005690008,
>   prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008, mlock_count = 1073569616}, buddy_list = {
>   next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, pcp_list = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}},
>   mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 8}, {pp_magic = 18446719884544507912, pp = 0xffff88823ffd5f50,
>   _pp_mapping_pad = 0, dma_addr = 18688, {dma_addr_upper = 8, pp_frag_count = {counter = 8}}}, {
>   compound_head = 18446719884544507912, compound_dtor = 80 'P', compound_order = 95 '_', compound_mapcount = {counter = -30590},
>   compound_pincount = {counter = 0}, compound_nr = 0}, {_compound_pad_1 = 18446719884544507912,
>   _compound_pad_2 = 18446612691733536592, deferred_list = {next = 0x0 <fixed_percpu_data>,
>   prev = 0x4900 <irq_stack_backing_store+10496>}}, {_pt_pad_1 = 18446719884544507912, pmd_huge_pte = 0xffff88823ffd5f50,
>   _pt_pad_2 = 0, {pt_mm = 0x4900 <irq_stack_backing_store+10496>, pt_frag_refcount = {counter = 18688}},
>   ptl = 0x8 <fixed_percpu_data+8>}, {pgmap = 0xffffea0005690008, zone_device_data = 0xffff88823ffd5f50}, callback_head = {
>   next = 0xffffea0005690008, func = 0xffff88823ffd5f50}}, {_mapcount = {counter = -129}, page_type = 4294967167}, _refcount = {
>   counter = 0}, memcg_data = 0}}, _flags_1 = 24769796876795904, __head = 0, _folio_dtor = 3 '\003', _folio_order = 8 '\b',
>   _total_mapcount = {counter = -1}, _pincount = {counter = 0}, _folio_nr_pages = 0}
> (gdb)
>
> The folio has a NULL mapping, and an index of 18688, which means
> even if it was not a folio that has been invalidated or freed, the
> index is way outside the range we are looking for.
>
> If I step it round the lookup loop, xas does not change, and the
> same folio is returned every time through the loop. Perhaps
> the mapping tree itself might be corrupt???
>
> It's simple enough to stop the machine once it has become stuck to
> observe the iteration and dump structures, just tell me what you
> need to know from here...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

This bug has emerged again, and I would like to propose a reproduction
sequence for it. It has nothing to do with the scheduler (this could be
wrong, and sorry for wasting your time if so):

Thread_isolate:
1. alloc_contig_range->isolate_migratepages_block isolates a batch of
   pages onto cc->migratepages (the folio's refcount is 1 + n: one from
   alloc_pages plus n from the page cache).
2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 +
   extra_pins) freezes folio->refcnt to 0.
3. alloc_contig_range->migrate_pages->xas_split splits the entry,
   storing the (still frozen) folio into every slot from slot[offset]
   to slot[offset + sibs].

Thread_truncate:
4. enters the livelock via the chain below (fleshed out in the sketch
   after the proposed patch):

	rcu_read_lock();
	find_get_entry
		folio = xas_find
		if (!folio_try_get_rcu)
			xas_reset;
	rcu_read_unlock();

4'. alloc_contig_range->migrate_pages->__split_huge_page would set the
    folio's refcnt back to 2 and break the livelock, but it is blocked
    by contention on lruvec->lock.

If the above call chain makes sense, could we solve this with the
modification below, which makes xas_split and __split_huge_page atomic
with respect to each other by taking lruvec->lock earlier than we do
now?

int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
		unsigned int new_order)
{
	...
+	lruvec = folio_lruvec_lock(folio);
	if (mapping) {
		int nr = folio_nr_pages(folio);

		xas_split(&xas, folio, folio_order(folio));
		if (folio_test_pmd_mappable(folio) &&
		    new_order < HPAGE_PMD_ORDER) {
			if (folio_test_swapbacked(folio)) {
				__lruvec_stat_mod_folio(folio,
						NR_SHMEM_THPS, -nr);
			} else {
				__lruvec_stat_mod_folio(folio,
						NR_FILE_THPS, -nr);
				filemap_nr_thps_dec(mapping);
			}
		}
	}

	__split_huge_page(page, list, end, new_order);
+	folio_lruvec_unlock(folio);
	...
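For reference, here is the lookup side that spins in step 4 above: a
minimal sketch paraphrased from mm/filemap.c:find_get_entry() of this
era (the mark and xa_is_value() handling are trimmed, so this is not a
verbatim copy):

	static void *find_get_entry(struct xa_state *xas, pgoff_t max,
			xa_mark_t mark)
	{
		struct folio *folio;

	retry:
		folio = xas_find(xas, max);
		if (xas_retry(xas, folio))
			goto retry;
		if (!folio)
			return NULL;

		/* Fails while the split holds _refcount frozen at 0. */
		if (!folio_try_get_rcu(folio))
			goto reset;

		/* Raced with removal: drop the ref and walk again. */
		if (unlikely(folio != xas_reload(xas))) {
			folio_put(folio);
			goto reset;
		}
		return folio;
	reset:
		xas_reset(xas);
		goto retry;
	}

While the refcount stays frozen between step 3's xas_split and step
4''s __split_huge_page, every trip through this loop ends in
xas_reset() and the next walk finds the same frozen folio: there is no
lock to queue on and nothing to sleep for, so the loop only terminates
once the unfreeze gets past the lruvec->lock contention.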