On Mon, Feb 24, 2025 at 1:53 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Fri, Feb 7, 2025 at 3:24 PM Baolin Wang
> <baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On 2025/2/5 22:39, Lance Yang wrote:
> > > On Wed, Feb 5, 2025 at 2:38 PM Baolin Wang
> > > <baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
> > >> On 2025/2/5 09:55, Baolin Wang wrote:
> > >>> Hi Alex,
> > >>>
> > >>> On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
> > >>>> Hi all,
> > >>>>
> > >>>> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
> > >>>> them reliably hangs when tmpfs is mounted with huge=within_size, and it
> > >>>> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
> > >>>> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
> > >>>>
> > >>>> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
> > >>>> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
> > >>>> detected:
> > >>>
> > >>> Thanks for your report. Let me try to reproduce the issue locally and
> > >>> see what happens.
> > >>>
> > >>>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > >>>> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
> > >>>> rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
> > >>>> task:rm state:R running task stack:0 pid:25160
> > >>>> tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
> > >>>> Call Trace:
> > >>>> <TASK>
> > >>>> ? __schedule+0x388/0x1000
> > >>>> ? kmem_cache_free.part.0+0x23d/0x280
> > >>>> ? sysvec_apic_timer_interrupt+0xa/0x80
> > >>>> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > >>>> ? xas_load+0x12/0xc0
> > >>>> ? xas_load+0x8/0xc0
> > >>>> ? xas_find+0x144/0x190
> > >>>> ? find_lock_entries+0x75/0x260
> > >>>> ? shmem_undo_range+0xe6/0x5f0
> > >>>> ? shmem_evict_inode+0xe4/0x230
> > >>>> ? mtree_erase+0x7e/0xe0
> > >>>> ? inode_set_ctime_current+0x2e/0x1f0
> > >>>> ? evict+0xe9/0x260
> > >>>> ? _atomic_dec_and_lock+0x31/0x50
> > >>>> ? do_unlinkat+0x270/0x2b0
> > >>>> ? __x64_sys_unlinkat+0x30/0x50
> > >>>> ? do_syscall_64+0x37/0xe0
> > >>>> ? entry_SYSCALL_64_after_hwframe+0x50/0x58
> > >>>> </TASK>
> > >>>>
> > >>>> Let me know what information is needed to further troubleshoot this
> > >>>> issue.
> > >>
> > >> Sorry, I can't reproduce this issue, and my testing process is as follows:
> > >> 1. Mount tmpfs with huge=within_size
> > >> 2. Create and write a tmpfs file
> > >> 3. Swap out the large folios of the tmpfs file to zram
> > >> 4. Execute 'rm' command to remove the tmpfs file
> > >
> > > I'm unable to reproduce the issue either, following steps similar
> > > to Baolin's process:
> > >
> > > 1) Mount tmpfs with the huge=within_size option and enable swap (using
> > > zstd/zsmalloc without a backing device).
> > > 2) Create and write over 10,000 files in the tmpfs.
> > > 3) Swap out the large folios of these tmpfs files to zram.
> > > 4) Use the rm command to delete all the files from the tmpfs.
> > >
> > > Testing with both 2MiB and 64KiB large folio sizes, and with
> > > shmem_enabled=within_size, everything works as expected.
> >
> > Thanks Lance for confirming again.
> >
> > Alex, could you give more hints on how to reproduce this issue?
> >
>
> Hi Baolin,
>
> I can reproduce this issue very easily with a Linux kernel build
> test, and the failure rate is very high. I'm not exactly sure this is
> the same bug, but it is very likely. My test steps:
>
> 1. Create a 10G ZRAM device and set up SWAP on it.
> 2. Create a 1G memcg, and spawn a shell in it.
> 3. Mount tmpfs with huge=within_size, and then untar the Linux kernel
> source code into it.
> 4. Build with make -j32 (higher or lower job numbers may also work);
> the build will always fail within 10s due to file corruption.
>
> After some debugging, the reason is in shmem_swapin_folio: when the swap
> cache is hit, `folio = swap_cache_get_folio(swap, NULL, 0);` sets folio
> to a 0-order folio, and the following shmem_add_to_page_cache will then
> insert an order-0 folio overriding a high-order entry in shmem's
> xarray, so data are lost. A swap cache hit could happen for many
> reasons; in this case it's caused by readahead.
>
> One quick fix is to just always split the entry upon a shmem fault of a
> 0-order folio, like this:
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4ea6109a8043..c8e5c419c675 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2341,6 +2341,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>                 }
>         }
>
> +       /* Swapin of 0 order folio must always ensure the entries are split */
> +       if (!folio_order(folio))
> +               shmem_split_large_entry(inode, index, swap, gfp);
> +
>  alloced:
>         /* We have to do this with folio locked to prevent races */
>         folio_lock(folio);
>
> And hi Alex, can you help confirm whether the above patch fixes your
> reported bug?
>
> If we are OK with this, it should be merged into 6.14 I think, but
> for the long term, it might be a good idea to just share similar
> logic with (or just reuse) __filemap_add_folio for shmem?
> __filemap_add_folio will split the entry on insert, and the code will be
> much cleaner.

Some extra comments for the above patch: if it races with another split,
or the entry used for the swap cache lookup is wrongly aligned due to a
large entry, the shmem_add_to_page_cache below will fail with -EEXIST
and try again. So that seems to be working well in my test.
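
For anyone who wants to check for the data loss directly, without a full
kernel build, below is a minimal userspace sketch of a write-then-verify
loop. It is only an illustration, not part of the report above: the
/mnt/tmpfs-test mount point, the file count, and the 2 MiB file size are
assumptions. It still needs tmpfs mounted with huge=within_size, swap on
zram, and memory pressure applied between the two phases to actually
exercise the swapout/swapin path.

/*
 * Hypothetical self-checking sketch: write patterned files into a tmpfs
 * mount, wait for them to be swapped out (e.g. via memcg pressure from
 * another shell), then read them back and verify the contents.
 * Assumptions: tmpfs is mounted at /mnt/tmpfs-test with huge=within_size
 * and swap is on zram; mount point, file count and size are made up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NR_FILES  64
#define FILE_SIZE (2UL << 20)	/* 2 MiB, roughly one PMD-sized folio */

static char buf[FILE_SIZE];
static char expect[FILE_SIZE];

static void path_for(int i, char *path, size_t len)
{
	snprintf(path, len, "/mnt/tmpfs-test/file-%d", i);
}

int main(void)
{
	char path[64];

	/* Write phase: fill each file with a recognisable per-file pattern. */
	for (int i = 0; i < NR_FILES; i++) {
		path_for(i, path, sizeof(path));
		int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
		if (fd < 0) {
			perror(path);
			return 1;
		}
		memset(buf, 'A' + (i % 26), sizeof(buf));
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			return 1;
		}
		close(fd);
	}

	/*
	 * The folios now need to be pushed out to swap, e.g. by memcg
	 * pressure or a memory hog run from another shell, before the
	 * verify phase below is meaningful.
	 */
	puts("write phase done; apply memory pressure, then press Enter");
	getchar();

	/* Verify phase: any mismatch means folio contents were lost. */
	for (int i = 0; i < NR_FILES; i++) {
		path_for(i, path, sizeof(path));
		int fd = open(path, O_RDONLY);
		if (fd < 0) {
			perror(path);
			return 1;
		}
		memset(expect, 'A' + (i % 26), sizeof(expect));
		if (read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf) ||
		    memcmp(buf, expect, sizeof(buf))) {
			fprintf(stderr, "corruption detected in %s\n", path);
			return 1;
		}
		close(fd);
	}
	printf("all %d files verified OK\n", NR_FILES);
	return 0;
}

Compile with something like `cc -O2 -o shmem-verify shmem-verify.c` (the
file name is just an example), run it inside the small memcg, and swap the
files out between the two phases; a mismatch reported in the verify phase
would correspond to the overridden high-order entry case described above.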