Re: Hang when swapping huge=within_size tmpfs from zram

Kairui Song <ryncsn@xxxxxxxxx> · Mon, 24 Feb 2025 01:53:07 +0800

On Fri, Feb 7, 2025 at 3:24 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
>
> On 2025/2/5 22:39, Lance Yang wrote:
> > On Wed, Feb 5, 2025 at 2:38 PM Baolin Wang
> > <baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
> >> On 2025/2/5 09:55, Baolin Wang wrote:
> >>> Hi Alex,
> >>>
> >>> On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
> >>>> Hi all,
> >>>>
> >>>> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
> >>>> them reliably hangs when tmpfs is mounted with huge=within_size, and it
> >>>> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
> >>>> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
> >>>>
> >>>> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
> >>>> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
> >>>> detected:
> >>>
> >>> Thanks for your report. Let me try to reproduce the issue locally and
> >>> see what happens.
> >>>
> >>>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> >>>> rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
> >>>> rcu:     (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
> >>>> task:rm              state:R  running task     stack:0     pid:25160
> >>>> tgid:25160 ppid:24309  task_flags:0x400000 flags:0x00004004
> >>>> Call Trace:
> >>>>    <TASK>
> >>>>    ? __schedule+0x388/0x1000
> >>>>    ? kmem_cache_free.part.0+0x23d/0x280
> >>>>    ? sysvec_apic_timer_interrupt+0xa/0x80
> >>>>    ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> >>>>    ? xas_load+0x12/0xc0
> >>>>    ? xas_load+0x8/0xc0
> >>>>    ? xas_find+0x144/0x190
> >>>>    ? find_lock_entries+0x75/0x260
> >>>>    ? shmem_undo_range+0xe6/0x5f0
> >>>>    ? shmem_evict_inode+0xe4/0x230
> >>>>    ? mtree_erase+0x7e/0xe0
> >>>>    ? inode_set_ctime_current+0x2e/0x1f0
> >>>>    ? evict+0xe9/0x260
> >>>>    ? _atomic_dec_and_lock+0x31/0x50
> >>>>    ? do_unlinkat+0x270/0x2b0
> >>>>    ? __x64_sys_unlinkat+0x30/0x50
> >>>>    ? do_syscall_64+0x37/0xe0
> >>>>    ? entry_SYSCALL_64_after_hwframe+0x50/0x58
> >>>>    </TASK>
> >>>>
> >>>> Let me know what information is needed to further troubleshoot this
> >>>> issue.
> >>
> >> Sorry, I can't reproduce this issue, and my testing process is as follows:
> >> 1. Mount tmpfs with huge=within_size
> >> 2. Create and write a tmpfs file
> >> 3. Swap out the large folios of the tmpfs file to zram
> >> 4. Execute 'rm' command to remove the tmpfs file
> >
> > I’m unable to reproduce the issue as well, and am following steps similar
> > to Baolin's process:
> >
> > 1) Mount tmpfs with the huge=within_size option and enable swap (using
> > zstd/zsmalloc without a backing device).
> > 2) Create and write over 10,000 files in the tmpfs.
> > 3) Swap out the large folios of these tmpfs files to zram.
> > 4) Use the rm command to delete all the files from the tmpfs.
> >
> > Testing with both 2MiB and 64KiB large folio sizes, and with
> > shmem_enabled=within_size, but everything works as expected.
>
> Thanks Lance for confirming again.
>
> Alex, could you give more hints on how to reproduce this issue?
>

Hi Baolin,

I can reproduce this issue very easily with a Linux kernel build
test, and the failure rate is very high. I'm not entirely sure this is
the same bug, but it very likely is. My test steps:

1. Create a 10G ZRAM device and set up SWAP on it.
2. Create a 1G memcg, and spawn a shell in it.
3. Mount tmpfs with huge=within_size, and then untar the Linux kernel
source code into it.
4. Build with make -j32 (a higher or lower job number may also work);
the build always fails within 10s due to file corruption.

After some debugging, the cause is in shmem_swapin_folio: when the swap
cache is hit, `folio = swap_cache_get_folio(swap, NULL, 0);` sets folio
to an order-0 folio, and the following shmem_add_to_page_cache then
inserts that order-0 folio over a high-order entry in shmem's
xarray, so data is lost. A swap cache hit can happen for many reasons;
in this case it's readahead.

One quick fix is to always split the entry when shmem faults in an
order-0 folio, like this:

diff --git a/mm/shmem.c b/mm/shmem.c
index 4ea6109a8043..c8e5c419c675 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2341,6 +2341,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                }
        }

+       /* Swapin of 0 order folio must always ensure the entries are split */
+       if (!folio_order(folio))
+               shmem_split_large_entry(inode, index, swap, gfp);
+
 alloced:
        /* We have to do this with folio locked to prevent races */
        folio_lock(folio);

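To make the failure mode easier to picture, here is a toy userspace
model. It is not kernel code and does not use the real xarray API: the
slot array, toy_split_entry, toy_insert_order0 and surviving_swap_slots
are all made-up names for illustration. The key modeling assumption is
that inserting an order-0 folio replaces the whole large entry found at
that index, so the other indices it covered lose their swap reference.

```c
#define NSLOTS 4	/* one order-2 swap entry covers 4 page indices */

enum kind { EMPTY, SWAP, FOLIO };

struct slot {
	enum kind kind;
	int order;	/* order of the entry covering this index */
};

/* First index covered by the entry that contains @index. */
static int entry_base(const struct slot *map, int index)
{
	return index & ~((1 << map[index].order) - 1);
}

/* Toy split: turn one large entry into independent order-0 entries. */
static void toy_split_entry(struct slot *map, int index)
{
	int base = entry_base(map, index);
	int n = 1 << map[index].order;

	for (int i = base; i < base + n; i++)
		map[i].order = 0;
}

/*
 * Toy insert of an order-0 folio at @index. Modeling assumption (not
 * the real xarray behavior in detail): the whole entry found at @index
 * is replaced, so every other index it covered loses its swap reference.
 */
static void toy_insert_order0(struct slot *map, int index)
{
	int base = entry_base(map, index);
	int n = 1 << map[index].order;

	for (int i = base; i < base + n; i++)
		map[i] = (struct slot){ EMPTY, 0 };
	map[index] = (struct slot){ FOLIO, 0 };
}

/* How many indices still reference swap after faulting in index 1. */
static int surviving_swap_slots(int split_first)
{
	struct slot map[NSLOTS];
	int n = 0;

	/* Start with a single order-2 swap entry covering indices 0..3. */
	for (int i = 0; i < NSLOTS; i++)
		map[i] = (struct slot){ SWAP, 2 };

	if (split_first)
		toy_split_entry(map, 1);
	toy_insert_order0(map, 1);

	for (int i = 0; i < NSLOTS; i++)
		n += (map[i].kind == SWAP);
	return n;
}
```

In this model surviving_swap_slots(0) returns 0 (the three neighbouring
pages are lost), while surviving_swap_slots(1) returns 3, which is the
effect the split in the patch above is meant to have.
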
And hi Alex, can you help confirm whether the above patch fixes your
reported bug?

If we are OK with this, I think it should be merged into 6.14. For the
long term, though, it might be a good idea to share similar logic with
(or just reuse) __filemap_add_folio for shmem: __filemap_add_folio
splits the entry on insert, and the code would be much cleaner.