Re: Hang when swapping huge=within_size tmpfs from zram

Hi Kairui,

On 2025/2/24 02:22, Kairui Song wrote:
On Mon, Feb 24, 2025 at 1:53 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:

On Fri, Feb 7, 2025 at 3:24 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:

On 2025/2/5 22:39, Lance Yang wrote:
On Wed, Feb 5, 2025 at 2:38 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
On 2025/2/5 09:55, Baolin Wang wrote:
Hi Alex,

On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
Hi all,

On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
them reliably hangs when tmpfs is mounted with huge=within_size, and it
is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
to acd7ccb284b "mm: shmem: add large folio support for tmpfs".

When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
output in /proc/pid/stack or wchan. Eventually, an RCU stall is
detected:

Thanks for your report. Let me try to reproduce the issue locally and
see what happens.

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
rcu:     (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
task:rm              state:R  running task     stack:0     pid:25160
tgid:25160 ppid:24309  task_flags:0x400000 flags:0x00004004
Call Trace:
    <TASK>
    ? __schedule+0x388/0x1000
    ? kmem_cache_free.part.0+0x23d/0x280
    ? sysvec_apic_timer_interrupt+0xa/0x80
    ? asm_sysvec_apic_timer_interrupt+0x16/0x20
    ? xas_load+0x12/0xc0
    ? xas_load+0x8/0xc0
    ? xas_find+0x144/0x190
    ? find_lock_entries+0x75/0x260
    ? shmem_undo_range+0xe6/0x5f0
    ? shmem_evict_inode+0xe4/0x230
    ? mtree_erase+0x7e/0xe0
    ? inode_set_ctime_current+0x2e/0x1f0
    ? evict+0xe9/0x260
    ? _atomic_dec_and_lock+0x31/0x50
    ? do_unlinkat+0x270/0x2b0
    ? __x64_sys_unlinkat+0x30/0x50
    ? do_syscall_64+0x37/0xe0
    ? entry_SYSCALL_64_after_hwframe+0x50/0x58
    </TASK>

Let me know what information is needed to further troubleshoot this
issue.

Sorry, I can't reproduce this issue. My testing process is as follows:
1. Mount tmpfs with huge=within_size
2. Create and write a tmpfs file
3. Swap out the large folios of the tmpfs file to zram
4. Execute 'rm' command to remove the tmpfs file

I'm unable to reproduce the issue either, following steps similar
to Baolin's process:

1) Mount tmpfs with the huge=within_size option and enable swap (using
zstd/zsmalloc without a backing device).
2) Create and write over 10,000 files in the tmpfs.
3) Swap out the large folios of these tmpfs files to zram.
4) Use the rm command to delete all the files from the tmpfs.

I tested with both 2MiB and 64KiB large folio sizes, and with
shmem_enabled=within_size, but everything works as expected.
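
For reference, here is a minimal shell sketch of the setup both Baolin
and I described (run as root; the zram device name, sizes, file count,
and paths are assumptions, and cgroup v2 memory.reclaim is just one
possible way to force the swap-out in step 3):

    modprobe zram
    echo zstd > /sys/block/zram0/comp_algorithm   # zstd/zsmalloc, no backing dev
    echo 10G > /sys/block/zram0/disksize
    mkswap /dev/zram0 && swapon /dev/zram0

    mkdir -p /mnt/test
    mount -t tmpfs -o huge=within_size tmpfs /mnt/test

    for i in $(seq 1 10000); do                   # create and write >10,000 files
        head -c 128K /dev/urandom > /mnt/test/f$i
    done
    # run the loop above from inside a cgroup so its pages are charged there,
    # then reclaim from that cgroup to push the folios out to zram:
    echo 2G > /sys/fs/cgroup/<cg>/memory.reclaim
    rm -f /mnt/test/f*                            # finally delete everything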

Thanks Lance for confirming again.

Alex, could you give more hints on how to reproduce this issue?


Hi Baolin,

I can reproduce this issue very easily with a Linux kernel build test,
and the failure rate is very high. I'm not exactly sure this is the
same bug, but it is very likely. My test steps:

1. Create a 10G ZRAM device and set up SWAP on it.
2. Create a 1G memcg, and spawn a shell in it.
3. Mount tmpfs with huge=within_size, and then untar linux kernel
source code into it.
4. Build with make -j32 (a higher or lower job count may also work);
the build always fails within 10s due to corrupted files.
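
In shell form, roughly (the zram device, cgroup path, and tarball name
are assumptions; cgroup v2 is assumed for the memcg setup):

    modprobe zram                                   # 1. 10G zram swap device
    echo 10G > /sys/block/zram0/disksize
    mkswap /dev/zram0 && swapon /dev/zram0

    mkdir /sys/fs/cgroup/repro                      # 2. 1G memcg, shell inside it
    echo 1G > /sys/fs/cgroup/repro/memory.max
    echo $$ > /sys/fs/cgroup/repro/cgroup.procs

    mount -t tmpfs -o huge=within_size tmpfs /mnt   # 3. tmpfs + kernel source
    tar -xf linux.tar.xz -C /mnt && cd /mnt/linux-*

    make defconfig && make -j32                     # 4. fails within ~10s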

Your reproducer is much appreciated, and now I can reproduce the issue locally.

After some debugging, the cause is in shmem_swapin_folio(): when the
swap cache is hit, `folio = swap_cache_get_folio(swap, NULL, 0);` sets
folio to an order-0 folio, and the following shmem_add_to_page_cache()
then inserts that order-0 folio over a high-order entry in shmem's
xarray, so data is lost. A swap cache hit can happen for many reasons;
in this case it's the readahead.

Yes, thanks for your analysis. I missed that swap readahead can swap in order-0 folios asynchronously.


One quick fix is to always split the entry upon a shmem fault of an
order-0 folio, like this:

diff --git a/mm/shmem.c b/mm/shmem.c
index 4ea6109a8043..c8e5c419c675 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2341,6 +2341,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 }
         }

+       /* Swap-in of an order-0 folio must always ensure the entry is split */
+       if (!folio_order(folio))
+               shmem_split_large_entry(inode, index, swap, gfp);
+
  alloced:
         /* We have to do this with folio locked to prevent races */
         folio_lock(folio);

I don't think we should always split the large entry when getting a folio from the swap cache. Instead, we should split only when the order stored in the shmem mapping is inconsistent with the folio's order, and update the swap value accordingly.

Could you help try the fix below? I tested it and it works well with your reproducer. Thanks a lot.

diff --git a/mm/shmem.c b/mm/shmem.c
index 671f63063fd4..7e70081a96d4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2253,7 +2253,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
        struct folio *folio = NULL;
        bool skip_swapcache = false;
        swp_entry_t swap;
-       int error, nr_pages;
+       int error, nr_pages, order, split_order;

        VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
        swap = radix_to_swp_entry(*foliop);
@@ -2272,10 +2272,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,

        /* Look it up and read it in.. */
        folio = swap_cache_get_folio(swap, NULL, 0);
+       order = xa_get_order(&mapping->i_pages, index);
        if (!folio) {
-               int order = xa_get_order(&mapping->i_pages, index);
                bool fallback_order0 = false;
-               int split_order;

                /* Or update major stats only when swapin succeeds?? */
                if (fault_type) {
@@ -2339,6 +2338,29 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                        error = -ENOMEM;
                        goto failed;
                }
+       } else if (order != folio_order(folio)) {
+               /*
+                * Swap readahead may swap in order 0 folios into swapcache
+                * asynchronously, while the shmem mapping can still store
+                * large swap entries. In such cases, we should split the
+                * large swap entry to prevent possible data loss.
+                */
+               split_order = shmem_split_large_entry(inode, index, swap, gfp);
+               if (split_order < 0) {
+                       error = split_order;
+                       goto failed;
+               }
+
+               /*
+                * If the large swap entry has already been split, it is
+                * necessary to recalculate the new swap entry based on
+                * the old order alignment.
+                */
+               if (split_order > 0) {
+                       pgoff_t offset = index - round_down(index, 1 << split_order);
+
+                       swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+               }
        }

 alloced:
@@ -2346,7 +2368,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
        folio_lock(folio);
        if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
            folio->swap.val != swap.val ||
-           !shmem_confirm_swap(mapping, index, swap)) {
+           !shmem_confirm_swap(mapping, index, swap) ||
+           xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
                error = -EEXIST;
                goto unlock;
        }
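
To make the recalculation concrete, a worked example with made-up
numbers (plain shell arithmetic, purely illustrative): a fault at index
0x205 that lands in a former order-9 (2MB) entry rounds down to 0x200,
so the new order-0 swap entry sits 5 slots past the start of the old
large entry's swap offset.

    index=$((0x205)); split_order=9
    aligned=$(( index & ~((1 << split_order) - 1) ))   # round_down(index, 512) = 0x200
    offset=$(( index - aligned ))                      # 5
    echo "new swap offset = old swap offset + $offset"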

And Hi Alex, could you help confirm whether the above patch fixes your reported bug?

If we are OK with this, it should be merged into 6.14 I think, but for
the long term it might be a good idea to share similar logic with (or
just reuse) __filemap_add_folio for shmem: __filemap_add_folio splits
the entry on insert, and the code would be much cleaner.

Some extra comments on the above patch: if it races with another split,
or the entry used for the swap cache lookup is wrongly aligned due to a
large entry, the shmem_add_to_page_cache() below will fail with -EEXIST
and retry. So it seems to be working well in my test.



