On Mon, Feb 24, 2025 at 1:53 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Fri, Feb 7, 2025 at 3:24 PM Baolin Wang
> <baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On 2025/2/5 22:39, Lance Yang wrote:
> > > On Wed, Feb 5, 2025 at 2:38 PM Baolin Wang
> > > <baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
> > >> On 2025/2/5 09:55, Baolin Wang wrote:
> > >>> Hi Alex,
> > >>>
> > >>> On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
> > >>>> Hi all,
> > >>>>
> > >>>> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
> > >>>> them reliably hangs when tmpfs is mounted with huge=within_size, and it
> > >>>> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
> > >>>> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
> > >>>>
> > >>>> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
> > >>>> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
> > >>>> detected:
> > >>>
> > >>> Thanks for your report. Let me try to reproduce the issue locally and
> > >>> see what happens.
> > >>>
> > >>>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > >>>> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
> > >>>> rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
> > >>>> task:rm state:R running task stack:0 pid:25160
> > >>>> tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
> > >>>> Call Trace:
> > >>>> <TASK>
> > >>>> ? __schedule+0x388/0x1000
> > >>>> ? kmem_cache_free.part.0+0x23d/0x280
> > >>>> ? sysvec_apic_timer_interrupt+0xa/0x80
> > >>>> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > >>>> ? xas_load+0x12/0xc0
> > >>>> ? xas_load+0x8/0xc0
> > >>>> ? xas_find+0x144/0x190
> > >>>> ? find_lock_entries+0x75/0x260
> > >>>> ? shmem_undo_range+0xe6/0x5f0
> > >>>> ? shmem_evict_inode+0xe4/0x230
> > >>>> ? mtree_erase+0x7e/0xe0
> > >>>> ? inode_set_ctime_current+0x2e/0x1f0
> > >>>> ? evict+0xe9/0x260
> > >>>> ? _atomic_dec_and_lock+0x31/0x50
> > >>>> ? do_unlinkat+0x270/0x2b0
> > >>>> ? __x64_sys_unlinkat+0x30/0x50
> > >>>> ? do_syscall_64+0x37/0xe0
> > >>>> ? entry_SYSCALL_64_after_hwframe+0x50/0x58
> > >>>> </TASK>
> > >>>>
> > >>>> Let me know what information is needed to further troubleshoot this
> > >>>> issue.
> > >>
> > >> Sorry, I can't reproduce this issue, and my testing process is as follows:
> > >> 1. Mount tmpfs with huge=within_size
> > >> 2. Create and write a tmpfs file
> > >> 3. Swap out the large folios of the tmpfs file to zram
> > >> 4. Execute 'rm' command to remove the tmpfs file
> > >
> > > I'm unable to reproduce the issue either, following steps similar
> > > to Baolin's process:
> > >
> > > 1) Mount tmpfs with the huge=within_size option and enable swap (using
> > > zstd/zsmalloc without a backing device).
> > > 2) Create and write over 10,000 files in the tmpfs.
> > > 3) Swap out the large folios of these tmpfs files to zram.
> > > 4) Use the rm command to delete all the files from the tmpfs.
> > >
> > > Testing with both 2MiB and 64KiB large folio sizes, and with
> > > shmem_enabled=within_size, everything works as expected.
> >
> > Thanks Lance for confirming again.
> >
> > Alex, could you give more hints on how to reproduce this issue?
> >
>
> Hi Baolin,
>
> I can reproduce this issue very easily with a Linux kernel build
> test, and the failure rate is very high. I'm not exactly sure this is
> the same bug, but it is very likely. My test steps:
>
> 1. Create a 10G ZRAM device and set up SWAP on it.
> 2. Create a 1G memcg, and spawn a shell in it.
> 3. Mount tmpfs with huge=within_size, and then untar the Linux kernel
> source code into it.
> 4. Build with make -j32 (higher or lower job numbers may also work);
> the build will always fail within 10s due to file corruption.
>
> After some debugging, the reason is in shmem_swapin_folio: when the swap
> cache is hit, `folio = swap_cache_get_folio(swap, NULL, 0);` sets folio
> to a 0-order folio, and the following shmem_add_to_page_cache will then
> insert an order-0 folio overriding a high-order entry in shmem's
> xarray, so data are lost. A swap cache hit could happen for many
> reasons; in this case it's caused by readahead.
>
> One quick fix is to just always split the entry upon a shmem fault of a
> 0-order folio, like this:
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4ea6109a8043..c8e5c419c675 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2341,6 +2341,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>                 }
>         }
>
> +       /* Swapin of 0 order folio must always ensure the entries are split */
> +       if (!folio_order(folio))
> +               shmem_split_large_entry(inode, index, swap, gfp);
> +
>  alloced:
>         /* We have to do this with folio locked to prevent races */
>         folio_lock(folio);
>
> And hi Alex, can you help confirm whether the above patch fixes your
> reported bug?
>
> If we are OK with this, it should be merged into 6.14 I think, but
> for the long term, it might be a good idea to just share similar
> logic with (or just reuse) __filemap_add_folio for shmem?
> __filemap_add_folio will split the entry on insert, and the code will be
> much cleaner.

Some extra comments for the above patch: if it races with another split,
or the entry used for the swap cache lookup is wrongly aligned due to a
large entry, the shmem_add_to_page_cache below will fail with -EEXIST
and try again. So that seems to be working well in my test.
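
For anyone who wants to check for the data loss directly, without a full
kernel build, below is a minimal userspace sketch of a write-then-verify
loop. It is only an illustration, not part of the report above: the
/mnt/tmpfs-test mount point, the file count, and the 2 MiB file size are
assumptions. It still needs tmpfs mounted with huge=within_size, swap on
zram, and memory pressure applied between the two phases to actually
exercise the swapout/swapin path.

/*
 * Hypothetical self-checking sketch: write patterned files into a tmpfs
 * mount, wait for them to be swapped out (e.g. via memcg pressure from
 * another shell), then read them back and verify the contents.
 * Assumptions: tmpfs is mounted at /mnt/tmpfs-test with huge=within_size
 * and swap is on zram; mount point, file count and size are made up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NR_FILES  64
#define FILE_SIZE (2UL << 20)	/* 2 MiB, roughly one PMD-sized folio */

static char buf[FILE_SIZE];
static char expect[FILE_SIZE];

static void path_for(int i, char *path, size_t len)
{
	snprintf(path, len, "/mnt/tmpfs-test/file-%d", i);
}

int main(void)
{
	char path[64];

	/* Write phase: fill each file with a recognisable per-file pattern. */
	for (int i = 0; i < NR_FILES; i++) {
		path_for(i, path, sizeof(path));
		int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
		if (fd < 0) {
			perror(path);
			return 1;
		}
		memset(buf, 'A' + (i % 26), sizeof(buf));
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			return 1;
		}
		close(fd);
	}

	/*
	 * The folios now need to be pushed out to swap, e.g. by memcg
	 * pressure or a memory hog run from another shell, before the
	 * verify phase below is meaningful.
	 */
	puts("write phase done; apply memory pressure, then press Enter");
	getchar();

	/* Verify phase: any mismatch means folio contents were lost. */
	for (int i = 0; i < NR_FILES; i++) {
		path_for(i, path, sizeof(path));
		int fd = open(path, O_RDONLY);
		if (fd < 0) {
			perror(path);
			return 1;
		}
		memset(expect, 'A' + (i % 26), sizeof(expect));
		if (read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf) ||
		    memcmp(buf, expect, sizeof(buf))) {
			fprintf(stderr, "corruption detected in %s\n", path);
			return 1;
		}
		close(fd);
	}
	printf("all %d files verified OK\n", NR_FILES);
	return 0;
}

Compile with something like `cc -O2 -o shmem-verify shmem-verify.c` (the
file name is just an example), run it inside the small memcg, and swap the
files out between the two phases; a mismatch reported in the verify phase
would correspond to the overridden high-order entry case described above.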