On 9/17/24 7:25 AM, Matthew Wilcox wrote:
> On Tue, Sep 17, 2024 at 01:13:05PM +0200, Chris Mason wrote:
>> On 9/17/24 5:32 AM, Matthew Wilcox wrote:
>>> On Mon, Sep 16, 2024 at 10:47:10AM +0200, Chris Mason wrote:
>>>> I've got a bunch of assertions around incorrect folio->mapping and I'm
>>>> trying to bash on the ENOMEM for readahead case. There's a GFP_NOWARN
>>>> on those, and our systems do run pretty short on ram, so it feels right
>>>> at least. We'll see.
>>>
>>> I've been running with some variant of this patch the whole way across
>>> the Atlantic, and not hit any problems. But maybe with the right
>>> workload ...?
>>>
>>> There are two things being tested here. One is whether we have a
>>> cross-linked node (ie a node that's in two trees at the same time).
>>> The other is whether the slab allocator is giving us a node that already
>>> contains non-NULL entries.
>>>
>>> If you could throw this on top of your kernel, we might stand a chance
>>> of catching the problem sooner. If it is one of these problems and not
>>> something weirder.
>>>
>>
>> This fires in roughly 10 seconds for me on top of v6.11. Since array seems
>> to always be 1, I'm not sure if the assertion is right, but hopefully you
>> can trigger yourself.
>
> Whoops.
>
> $ git grep XA_RCU_FREE
> lib/xarray.c:#define XA_RCU_FREE ((struct xarray *)1)
> lib/xarray.c: node->array = XA_RCU_FREE;
>
> so you walked into a node which is currently being freed by RCU. Which
> isn't a problem, of course. I don't know why I do that; it doesn't seem
> like anyone tests it. The jetlag is seriously kicking in right now,
> so I'm going to refrain from saying anything more because it probably
> won't be coherent.

Based on a modified reproducer from Chris (N threads reading from a file,
M threads dropping pages), I can pretty quickly reproduce the xas_descend()
spin on 6.9 in a vm with 128 cpus.
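The pattern is roughly the following (a minimal sketch of the idea only, not
the actual reproducer; file name, file size, and thread counts are all made
up): a bunch of threads pread() random offsets from one file while a few
others keep kicking its pages out with POSIX_FADV_DONTNEED, which is the
generic_fadvise() -> mapping_try_invalidate() path showing up in the traces
further down.

/*
 * Minimal sketch only: NR_READERS threads faulting pages in via pread(),
 * NR_DROPPERS threads dropping them again via POSIX_FADV_DONTNEED.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_READERS      8                       /* "N", arbitrary */
#define NR_DROPPERS     4                       /* "M", arbitrary */
#define FILE_SIZE       (256UL * 1024 * 1024)   /* assumed test file size */

static int fd;

static void *reader(void *arg)
{
        char buf[4096];

        for (;;) {
                off_t off = (rand() % (FILE_SIZE / 4096)) * 4096;

                /* Pull pages (back) into the page cache */
                if (pread(fd, buf, sizeof(buf), off) < 0)
                        perror("pread");
        }
        return NULL;
}

static void *dropper(void *arg)
{
        for (;;) {
                /* posix_fadvise() returns the error number directly */
                int ret = posix_fadvise(fd, 0, FILE_SIZE, POSIX_FADV_DONTNEED);

                if (ret)
                        fprintf(stderr, "posix_fadvise: %d\n", ret);
        }
        return NULL;
}

int main(int argc, char *argv[])
{
        pthread_t t;
        int i;

        fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        for (i = 0; i < NR_READERS; i++)
                pthread_create(&t, NULL, reader, NULL);
        for (i = 0; i < NR_DROPPERS; i++)
                pthread_create(&t, NULL, dropper, NULL);

        pause();
        return 0;
}

Build with -pthread and point it at any reasonably sized file.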
Here's some debugging output, using a modified version of your patch that
ignores XA_RCU_FREE:

node ffff8e838a01f788 max 59 parent 0000000000000000 shift 0 count 0
values 0 array ffff8e839dfa86a0 list ffff8e838a01f7a0 ffff8e838a01f7a0
marks 0 0 0
WARNING: CPU: 106 PID: 1554 at lib/xarray.c:405 xas_alloc.cold+0x26/0x4b

which is:

XA_NODE_BUG_ON(node, memchr_inv(&node->slots, 0, sizeof(void *) * XA_CHUNK_SIZE));

and:

node ffff8e838a01f788 offset 59 parent ffff8e838b0419c8 shift 0 count 252
values 0 array ffff8e839dfa86a0 list ffff8e838a01f7a0 ffff8e838a01f7a0
marks 0 0 0

which is:

XA_NODE_BUG_ON(node, node->count > XA_CHUNK_SIZE);

and for this particular run, 2 threads spinning:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P1555
rcu: Tasks blocked on level-1 rcu_node (CPUs 64-79): P1556
rcu: (detected by 97, t=2102 jiffies, g=7821, q=293800 ncpus=128)
task:reader state:R running task stack:0 pid:1555 tgid:1551 ppid:1 flags:0x00004006
Call Trace:
 <TASK>
 ? __schedule+0x37f/0xaa0
 ? sysvec_apic_timer_interrupt+0x96/0xb0
 ? asm_sysvec_apic_timer_interrupt+0x16/0x20
 ? xas_load+0x74/0xe0
 ? xas_load+0x10/0xe0
 ? xas_find+0x162/0x1b0
 ? find_lock_entries+0x1ac/0x360
 ? find_lock_entries+0x76/0x360
 ? mapping_try_invalidate+0x5d/0x130
 ? generic_fadvise+0x110/0x240
 ? xfd_validate_state+0x1e/0x70
 ? ksys_fadvise64_64+0x50/0x90
 ? __x64_sys_fadvise64+0x18/0x20
 ? do_syscall_64+0x5d/0x180
 ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>
task:reader state:R running task stack:0 pid:1556 tgid:1551 ppid:1 flags:0x00004006

The reproducer takes ~30 seconds, and will lead to anywhere from 1..N
threads spinning here.

Now for the kicker - this doesn't reproduce in 6.10 and onwards. There are
seemingly only a few relevant changes here, and the prime candidates are:

commit a4864671ca0bf51c8e78242951741df52c06766f
Author: Kairui Song <kasong@xxxxxxxxxxx>
Date:   Tue Apr 16 01:18:55 2024 +0800

    lib/xarray: introduce a new helper xas_get_order

and the followup filemap change:

commit 6758c1128ceb45d1a35298912b974eb4895b7dd9
Author: Kairui Song <kasong@xxxxxxxxxxx>
Date:   Tue Apr 16 01:18:56 2024 +0800

    mm/filemap: optimize filemap folio adding

Reverting those two on 6.10 hits it again almost immediately. I didn't look
into these commits, but it looks like they inadvertently also fixed this
corruption issue.

-- 
Jens Axboe