Re: About the conflict between XFS inode recycle and VFS rcu-walk

Jinliang Zheng <alexjlzheng@xxxxxxxxx> · Mon, 27 May 2024 21:56:15 +0800

On Mon, 27 May 2024 at 19:41:18 +1000, Dave Chinner wrote:
> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> > On 16/5/24 15:08, Ian Kent wrote:
> > > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > > > I encountered the following calltrace:
> > > > > 
> > > > > [20213.578756] BUG: kernel NULL pointer dereference, address:
> > > > > 0000000000000000
> > > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode
> > > > > [20213.578799] #PF: error_code(0x0010) - not-present page
> > > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> > > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI
> > > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
> > > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> > > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
> > > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> > > > > [20213.578884] RIP: 0010:0x0
> > > > > [20213.578894] Code: Bad RIP value.
> > > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> > > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
> > > > > 0000000000000000
> > > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
> > > > > 0000000000000000
> > > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
> > > > > ffff889b9eeae380
> > > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
> > > > > 0000000000000000
> > > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
> > > > > ffffc90021ebfd48
> > > > > [20213.578998] FS:  00007f89c534e740(0000)
> > > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> > > > > [20213.579016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
> > > > > 00000000007706e0
> > > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > > 0000000000000000
> > > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > > 0000000000000400
> > > > > [20213.579079] PKRU: 55555554
> > > > > [20213.579087] Call Trace:
> > > > > [20213.579099]  trailing_symlink+0x1da/0x260
> > > > > [20213.579112]  path_lookupat.isra.53+0x79/0x220
> > > > > [20213.579125]  filename_lookup.part.69+0xa0/0x170
> > > > > [20213.579138]  ? kmem_cache_alloc+0x3f/0x3f0
> > > > > [20213.579151]  ? getname_flags+0x4f/0x1e0
> > > > > [20213.579161]  user_path_at_empty+0x3e/0x50
> > > > > [20213.579172]  vfs_statx+0x76/0xe0
> > > > > [20213.579182]  __do_sys_newstat+0x3d/0x70
> > > > > [20213.579194]  ? fput+0x13/0x20
> > > > > [20213.579203]  ? ksys_ioctl+0xb0/0x300
> > > > > [20213.579213]  ? generic_file_llseek+0x24/0x30
> > > > > [20213.579225]  ? fput+0x13/0x20
> > > > > [20213.579233]  ? ksys_lseek+0x8d/0xb0
> > > > > [20213.579243]  __x64_sys_newstat+0x16/0x20
> > > > > [20213.579256]  do_syscall_64+0x4d/0x140
> > > > > [20213.579268]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> > > > > 
> > > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > > > 
> > > > Please note that the kernel version I use is the one maintained by
> > > > Tencent.Inc,
> > > > and the baseline is v5.4. But in fact, in the latest upstream source
> > > > tree,
> > > > although the trailing_symlink() function has been removed, its logic
> > > > has been
> > > > moved to pick_link(), so the problem still exists.
> > > > 
> > > > Ian Kent pointed out that try_to_unlazy() was introduced in
> > > > pick_link() in the
> > > > latest upstream source tree, but I don't understand why this can
> > > > solve the NULL
> > > > ->get_link pointer dereference problem, because ->get_link pointer
> > > > will be
> > > > dereferenced before try_to_unlazy().
> > > > 
> > > > (I don't understand why Ian Kent's email didn't appear on the
> > > > mailing list.)
> > > 
> > > It was something about html mail and I think my mail client was at fault.
> > > 
> > > In any case what you say is indeed correct, so the comment isn't
> > > important.
> > > 
> > > 
> > > Fact is it is still a race between the lockless path walk and inode
> > > eviction
> > > 
> > > and xfs recycling. I believe that the xfs recycling code is very hard to
> > > fix.
> 
> Not really for this case. This is simply concurrent pathwalk lookups
> occurring just after the inode has been evicted from the VFS inode
> cache. The first lookup hits the XFS inode cache, sees
> XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to
> reinstantiate the VFS inode to an initial state. The Xfs inode
> itself is still valid as it hasn't reached the XFS_IRECLAIM state
> where it will be torn down and freed.
> 
> Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has
> been run and that it is trying to call ->get_link on that same
> inode. Unfortunately, the first lookup has just set inode->f_ops =
> &empty_fops as part of the VFS inode reinit, and that then triggers
> the null pointer deref.

The RCU pathwalk must occur before xfs_reinit_inode(), and must obtain the
target inode before xfs_reinit_inode(). Because the target inode of
xfs_reinit_inode() must NOT be associated with any dentry, which is necessary
conditions for iput() -> iput_final() -> evict(), and the RCU pathwalk cannot
obtain any inode without a dentry.

> 
> Once the first lookup has finished the inode_init_always(),
> xfs_reinit_inode() resets inode->f_ops back to 
> xfs_symlink_file_ops and get_link calls work again.
> 
> Fundamentally, the problem is that we are completely reinitialising
> the VFS inode within the RCU grace period. i.e. while concurrent RCU
> pathwalks can still be in progress and find the VFS inode whilst the
> XFS inode cache is manipulating it.
> 
> What we should be doing here is a subset of inode_init_always(),
> which only reinitialises the bits of the VFS inode we need to
> initialise rather than the entire inode. The identity of the inode
> is not changing and so we don't need to go through a transient state
> where the VFS inode goes xfs symlink -> empty initialised inode ->
> xfs symlink.

Sorry, I think this question is wrong in more ways than just inode_operations.
After the target inode has been reinited by xfs_reinit_inode(), its semantics
is no longer the inode required by RCU walkpath. The meanings of many fields
have changed, such as mode, i_mtime, i_atime, i_ctime and so on.

> 
> i.e. We need to re-initialise the non-identity related parts of the
> VFS inode so the identity parts that the RCU pathwalks rely on never
> change within the RCU grace period where lookups can find the VFS
> inode after it has been evicted.
> 
> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx