Re: About the conflict between XFS inode recycle and VFS rcu-walk

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 27 May 2024 19:41:18 +1000

On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> On 16/5/24 15:08, Ian Kent wrote:
> > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > > I encountered the following calltrace:
> > > > 
> > > > [20213.578756] BUG: kernel NULL pointer dereference, address:
> > > > 0000000000000000
> > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode
> > > > [20213.578799] #PF: error_code(0x0010) - not-present page
> > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI
> > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
> > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
> > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> > > > [20213.578884] RIP: 0010:0x0
> > > > [20213.578894] Code: Bad RIP value.
> > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
> > > > 0000000000000000
> > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
> > > > 0000000000000000
> > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
> > > > ffff889b9eeae380
> > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
> > > > 0000000000000000
> > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
> > > > ffffc90021ebfd48
> > > > [20213.578998] FS:  00007f89c534e740(0000)
> > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> > > > [20213.579016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
> > > > 00000000007706e0
> > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > 0000000000000000
> > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > 0000000000000400
> > > > [20213.579079] PKRU: 55555554
> > > > [20213.579087] Call Trace:
> > > > [20213.579099]  trailing_symlink+0x1da/0x260
> > > > [20213.579112]  path_lookupat.isra.53+0x79/0x220
> > > > [20213.579125]  filename_lookup.part.69+0xa0/0x170
> > > > [20213.579138]  ? kmem_cache_alloc+0x3f/0x3f0
> > > > [20213.579151]  ? getname_flags+0x4f/0x1e0
> > > > [20213.579161]  user_path_at_empty+0x3e/0x50
> > > > [20213.579172]  vfs_statx+0x76/0xe0
> > > > [20213.579182]  __do_sys_newstat+0x3d/0x70
> > > > [20213.579194]  ? fput+0x13/0x20
> > > > [20213.579203]  ? ksys_ioctl+0xb0/0x300
> > > > [20213.579213]  ? generic_file_llseek+0x24/0x30
> > > > [20213.579225]  ? fput+0x13/0x20
> > > > [20213.579233]  ? ksys_lseek+0x8d/0xb0
> > > > [20213.579243]  __x64_sys_newstat+0x16/0x20
> > > > [20213.579256]  do_syscall_64+0x4d/0x140
> > > > [20213.579268]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> > > > 
> > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > > 
> > > Please note that the kernel version I use is the one maintained by
> > > Tencent.Inc,
> > > and the baseline is v5.4. But in fact, in the latest upstream source
> > > tree,
> > > although the trailing_symlink() function has been removed, its logic
> > > has been
> > > moved to pick_link(), so the problem still exists.
> > > 
> > > Ian Kent pointed out that try_to_unlazy() was introduced in
> > > pick_link() in the
> > > latest upstream source tree, but I don't understand why this can
> > > solve the NULL
> > > ->get_link pointer dereference problem, because ->get_link pointer
> > > will be
> > > dereferenced before try_to_unlazy().
> > > 
> > > (I don't understand why Ian Kent's email didn't appear on the
> > > mailing list.)
> > 
> > It was something about html mail and I think my mail client was at fault.
> > 
> > In any case what you say is indeed correct, so the comment isn't
> > important.
> > 
> > 
> > Fact is it is still a race between the lockless path walk and inode
> > eviction
> > 
> > and xfs recycling. I believe that the xfs recycling code is very hard to
> > fix.

Not really for this case. This is simply concurrent pathwalk lookups
occurring just after the inode has been evicted from the VFS inode
cache. The first lookup hits the XFS inode cache, sees
XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to
reinstantiate the VFS inode to an initial state. The Xfs inode
itself is still valid as it hasn't reached the XFS_IRECLAIM state
where it will be torn down and freed.

Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has
been run and that it is trying to call ->get_link on that same
inode. Unfortunately, the first lookup has just set inode->f_ops =
&empty_fops as part of the VFS inode reinit, and that then triggers
the null pointer deref.

Once the first lookup has finished the inode_init_always(),
xfs_reinit_inode() resets inode->f_ops back to 
xfs_symlink_file_ops and get_link calls work again.

Fundamentally, the problem is that we are completely reinitialising
the VFS inode within the RCU grace period. i.e. while concurrent RCU
pathwalks can still be in progress and find the VFS inode whilst the
XFS inode cache is manipulating it.

What we should be doing here is a subset of inode_init_always(),
which only reinitialises the bits of the VFS inode we need to
initialise rather than the entire inode. The identity of the inode
is not changing and so we don't need to go through a transient state
where the VFS inode goes xfs symlink -> empty initialised inode ->
xfs symlink.

i.e. We need to re-initialise the non-identity related parts of the
VFS inode so the identity parts that the RCU pathwalks rely on never
change within the RCU grace period where lookups can find the VFS
inode after it has been evicted.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx