On Mon, 27 May 2024 at 19:41:18 +1000, Dave Chinner wrote: > On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: > > On 16/5/24 15:08, Ian Kent wrote: > > > On 16/5/24 12:56, Jinliang Zheng wrote: > > > > > I encountered the following calltrace: > > > > > > > > > > [20213.578756] BUG: kernel NULL pointer dereference, address: > > > > > 0000000000000000 > > > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode > > > > > [20213.578799] #PF: error_code(0x0010) - not-present page > > > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 > > > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI > > > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: > > > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 > > > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd. > > > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 > > > > > [20213.578884] RIP: 0010:0x0 > > > > > [20213.578894] Code: Bad RIP value. > > > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 > > > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: > > > > > 0000000000000000 > > > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: > > > > > 0000000000000000 > > > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: > > > > > ffff889b9eeae380 > > > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: > > > > > 0000000000000000 > > > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: > > > > > ffffc90021ebfd48 > > > > > [20213.578998] FS: 00007f89c534e740(0000) > > > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000 > > > > > [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: > > > > > 00000000007706e0 > > > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > > 0000000000000000 > > > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > > 0000000000000400 > > > > > [20213.579079] PKRU: 55555554 > > > > > [20213.579087] Call Trace: > > > > > [20213.579099] trailing_symlink+0x1da/0x260 > > > > > [20213.579112] path_lookupat.isra.53+0x79/0x220 > > > > > [20213.579125] filename_lookup.part.69+0xa0/0x170 > > > > > [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 > > > > > [20213.579151] ? getname_flags+0x4f/0x1e0 > > > > > [20213.579161] user_path_at_empty+0x3e/0x50 > > > > > [20213.579172] vfs_statx+0x76/0xe0 > > > > > [20213.579182] __do_sys_newstat+0x3d/0x70 > > > > > [20213.579194] ? fput+0x13/0x20 > > > > > [20213.579203] ? ksys_ioctl+0xb0/0x300 > > > > > [20213.579213] ? generic_file_llseek+0x24/0x30 > > > > > [20213.579225] ? fput+0x13/0x20 > > > > > [20213.579233] ? ksys_lseek+0x8d/0xb0 > > > > > [20213.579243] __x64_sys_newstat+0x16/0x20 > > > > > [20213.579256] do_syscall_64+0x4d/0x140 > > > > > [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 > > > > > > > > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > > > > > > > > Please note that the kernel version I use is the one maintained by > > > > Tencent.Inc, > > > > and the baseline is v5.4. But in fact, in the latest upstream source > > > > tree, > > > > although the trailing_symlink() function has been removed, its logic > > > > has been > > > > moved to pick_link(), so the problem still exists. > > > > > > > > Ian Kent pointed out that try_to_unlazy() was introduced in > > > > pick_link() in the > > > > latest upstream source tree, but I don't understand why this can > > > > solve the NULL > > > > ->get_link pointer dereference problem, because ->get_link pointer > > > > will be > > > > dereferenced before try_to_unlazy(). > > > > > > > > (I don't understand why Ian Kent's email didn't appear on the > > > > mailing list.) > > > > > > It was something about html mail and I think my mail client was at fault. > > > > > > In any case what you say is indeed correct, so the comment isn't > > > important. > > > > > > > > > Fact is it is still a race between the lockless path walk and inode > > > eviction > > > > > > and xfs recycling. I believe that the xfs recycling code is very hard to > > > fix. > > Not really for this case. This is simply concurrent pathwalk lookups > occurring just after the inode has been evicted from the VFS inode > cache. The first lookup hits the XFS inode cache, sees > XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to > reinstantiate the VFS inode to an initial state. The Xfs inode > itself is still valid as it hasn't reached the XFS_IRECLAIM state > where it will be torn down and freed. > > Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has > been run and that it is trying to call ->get_link on that same > inode. Unfortunately, the first lookup has just set inode->f_ops = > &empty_fops as part of the VFS inode reinit, and that then triggers > the null pointer deref. The RCU pathwalk must occur before xfs_reinit_inode(), and must obtain the target inode before xfs_reinit_inode(). Because the target inode of xfs_reinit_inode() must NOT be associated with any dentry, which is necessary conditions for iput() -> iput_final() -> evict(), and the RCU pathwalk cannot obtain any inode without a dentry. > > Once the first lookup has finished the inode_init_always(), > xfs_reinit_inode() resets inode->f_ops back to > xfs_symlink_file_ops and get_link calls work again. > > Fundamentally, the problem is that we are completely reinitialising > the VFS inode within the RCU grace period. i.e. while concurrent RCU > pathwalks can still be in progress and find the VFS inode whilst the > XFS inode cache is manipulating it. > > What we should be doing here is a subset of inode_init_always(), > which only reinitialises the bits of the VFS inode we need to > initialise rather than the entire inode. The identity of the inode > is not changing and so we don't need to go through a transient state > where the VFS inode goes xfs symlink -> empty initialised inode -> > xfs symlink. Sorry, I think this question is wrong in more ways than just inode_operations. After the target inode has been reinited by xfs_reinit_inode(), its semantics is no longer the inode required by RCU walkpath. The meanings of many fields have changed, such as mode, i_mtime, i_atime, i_ctime and so on. > > i.e. We need to re-initialise the non-identity related parts of the > VFS inode so the identity parts that the RCU pathwalks rely on never > change within the RCU grace period where lookups can find the VFS > inode after it has been evicted. > > -Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx