On Sun, May 26, 2024 at 11:04:14PM +0800, Jinliang Zheng wrote: > On Tue, 21 May 2024 at 10:13:38 +0800, Ian Kent wrote: > > On 21/5/24 09:35, Ian Kent wrote: > > > On 21/5/24 01:36, Darrick J. Wong wrote: > > >> On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote: > > >>> On 16/5/24 15:08, Ian Kent wrote: > > >>>> On 16/5/24 12:56, Jinliang Zheng wrote: > > >>>>> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote: > > >>>>>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@xxxxxxxxxx wrote: > > >>>>>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote: > > >>>>>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@xxxxxxxxxxxxx wrote: > > >>>>>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, > > >>>>>>>>> alexjlzheng@xxxxxxxxx wrote: > > >>>>>>>>>> Hi, all > > >>>>>>>>>> > > >>>>>>>>>> I would like to ask if the conflict between xfs > > >>>>>>>>>> inode recycle and vfs rcu-walk > > >>>>>>>>>> which can lead to null pointer references has been resolved? > > >>>>>>>>>> > > >>>>>>>>>> I browsed through emails about the following > > >>>>>>>>>> patches and their discussions: > > >>>>>>>>>> - > > >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@xxxxxxxxxx/ > > >>>>>>>>>> - > > >>>>>>>>>> https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@xxxxxxxxxx/ > > >>>>>>>>>> - > > >>>>>>>>>> https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@xxxxxxxxxxxxxxxxx/ > > >>>>>>>>>> > > >>>>>>>>>> And then came to the conclusion that this > > >>>>>>>>>> problem has not been solved, am I > > >>>>>>>>>> right? Did I miss some patch that could solve this problem? > > >>>>>>>>> We fixed the known problems this caused by turning off the VFS > > >>>>>>>>> functionality that the rcu pathwalks kept tripping over. See > > >>>>>>>>> commit > > >>>>>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink > > >>>>>>>>> metadata buffers to > > >>>>>>>>> the vfs"). > > >>>>>>>> Sorry for the delay. > > >>>>>>>> > > >>>>>>>> The problem I encountered in the production environment > > >>>>>>>> was that during the > > >>>>>>>> rcu walk process the ->get_link() pointer was NULL, > > >>>>>>>> which caused a crash. > > >>>>>>>> > > >>>>>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't > > >>>>>>>> expose internal symlink > > >>>>>>>> metadata buffers to the vfs") first appeared in: > > >>>>>>>> - > > >>>>>>>> https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/ > > >>>>>>>> > > >>>>>>>> Does this commit solve the problem of NULL ->get_link()? And how? > > >>>>>>> I suggest reading the call stack from wherever the VFS enters > > >>>>>>> the XFS > > >>>>>>> readlink code. If you have a reliable reproducer, then > > >>>>>>> apply this patch > > >>>>>>> to your kernel (you haven't mentioned which one it is) and see > > >>>>>>> if the > > >>>>>>> bad dereference goes away. > > >>>>>>> > > >>>>>>> --D > > >>>>>> Sorry for the delay. > > >>>>>> > > >>>>>> I encountered the following calltrace: > > >>>>>> > > >>>>>> [20213.578756] BUG: kernel NULL pointer dereference, address: > > >>>>>> 0000000000000000 > > >>>>>> [20213.578785] #PF: supervisor instruction fetch in kernel mode > > >>>>>> [20213.578799] #PF: error_code(0x0010) - not-present page > > >>>>>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0 > > >>>>>> [20213.578828] Oops: 0010 [#1] SMP NOPTI > > >>>>>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: > > >>>>>> loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1 > > >>>>>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. > > >>>>>> UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020 > > >>>>>> [20213.578884] RIP: 0010:0x0 > > >>>>>> [20213.578894] Code: Bad RIP value. > > >>>>>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246 > > >>>>>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: > > >>>>>> 0000000000000000 > > >>>>>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: > > >>>>>> 0000000000000000 > > >>>>>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: > > >>>>>> ffff889b9eeae380 > > >>>>>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: > > >>>>>> 0000000000000000 > > >>>>>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: > > >>>>>> ffffc90021ebfd48 > > >>>>>> [20213.578998] FS: 00007f89c534e740(0000) > > >>>>>> GS:ffff88c07fd00000(0000) knlGS:0000000000000000 > > >>>>>> [20213.579016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > >>>>>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: > > >>>>>> 00000000007706e0 > > >>>>>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > >>>>>> 0000000000000000 > > >>>>>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > >>>>>> 0000000000000400 > > >>>>>> [20213.579079] PKRU: 55555554 > > >>>>>> [20213.579087] Call Trace: > > >>>>>> [20213.579099] trailing_symlink+0x1da/0x260 > > >>>>>> [20213.579112] path_lookupat.isra.53+0x79/0x220 > > >>>>>> [20213.579125] filename_lookup.part.69+0xa0/0x170 > > >>>>>> [20213.579138] ? kmem_cache_alloc+0x3f/0x3f0 > > >>>>>> [20213.579151] ? getname_flags+0x4f/0x1e0 > > >>>>>> [20213.579161] user_path_at_empty+0x3e/0x50 > > >>>>>> [20213.579172] vfs_statx+0x76/0xe0 > > >>>>>> [20213.579182] __do_sys_newstat+0x3d/0x70 > > >>>>>> [20213.579194] ? fput+0x13/0x20 > > >>>>>> [20213.579203] ? ksys_ioctl+0xb0/0x300 > > >>>>>> [20213.579213] ? generic_file_llseek+0x24/0x30 > > >>>>>> [20213.579225] ? fput+0x13/0x20 > > >>>>>> [20213.579233] ? ksys_lseek+0x8d/0xb0 > > >>>>>> [20213.579243] __x64_sys_newstat+0x16/0x20 > > >>>>>> [20213.579256] do_syscall_64+0x4d/0x140 > > >>>>>> [20213.579268] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 > > >>>>>> > > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > >>>>>> > > >>>>>> > > >>>>> Please note that the kernel version I use is the one maintained by > > >>>>> Tencent.Inc, > > >>>>> and the baseline is v5.4. But in fact, in the latest upstream source > > >>>>> tree, > > >>>>> although the trailing_symlink() function has been removed, its logic > > >>>>> has been > > >>>>> moved to pick_link(), so the problem still exists. > > >>>>> > > >>>>> Ian Kent pointed out that try_to_unlazy() was introduced in > > >>>>> pick_link() in the > > >>>>> latest upstream source tree, but I don't understand why this can > > >>>>> solve the NULL > > >>>>> ->get_link pointer dereference problem, because ->get_link pointer > > >>>>> will be > > >>>>> dereferenced before try_to_unlazy(). > > >>>>> > > >>>>> (I don't understand why Ian Kent's email didn't appear on the > > >>>>> mailing list.) > > >>>> It was something about html mail and I think my mail client was at > > >>>> fault. > > >>>> > > >>>> In any case what you say is indeed correct, so the comment isn't > > >>>> important. > > >>>> > > >>>> > > >>>> Fact is it is still a race between the lockless path walk and inode > > >>>> eviction > > >>>> > > >>>> and xfs recycling. I believe that the xfs recycling code is very > > >>>> hard to > > >>>> fix. > > >>>> > > >>>> > > >>>> IIRC correctly putting a NULL check in pick_link() was not considered > > >>>> acceptable > > >>>> > > >>>> but there must be a way that is acceptable to check this and > > >>>> restart the > > >>>> walk. > > >>>> > > >>>> Maybe there was a reluctance to suffer the overhead of restarting the > > >>>> walk when > > >>>> > > >>>> it shouldn't be needed. > > >>> Or perhaps the worry was that if it can become NULL it could also > > >>> become a > > >>> pointer to a > > >>> > > >>> different (incorrect) link altogether which could have really > > >>> odd/unpleasant > > >>> outcomes. > > >> Yuck. I think that means that we can't reallocate freed inodes until > > >> the rcu grace period expires. For inodes that haven't been evicted, I > > >> think that also means we cannot recycle cached inodes until after an rcu > > >> grace period expires; or maybe that we cannot reset i_op/i_fop and must > > >> not leave the incore state in an inconsistent format? > > > > > > Yeah, not pretty! > > > > > > But shouldn't this case occur only occasionally? > > > > > > > > > So issuing a cache miss shouldn't impact performance too much that was, > > > > > > I believe, the concern with waiting for the rcu grace period. > > > > > > > > > Identifying it's happening should be possible, the vfs legitimize_*() > > > > > > has this job for various objects but maybe it's using vfs private info. > > > > > > (certainly it uses nameidata struct with a seq lock sequence number in > > > > > > it) but I assume it can be done somehow. > > > > Unfortunately, when you start trying to work out how to do this, it > > isn't at all > > > > obvious how to do it ... > > How about adding a synchronize_rcu() in front of xfs_reinit_inode()? > > Maybe this will affect performance, but compared to crashing the kernel, this > performance penalty is completely worth it. There is always synchronize_rcu_expedited(), especially if this is a relatively rare operation. The typical synchronize_rcu() delay is tens of milliseconds, while the typical synchronize_rcu_expedited() delay is tens to hundreds of microseconds. The downside of synchronize_rcu_expedited() is higher per-RCU-update CPU utilization. Plus added IPIs. > And, perhaps we can gradually take some optimization measures, such as: > https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@xxxxxxxxxx/ I cannot claim to fully understand this code, but I do agree that the use of things like poll_state_synchronize_rcu() and cond_synchronize_rcu() could greatly reduce the number of grace-period waits. In recent kernels, there is cond_synchronize_rcu_expedited() as well. Thanx, Paul > Best Regards, > Jinliang Zheng > > > > > > > > > > > > > > My question then becomes is it viable/straight forward to not recycle > > > such > > > > > > an inode and discard it instead so it gets re-created, I guess it's > > > essentially > > > > > > a cache miss? > > > > > > > > > Ian > > > > > >> > > >> --D > > >> > > >>>> > > >>>> The alternative would be to find some way to identify when it's unsafe > > >>>> to reuse > > >>>> > > >>>> an inode marked for re-cycle before dropping rcu read, perhaps with > > >>>> the > > >>>> reference > > >>>> > > >>>> count plus the seqlock. Basically, to reuse inodes xfs will need to > > >>>> identify when > > >>>> > > >>>> the race occurs and let the inode go away under rcu and create a > > >>>> new one > > >>>> if a race > > >>>> > > >>>> is detected. But possibly that isn't nearly as simple as it sounds? > > >>>> > > >>>> > > >>>>> Thanks, > > >>>>> Jinliang Zheng > > >>>>> > > >>>>>> And I analyzed the disassembly of trailing_symlink() and > > >>>>>> confirmed that a NULL > > >>>>>> ->get_link() happened here: > > >>>>>> > > >>>>>> 0xffffffff812e4850 <trailing_symlink>: nopl 0x0(%rax,%rax,1) > > >>>>>> [FTRACE NOP] > > >>>>>> 0xffffffff812e4855 <trailing_symlink+0x5>: push %rbp > > >>>>>> 0xffffffff812e4856 <trailing_symlink+0x6>: mov %rsp,%rbp > > >>>>>> 0xffffffff812e4859 <trailing_symlink+0x9>: push %r15 > > >>>>>> 0xffffffff812e485b <trailing_symlink+0xb>: push %r14 > > >>>>>> 0xffffffff812e485d <trailing_symlink+0xd>: push %r13 > > >>>>>> 0xffffffff812e485f <trailing_symlink+0xf>: push %r12 > > >>>>>> 0xffffffff812e4861 <trailing_symlink+0x11>: push %rbx > > >>>>>> 0xffffffff812e4862 <trailing_symlink+0x12>: mov > > >>>>>> %rdi,%rbx # rbx = &nameidate > > >>>>>> 0xffffffff812e4865 <trailing_symlink+0x15>: sub $0x8,%rsp > > >>>>>> 0xffffffff812e4869 <trailing_symlink+0x19>: mov > > >>>>>> 0x1765845(%rip),%edx # 0xffffffff82a4a0b4 > > >>>>>> <sysctl_protected_symlinks> > > >>>>>> 0xffffffff812e486f <trailing_symlink+0x1f>: mov 0x38(%rdi),%eax > > >>>>>> 0xffffffff812e4872 <trailing_symlink+0x22>: test %edx,%edx > > >>>>>> 0xffffffff812e4874 <trailing_symlink+0x24>: je > > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> > > >>>>>> 0xffffffff812e4876 <trailing_symlink+0x26>: mov %gs:0x1ad00,%rdx > > >>>>>> 0xffffffff812e487f <trailing_symlink+0x2f>: mov > > >>>>>> 0xc8(%rdi),%rcx # rcx = nameidata->link_inode > > >>>>>> 0xffffffff812e4886 <trailing_symlink+0x36>: mov 0xc18(%rdx),%rdx > > >>>>>> 0xffffffff812e488d <trailing_symlink+0x3d>: mov > > >>>>>> 0x4(%rcx),%ecx # ecx = link_inode->uid > > >>>>>> 0xffffffff812e4890 <trailing_symlink+0x40>: cmp %ecx,0x1c(%rdx) > > >>>>>> 0xffffffff812e4893 <trailing_symlink+0x43>: je > > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> > > >>>>>> 0xffffffff812e4895 <trailing_symlink+0x45>: mov 0x30(%rdi),%rsi > > >>>>>> 0xffffffff812e4899 <trailing_symlink+0x49>: movzwl (%rsi),%edx > > >>>>>> 0xffffffff812e489c <trailing_symlink+0x4c>: and $0x202,%dx > > >>>>>> 0xffffffff812e48a1 <trailing_symlink+0x51>: cmp $0x202,%dx > > >>>>>> 0xffffffff812e48a6 <trailing_symlink+0x56>: je > > >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f> > > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c>: or $0x10,%eax > > >>>>>> 0xffffffff812e48af <trailing_symlink+0x5f>: mov > > >>>>>> %eax,0x38(%rbx) # nd->flags |= LOOKUP_PARENT > > >>>>>> 0xffffffff812e48b2 <trailing_symlink+0x62>: mov > > >>>>>> 0x50(%rbx),%rax # rax = nd->stack > > >>>>>> 0xffffffff812e48b6 <trailing_symlink+0x66>: movq > > >>>>>> $0x0,0x20(%rax) # stack[0].name = NULL > > >>>>>> 0xffffffff812e48be <trailing_symlink+0x6e>: mov > > >>>>>> 0x48(%rbx),%eax # nd->depth > > >>>>>> 0xffffffff812e48c1 <trailing_symlink+0x71>: mov > > >>>>>> 0x50(%rbx),%rdx # nd->stack > > >>>>>> 0xffffffff812e48c5 <trailing_symlink+0x75>: mov > > >>>>>> 0xc8(%rbx),%r13 # nd->link_inode > > >>>>>> 0xffffffff812e48cc <trailing_symlink+0x7c>: lea > > >>>>>> (%rax,%rax,2),%rax # rax = depth * 3 > > >>>>>> 0xffffffff812e48d0 <trailing_symlink+0x80>: shl > > >>>>>> $0x4,%rax # rax = rax << 4, sizeof(saved):0x30 > > >>>>>> 0xffffffff812e48d4 <trailing_symlink+0x84>: lea > > >>>>>> -0x30(%rdx,%rax,1),%r15 # r15 = last > > >>>>>> 0xffffffff812e48d9 <trailing_symlink+0x89>: mov > > >>>>>> 0x8(%r15),%r14 # r14 = last->link.dentry > > >>>>>> 0xffffffff812e48dd <trailing_symlink+0x8d>: testb $0x40,0x38(%rbx) > > >>>>>> 0xffffffff812e48e1 <trailing_symlink+0x91>: je > > >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100> > > >>>>>> 0xffffffff812e48e3 <trailing_symlink+0x93>: mov %r13,%rsi > > >>>>>> 0xffffffff812e48e6 <trailing_symlink+0x96>: mov %r15,%rdi > > >>>>>> 0xffffffff812e48e9 <trailing_symlink+0x99>: callq > > >>>>>> 0xffffffff812f8a00 <atime_needs_update> > > >>>>>> 0xffffffff812e48ee <trailing_symlink+0x9e>: test %al,%al > > >>>>>> 0xffffffff812e48f0 <trailing_symlink+0xa0>: jne > > >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206> > > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6>: mov 0x38(%rbx),%edx > > >>>>>> 0xffffffff812e48f9 <trailing_symlink+0xa9>: mov %r13,%rsi > > >>>>>> 0xffffffff812e48fc <trailing_symlink+0xac>: mov %r14,%rdi > > >>>>>> 0xffffffff812e48ff <trailing_symlink+0xaf>: shr $0x6,%edx > > >>>>>> 0xffffffff812e4902 <trailing_symlink+0xb2>: and $0x1,%edx > > >>>>>> 0xffffffff812e4905 <trailing_symlink+0xb5>: callq > > >>>>>> 0xffffffff81424310 <security_inode_follow_link> > > >>>>>> 0xffffffff812e490a <trailing_symlink+0xba>: movslq %eax,%r12 > > >>>>>> 0xffffffff812e490d <trailing_symlink+0xbd>: test %eax,%eax > > >>>>>> 0xffffffff812e490f <trailing_symlink+0xbf>: jne > > >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9> > > >>>>>> 0xffffffff812e4911 <trailing_symlink+0xc1>: movl $0x4,0x44(%rbx) > > >>>>>> 0xffffffff812e4918 <trailing_symlink+0xc8>: mov 0x248(%r13),%r12 > > >>>>>> 0xffffffff812e491f <trailing_symlink+0xcf>: test %r12,%r12 > > >>>>>> 0xffffffff812e4922 <trailing_symlink+0xd2>: je > > >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195> > > >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8>: movzbl (%r12),%eax > > >>>>>> 0xffffffff812e492d <trailing_symlink+0xdd>: cmp $0x2f,%al > > >>>>>> 0xffffffff812e492f <trailing_symlink+0xdf>: je > > >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167> > > >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5>: test %al,%al > > >>>>>> 0xffffffff812e4937 <trailing_symlink+0xe7>: je > > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> > > >>>>>> 0xffffffff812e4939 <trailing_symlink+0xe9>: test %r12,%r12 > > >>>>>> 0xffffffff812e493c <trailing_symlink+0xec>: je > > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee>: add $0x8,%rsp > > >>>>>> 0xffffffff812e4942 <trailing_symlink+0xf2>: mov %r12,%rax > > >>>>>> 0xffffffff812e4945 <trailing_symlink+0xf5>: pop %rbx > > >>>>>> 0xffffffff812e4946 <trailing_symlink+0xf6>: pop %r12 > > >>>>>> 0xffffffff812e4948 <trailing_symlink+0xf8>: pop %r13 > > >>>>>> 0xffffffff812e494a <trailing_symlink+0xfa>: pop %r14 > > >>>>>> 0xffffffff812e494c <trailing_symlink+0xfc>: pop %r15 > > >>>>>> 0xffffffff812e494e <trailing_symlink+0xfe>: pop %rbp > > >>>>>> 0xffffffff812e494f <trailing_symlink+0xff>: retq > > >>>>>> 0xffffffff812e4950 <trailing_symlink+0x100>: mov %r15,%rdi > > >>>>>> 0xffffffff812e4953 <trailing_symlink+0x103>: callq > > >>>>>> 0xffffffff812f8ae0 <touch_atime> > > >>>>>> 0xffffffff812e4958 <trailing_symlink+0x108>: callq > > >>>>>> 0xffffffff81a26410 <_cond_resched> > > >>>>>> 0xffffffff812e495d <trailing_symlink+0x10d>: jmp > > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> > > >>>>>> 0xffffffff812e495f <trailing_symlink+0x10f>: mov 0x4(%rsi),%edx > > >>>>>> 0xffffffff812e4962 <trailing_symlink+0x112>: cmp $0xffffffff,%edx > > >>>>>> 0xffffffff812e4965 <trailing_symlink+0x115>: je > > >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f> > > >>>>>> 0xffffffff812e4967 <trailing_symlink+0x117>: cmp %edx,%ecx > > >>>>>> 0xffffffff812e4969 <trailing_symlink+0x119>: je > > >>>>>> 0xffffffff812e48ac <trailing_symlink+0x5c> > > >>>>>> 0xffffffff812e496f <trailing_symlink+0x11f>: mov > > >>>>>> $0xfffffffffffffff6,%r12 > > >>>>>> 0xffffffff812e4976 <trailing_symlink+0x126>: test $0x40,%al > > >>>>>> 0xffffffff812e4978 <trailing_symlink+0x128>: jne > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > > >>>>>> 0xffffffff812e497a <trailing_symlink+0x12a>: mov %gs:0x1ad00,%rax > > >>>>>> 0xffffffff812e4983 <trailing_symlink+0x133>: mov 0xce0(%rax),%rax > > >>>>>> 0xffffffff812e498a <trailing_symlink+0x13a>: test %rax,%rax > > >>>>>> 0xffffffff812e498d <trailing_symlink+0x13d>: je > > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149> > > >>>>>> 0xffffffff812e498f <trailing_symlink+0x13f>: mov (%rax),%eax > > >>>>>> 0xffffffff812e4991 <trailing_symlink+0x141>: test %eax,%eax > > >>>>>> 0xffffffff812e4993 <trailing_symlink+0x143>: je > > >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f> > > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149>: mov > > >>>>>> $0xffffffff82319b4f,%rdi > > >>>>>> 0xffffffff812e49a0 <trailing_symlink+0x150>: mov > > >>>>>> $0xfffffffffffffff3,%r12 > > >>>>>> 0xffffffff812e49a7 <trailing_symlink+0x157>: callq > > >>>>>> 0xffffffff81161310 <audit_log_link_denied> > > >>>>>> 0xffffffff812e49ac <trailing_symlink+0x15c>: jmp > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e>: mov > > >>>>>> $0xffffffff8230164d,%r12 > > >>>>>> 0xffffffff812e49b5 <trailing_symlink+0x165>: jmp > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > > >>>>>> 0xffffffff812e49b7 <trailing_symlink+0x167>: cmpq $0x0,0x20(%rbx) > > >>>>>> 0xffffffff812e49bc <trailing_symlink+0x16c>: je > > >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a> > > >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172>: mov %rbx,%rdi > > >>>>>> 0xffffffff812e49c5 <trailing_symlink+0x175>: callq > > >>>>>> 0xffffffff812e2da0 <nd_jump_root> > > >>>>>> 0xffffffff812e49ca <trailing_symlink+0x17a>: test %eax,%eax > > >>>>>> 0xffffffff812e49cc <trailing_symlink+0x17c>: jne > > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> > > >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182>: add $0x1,%r12 > > >>>>>> 0xffffffff812e49d6 <trailing_symlink+0x186>: movzbl (%r12),%eax > > >>>>>> 0xffffffff812e49db <trailing_symlink+0x18b>: cmp $0x2f,%al > > >>>>>> 0xffffffff812e49dd <trailing_symlink+0x18d>: jne > > >>>>>> 0xffffffff812e4935 <trailing_symlink+0xe5> > > >>>>>> 0xffffffff812e49e3 <trailing_symlink+0x193>: jmp > > >>>>>> 0xffffffff812e49d2 <trailing_symlink+0x182> > > >>>>>> 0xffffffff812e49e5 <trailing_symlink+0x195>: mov > > >>>>>> 0x20(%r13),%rax # inode->i_op > > >>>>>> 0xffffffff812e49e9 <trailing_symlink+0x199>: add $0x10,%r15 > > >>>>>> 0xffffffff812e49ed <trailing_symlink+0x19d>: mov %r13,%rsi > > >>>>>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>: mov %r15,%rdx > > >>>>>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>: mov > > >>>>>> 0x8(%rax),%rcx # inode_operations->get_link > > >>>>>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>: testb $0x40,0x38(%rbx) > > >>>>>> 0xffffffff812e49fb <trailing_symlink+0x1ab>: jne > > >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf> > > >>>>>> 0xffffffff812e49fd <trailing_symlink+0x1ad>: mov > > >>>>>> %r14,%rdi # nd->flags & LOOKUP_RCU == 0 > > >>>>>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>: callq > > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx > > >>>>>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>: mov %rax,%r12 > > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>: test %r12,%r12 > > >>>>>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>: je > > >>>>>> 0xffffffff812e49ae <trailing_symlink+0x15e> > > >>>>>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>: cmp > > >>>>>> $0xfffffffffffff000,%r12 > > >>>>>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>: jbe > > >>>>>> 0xffffffff812e4928 <trailing_symlink+0xd8> > > >>>>>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>: jmpq > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > > >>>>>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>: xor > > >>>>>> %edi,%edi # nd->flags & LOOKUP_RCU != 0 > > >>>>>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>: mov %rcx,-0x30(%rbp) > > >>>>>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>: callq > > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx > > >>>>>> 0xffffffff812e4a2a <trailing_symlink+0x1da>: mov %rax,%r12 > > >>>>>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>: cmp > > >>>>>> $0xfffffffffffffff6,%rax > > >>>>>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>: jne > > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> > > >>>>>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>: mov %rbx,%rdi > > >>>>>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>: callq > > >>>>>> 0xffffffff812e3840 <unlazy_walk> > > >>>>>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>: test %eax,%eax > > >>>>>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>: jne > > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> > > >>>>>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>: mov %r15,%rdx > > >>>>>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>: mov %r13,%rsi > > >>>>>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>: mov %r14,%rdi > > >>>>>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>: mov -0x30(%rbp),%rcx > > >>>>>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>: callq > > >>>>>> 0xffffffff81e00f70 <__x86_indirect_thunk_rcx> > > >>>>>> 0xffffffff812e4a51 <trailing_symlink+0x201>: mov %rax,%r12 > > >>>>>> 0xffffffff812e4a54 <trailing_symlink+0x204>: jmp > > >>>>>> 0xffffffff812e4a08 <trailing_symlink+0x1b8> > > >>>>>> 0xffffffff812e4a56 <trailing_symlink+0x206>: mov %rbx,%rdi > > >>>>>> 0xffffffff812e4a59 <trailing_symlink+0x209>: callq > > >>>>>> 0xffffffff812e3840 <unlazy_walk> > > >>>>>> 0xffffffff812e4a5e <trailing_symlink+0x20e>: test %eax,%eax > > >>>>>> 0xffffffff812e4a60 <trailing_symlink+0x210>: jne > > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247> > > >>>>>> 0xffffffff812e4a62 <trailing_symlink+0x212>: mov %r15,%rdi > > >>>>>> 0xffffffff812e4a65 <trailing_symlink+0x215>: callq > > >>>>>> 0xffffffff812f8ae0 <touch_atime> > > >>>>>> 0xffffffff812e4a6a <trailing_symlink+0x21a>: jmpq > > >>>>>> 0xffffffff812e48f6 <trailing_symlink+0xa6> > > >>>>>> 0xffffffff812e4a6f <trailing_symlink+0x21f>: mov 0x50(%rbx),%rax > > >>>>>> 0xffffffff812e4a73 <trailing_symlink+0x223>: mov 0xb8(%rbx),%rdi > > >>>>>> 0xffffffff812e4a7a <trailing_symlink+0x22a>: xor %edx,%edx > > >>>>>> 0xffffffff812e4a7c <trailing_symlink+0x22c>: mov 0x8(%rax),%rsi > > >>>>>> 0xffffffff812e4a80 <trailing_symlink+0x230>: callq > > >>>>>> 0xffffffff811673f0 <__audit_inode> > > >>>>>> 0xffffffff812e4a85 <trailing_symlink+0x235>: jmpq > > >>>>>> 0xffffffff812e4999 <trailing_symlink+0x149> > > >>>>>> 0xffffffff812e4a8a <trailing_symlink+0x23a>: mov %rbx,%rdi > > >>>>>> 0xffffffff812e4a8d <trailing_symlink+0x23d>: callq > > >>>>>> 0xffffffff812e4790 <set_root> > > >>>>>> 0xffffffff812e4a92 <trailing_symlink+0x242>: jmpq > > >>>>>> 0xffffffff812e49c2 <trailing_symlink+0x172> > > >>>>>> 0xffffffff812e4a97 <trailing_symlink+0x247>: mov > > >>>>>> $0xfffffffffffffff6,%r12 > > >>>>>> 0xffffffff812e4a9e <trailing_symlink+0x24e>: jmpq > > >>>>>> 0xffffffff812e493e <trailing_symlink+0xee> > > >>>>>> > > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> According to my understanding, the problem solved by commit > > >>>>>> 7b7820b83f23 ("xfs: > > >>>>>> don't expose internal symlink metadata buffers to the vfs") is a > > >>>>>> data NULL > > >>>>>> pointer dereference, but the problem here is an instruction NULL > > >>>>>> pointer > > >>>>>> dereference. > > >>>>>> > > >>>>>> Further, I analyzed the possible triggering process as follows: > > >>>>>> > > >>>>>> rcu_walk do_unlinkat ~~> prune_dcache_sb create > > >>>>>> rcu_read_lock > > >>>>>> read_seqcount_retry > > >>>>>> (the last check) iput_final > > >>>>>> evict > > >>>>>> destroy_inode > > >>>>>> xfs_fs_destroy_inode > > >>>>>> xfs_inode_set_reclaim_tag xfs_ialloc > > >>>>>> spin_lock(ip->i_flags_lock) xfs_dialloc > > >>>>>> set(ip, XFS_IRECLAIMABLE) > > >>>>>> xfs_iget > > >>>>>> wakeup(xfs_reclaim_worker) rcu_read_lock > > >>>>>> spin_unlock(ip->i_flags_lock) xfs_iget_cache_hit > > >>>>>> spin_lock(ip->i_flags_lock) > > >>>>>> > > >>>>>> if (XFS_IRECLAIMABLE && !XFS_IRECLAIM) > > >>>>>> set(ip, XFS_IRECLAIM) > > >>>>>> spin_unlock(ip->i_flags_lock) > > >>>>>> rcu_read_unlock > > >>>>>> < ------------ > > > >>>>>> > > >>>>>> // miss synchronize_rcu() > > >>>>>> xfs_reinit_inode > > >>>>>> ->get_link = NULL > > >>>>>> get_link() // NULL > > >>>>>> > > >>>>>> rcu_read_unlock > > >>>>>> > > >>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't > > >>>>>> expose internal > > >>>>>> symlink metadata buffers to the vfs"), we should start > > >>>>>> processing this NULL > > >>>>>> ->get_link pointer dereference. > > >>>>>> > > >>>>>> Or, am I thinking wrong somewhere? > > >>>>>> > > >>>>>> Thanks, > > >>>>>> Jinliang Zheng > > >>>>>> > > >>>>>>>>> Apart from that issue, I'm not aware of any other issues that the > > >>>>>>>>> XFS inode recycling directly exposes. > > >>>>>>>>> > > >>>>>>>>>> According to my understanding, the essence of > > >>>>>>>>>> this problem is that XFS reuses > > >>>>>>>>>> the inode evicted by VFS, but VFS rcu-walk > > >>>>>>>>>> assumes that this will not happen. > > >>>>>>>>> It assumes that the inode will not change identity during the RCU > > >>>>>>>>> grace period after the inode has been evicted from cache. We can > > >>>>>>>>> safely reinstantiate an evicted inode without waiting for an RCU > > >>>>>>>>> grace period as long as it is the same inode with the same > > >>>>>>>>> content > > >>>>>>>>> and same state. > > >>>>>>>>> > > >>>>>>>>> Problems *may* arise when we unlink the inode, then evict it, > > >>>>>>>>> then a > > >>>>>>>>> new file is created and the old slab cache memory address is used > > >>>>>>>>> for the new inode. I describe the issue here: > > >>>>>>>>> > > >>>>>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@xxxxxxxxxxxxxxxxxxx/ > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>> And judging from the relevant emails, the main reason > > >>>>>>>> why ->get_link() is set > > >>>>>>>> to NULL should be the lack of synchronize_rcu() before > > >>>>>>>> xfs_reinit_inode() when > > >>>>>>>> the inode is chosen to be reused. > > >>>>>>>> > > >>>>>>>> However, perhaps due to performance reasons, this > > >>>>>>>> solution has not been merged > > >>>>>>>> for a long time. How is it now? > > >>>>>>>> > > >>>>>>>> Maybe I am missing something in the threads of mail? > > >>>>>>>> > > >>>>>>>> Thank you very much. :) > > >>>>>>>> Jinliang Zheng > > >>>>>>>> > > >>>>>>>>> That said, we have exactly zero evidence that this is actually a > > >>>>>>>>> problem in production systems. We did get systems tripping > > >>>>>>>>> over the > > >>>>>>>>> symlink issue, but there's no evidence that the > > >>>>>>>>> unlink->close->open(O_CREAT) issues are manifesting in the > > >>>>>>>>> wild and > > >>>>>>>>> hence there hasn't been any particular urgency to address it. > > >>>>>>>>> > > >>>>>>>>>> Are there any recommended workarounds until an > > >>>>>>>>>> elegant and efficient solution > > >>>>>>>>>> can be proposed? After all, causing a crash is > > >>>>>>>>>> extremely unacceptable in a > > >>>>>>>>>> production environment. > > >>>>>>>>> What crashes are you seeing in your production environment? > > >>>>>>>>> > > >>>>>>>>> -Dave. > > >>>>>>>>> -- > > >>>>>>>>> Dave Chinner > > >>>>>>>>> david@xxxxxxxxxxxxx > > > >