On Mon, May 27, 2024 at 09:56:15PM +0800, Jinliang Zheng wrote:
> On Mon, 27 May 2024 at 19:41:18 +1000, Dave Chinner wrote:
> > On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> > > On 16/5/24 15:08, Ian Kent wrote:
> > > > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > > In any case what you say is indeed correct, so the comment
> > > > isn't important.
> > > >
> > > > Fact is it is still a race between the lockless path walk and
> > > > inode eviction and xfs recycling. I believe that the xfs
> > > > recycling code is very hard to fix.

> > Not really for this case. This is simply concurrent pathwalk
> > lookups occurring just after the inode has been evicted from the
> > VFS inode cache. The first lookup hits the XFS inode cache, sees
> > XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to
> > reinstantiate the VFS inode to an initial state. The XFS inode
> > itself is still valid as it hasn't reached the XFS_IRECLAIM state
> > where it will be torn down and freed.
> >
> > Whilst we are running xfs_reinit_inode(), a second RCU pathwalk
> > has been run and it is trying to call ->get_link on that same
> > inode. Unfortunately, the first lookup has just set inode->i_op =
> > &empty_iops as part of the VFS inode reinit, and that then
> > triggers the null pointer deref.
>
> The RCU pathwalk must occur before xfs_reinit_inode(), and must
> obtain the target inode before xfs_reinit_inode().

I'm not sure I follow - xfs_reinit_inode() typically occurs during a
pathwalk when no dentry for the given path component is found in the
dcache. Hence it has to create the dentry and look up the inode. i.e.

walk_component()
  lookup_fast()   -> doesn't find a valid cached dentry
  lookup_slow()
    inode_lock_shared(parent)
    parent->i_op->lookup(child)
      xfs_vn_lookup()
        xfs_lookup()
          xfs_iget(child)     <<<< inode may not exist until here
            xfs_iget_recycle(child)
              xfs_reinit_inode(child)
    inode_unlock_shared(parent)

The path you indicate is going wrong is:

link_path_walk()
  walk_component()
    <find child dentry>
    step_into(child)
      if (!d_is_symlink(child dentry)) {
        ....
        return
      }
      pick_link(child)
        if (!inode->i_link)
          inode->i_op->get_link() <<<< doesn't exist, not a symlink inode

This implies that lookup_fast() found a symlink dentry with a d_inode
pointer to something that wasn't a symlink. That doesn't mean that
anything has gone wrong with xfs inode recycling within an RCU grace
period.

For example, d_is_symlink() looks purely at the dentry state and
assumes that it matches the dentry->d_inode attached to it:

#define DCACHE_ENTRY_TYPE	(7 << 20) /* bits 20..22 are for storing type: */
#define DCACHE_MISS_TYPE	(0 << 20) /* Negative dentry */
#define DCACHE_WHITEOUT_TYPE	(1 << 20) /* Whiteout dentry (stop pathwalk) */
#define DCACHE_DIRECTORY_TYPE	(2 << 20) /* Normal directory */
#define DCACHE_AUTODIR_TYPE	(3 << 20) /* Lookupless directory (presumed automount) */
#define DCACHE_REGULAR_TYPE	(4 << 20) /* Regular file type */
#define DCACHE_SPECIAL_TYPE	(5 << 20) /* Other file type */
#define DCACHE_SYMLINK_TYPE	(6 << 20) /* Symlink */

static inline unsigned __d_entry_type(const struct dentry *dentry)
{
	return dentry->d_flags & DCACHE_ENTRY_TYPE;
}

static inline bool d_is_symlink(const struct dentry *dentry)
{
	return __d_entry_type(dentry) == DCACHE_SYMLINK_TYPE;
}

This is a valid optimisation and good for performance, but it does
make it susceptible to memory corruption based failures. i.e. a
single bit memory corruption can change a DCACHE_DIRECTORY_TYPE
dentry to look like a DCACHE_SYMLINK_TYPE dentry, and then the code
calls pick_link() on a dentry that points to a directory inode and
not a symlink inode.
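To see just how little corruption is needed here: the two type values
above differ in exactly one bit. A small standalone demonstration
(userspace only; the values are copied from the definitions above,
everything else is purely illustrative):

/* bitflip.c - show that a single flipped bit in d_flags turns a
 * directory dentry type into a symlink dentry type. */
#include <stdio.h>

#define DCACHE_ENTRY_TYPE	(7 << 20)
#define DCACHE_DIRECTORY_TYPE	(2 << 20)
#define DCACHE_SYMLINK_TYPE	(6 << 20)

int main(void)
{
	unsigned int d_flags = DCACHE_DIRECTORY_TYPE;

	/* (2 << 20) and (6 << 20) differ only in bit 22 */
	d_flags ^= (1U << 22);

	printf("d_is_symlink() now says: %s\n",
	       (d_flags & DCACHE_ENTRY_TYPE) == DCACHE_SYMLINK_TYPE ?
			"symlink" : "not a symlink");
	return 0;
}

Flip bit 22 of d_flags on a directory dentry and every subsequent
d_is_symlink() check answers "symlink", sending the pathwalk into
pick_link() on a directory inode.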
Such a memory corruption would have an identical crash signature to
the stack trace you posted, hence I'd really like to have solid
confirmation that the crash you are seeing is actually a result of
inode recycling and not something else....

> > Once the first lookup has finished the inode_init_always(),
> > xfs_reinit_inode() resets inode->i_op back to
> > &xfs_symlink_inode_operations and get_link calls work again.
> >
> > Fundamentally, the problem is that we are completely
> > reinitialising the VFS inode within the RCU grace period. i.e.
> > while concurrent RCU pathwalks can still be in progress and find
> > the VFS inode whilst the XFS inode cache is manipulating it.
> >
> > What we should be doing here is a subset of inode_init_always(),
> > which only reinitialises the bits of the VFS inode we need to
> > initialise rather than the entire inode. The identity of the
> > inode is not changing and so we don't need to go through a
> > transient state where the VFS inode goes xfs symlink -> empty
> > initialised inode -> xfs symlink.
>
> Sorry, I think this goes wrong in more ways than just the
> inode_operations. After the target inode has been reinited by
> xfs_reinit_inode(), it no longer has the semantics of the inode
> the RCU pathwalk requires. The meanings of many fields have
> changed, such as mode, i_mtime, i_atime, i_ctime and so on.

That's only the case in the unlink -> inode free -> create -> inode
allocation path, assuming that is what the system actually tripped
over. However, we can hit the reinit code from a simple path lookup
immediately after memory reclaim has freed the dentry and VFS inode
while the inode is still in the XFS inode cache. i.e.

->destroy_inode() -> XFS_IRECLAIMABLE -> ->lookup() -> xfs_iget() ->
	xfs_iget_recycle()

i.e. the inode reinit doesn't only get triggered by unlink/alloc
cycles, so we often reinit to the exact same inode state as before
the inode was evicted from memory.
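To make the transient window in that reinit concrete, here is a toy
model of the sequence (a userspace sketch with made-up names, not the
actual kernel code):

/* reinit_window.c - toy model of the VFS inode reinit window. */
#include <stdio.h>
#include <stddef.h>

struct inode_ops { const char *(*get_link)(void); };

static const char *fake_get_link(void) { return "link-target"; }

static const struct inode_ops symlink_ops = { .get_link = fake_get_link };
static const struct inode_ops empty_ops   = { .get_link = NULL };

struct inode { const struct inode_ops *i_op; };

int main(void)
{
	struct inode ino = { .i_op = &symlink_ops };

	/* Lookup 1 recycles the inode: a full reinit publishes empty
	 * ops first, then reinitialises everything else... */
	ino.i_op = &empty_ops;

	/* ...so an RCU pathwalk that dereferences ->get_link at this
	 * instant hits a NULL pointer: */
	if (ino.i_op->get_link == NULL)
		printf("window open: get_link is NULL\n");

	/* Lookup 1 finishes the reinit and restores the symlink ops;
	 * the identity of the inode never actually changed. */
	ino.i_op = &symlink_ops;
	printf("window closed: %s\n", ino.i_op->get_link());
	return 0;
}

A partial reinit that never touches i_op - valid because the inode's
identity isn't changing - never opens that window, which is the point
of the "subset of inode_init_always()" suggestion quoted above.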
Essentially, it is not clear to me how your system tripped over this
issue; it *may* be an inode cache recycling issue, but I can also
point to other situations that could result in a very similar crash
signature. What I'm looking for is real evidence that it was a
recycling issue that led to this problem, and evidence that it can
still occur on a current TOT kernel. A method for reproducing the
issue your kernels are seeing would be nice.

FWIW, reproducing on a current TOT kernel is important - even if
you're seeing the unlink/alloc/reinit case on your 5.4 kernel, this
path had a major architectural change in 5.14 and AFAICT that largely
invalidates all the previous analysis of this inode reinit behaviour.

In 5.14 we moved the inode freeing code we used to do in evict() into
a background thread, hence the "evict, unlink, create, reinit"
process now has an enforced context switch and delay between
->destroy_inode() and the internal inode unlink/freeing code. By
decoupling the unlink processing from the calling task context, the
task context can no longer immediately reallocate the same physical
inode, and so the mechanism that allowed applications to directly
trigger the xfs_reinit_inode() code for inodes that are changing
identity repeatedly in certain situations no longer exists.

The delay in unlink processing also affects how RCU grace periods
expire between unlink and allocation/reinit, so assumptions made on
that side of the analysis are also suspect and need to be
re-examined.

Hence before we spend any more time chasing ghosts, I'd really like
to see hard evidence for what caused the crash you reported, and a
demonstration of it occurring on current TOT kernels, too.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx