On Mon, May 27, 2024 at 09:56:15PM +0800, Jinliang Zheng wrote:
> On Mon, 27 May 2024 at 19:41:18 +1000, Dave Chinner wrote:
> > On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> > > On 16/5/24 15:08, Ian Kent wrote:
> > > > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > > In any case what you say is indeed correct, so the comment
> > > > isn't important.
> > > >
> > > > Fact is it is still a race between the lockless path walk and
> > > > inode eviction and xfs recycling. I believe that the xfs
> > > > recycling code is very hard to fix.

> > Not really for this case. This is simply concurrent pathwalk
> > lookups occurring just after the inode has been evicted from the
> > VFS inode cache. The first lookup hits the XFS inode cache, sees
> > XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to
> > reinstantiate the VFS inode to an initial state. The XFS inode
> > itself is still valid as it hasn't reached the XFS_IRECLAIM state
> > where it will be torn down and freed.
> >
> > Whilst we are running xfs_reinit_inode(), a second RCU pathwalk
> > has been run and it is trying to call ->get_link on that same
> > inode. Unfortunately, the first lookup has just set inode->i_op =
> > &empty_iops as part of the VFS inode reinit, and that then
> > triggers the null pointer deref.
>
> The RCU pathwalk must occur before xfs_reinit_inode(), and must
> obtain the target inode before xfs_reinit_inode().

I'm not sure I follow - xfs_reinit_inode() typically occurs during a
pathwalk when no dentry for the given path component is found in the
dcache. Hence it has to create the dentry and look up the inode. i.e.

walk_component()
  lookup_fast()   -> doesn't find a valid cached dentry
  lookup_slow()
    inode_lock_shared(parent)
    parent->i_op->lookup(child)
      xfs_vn_lookup()
        xfs_lookup()
          xfs_iget(child)     <<<< inode may not exist until here
            xfs_iget_recycle(child)
              xfs_reinit_inode(child)
    inode_unlock_shared(parent)

The path you indicate is going wrong is:

link_path_walk()
  walk_component()
    <find child dentry>
    step_into(child)
      if (!d_is_symlink(child dentry)) {
        ....
        return
      }
      pick_link(child)
        if (!inode->i_link)
          inode->i_op->get_link() <<<< doesn't exist, not a symlink inode

This implies that lookup_fast() found a symlink dentry with a d_inode
pointer to something that wasn't a symlink. That doesn't mean that
anything has gone wrong with xfs inode recycling within an RCU grace
period.

For example, d_is_symlink() looks purely at the dentry state and
assumes that it matches the dentry->d_inode attached to it:

#define DCACHE_ENTRY_TYPE	(7 << 20) /* bits 20..22 are for storing type: */
#define DCACHE_MISS_TYPE	(0 << 20) /* Negative dentry */
#define DCACHE_WHITEOUT_TYPE	(1 << 20) /* Whiteout dentry (stop pathwalk) */
#define DCACHE_DIRECTORY_TYPE	(2 << 20) /* Normal directory */
#define DCACHE_AUTODIR_TYPE	(3 << 20) /* Lookupless directory (presumed automount) */
#define DCACHE_REGULAR_TYPE	(4 << 20) /* Regular file type */
#define DCACHE_SPECIAL_TYPE	(5 << 20) /* Other file type */
#define DCACHE_SYMLINK_TYPE	(6 << 20) /* Symlink */

static inline unsigned __d_entry_type(const struct dentry *dentry)
{
	return dentry->d_flags & DCACHE_ENTRY_TYPE;
}

static inline bool d_is_symlink(const struct dentry *dentry)
{
	return __d_entry_type(dentry) == DCACHE_SYMLINK_TYPE;
}

This is a valid optimisation and good for performance, but it does
make it susceptible to memory corruption based failures. i.e. a
single bit memory corruption can change a DCACHE_DIRECTORY_TYPE
dentry to look like a DCACHE_SYMLINK_TYPE dentry, and then the code
calls pick_link() on a dentry that points to a directory inode and
not a symlink inode.
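To see just how little corruption is needed here: the two type values
above differ in exactly one bit. A small standalone demonstration
(userspace only; the values are copied from the definitions above,
everything else is purely illustrative):

/* bitflip.c - show that a single flipped bit in d_flags turns a
 * directory dentry type into a symlink dentry type. */
#include <stdio.h>

#define DCACHE_ENTRY_TYPE	(7 << 20)
#define DCACHE_DIRECTORY_TYPE	(2 << 20)
#define DCACHE_SYMLINK_TYPE	(6 << 20)

int main(void)
{
	unsigned int d_flags = DCACHE_DIRECTORY_TYPE;

	/* (2 << 20) and (6 << 20) differ only in bit 22 */
	d_flags ^= (1U << 22);

	printf("d_is_symlink() now says: %s\n",
	       (d_flags & DCACHE_ENTRY_TYPE) == DCACHE_SYMLINK_TYPE ?
			"symlink" : "not a symlink");
	return 0;
}

Flip bit 22 of d_flags on a directory dentry and every subsequent
d_is_symlink() check answers "symlink", sending the pathwalk into
pick_link() on a directory inode.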
Such a memory corruption would have an identical crash signature to
the stack trace you posted, hence I'd really like to have solid
confirmation that the crash you are seeing is actually a result of
inode recycling and not something else....

> > Once the first lookup has finished the inode_init_always(),
> > xfs_reinit_inode() resets inode->i_op back to
> > &xfs_symlink_inode_operations and get_link calls work again.
> >
> > Fundamentally, the problem is that we are completely
> > reinitialising the VFS inode within the RCU grace period. i.e.
> > while concurrent RCU pathwalks can still be in progress and find
> > the VFS inode whilst the XFS inode cache is manipulating it.
> >
> > What we should be doing here is a subset of inode_init_always(),
> > which only reinitialises the bits of the VFS inode we need to
> > initialise rather than the entire inode. The identity of the
> > inode is not changing and so we don't need to go through a
> > transient state where the VFS inode goes xfs symlink -> empty
> > initialised inode -> xfs symlink.
>
> Sorry, I think this goes wrong in more ways than just the
> inode_operations. After the target inode has been reinited by
> xfs_reinit_inode(), it no longer has the semantics of the inode
> the RCU pathwalk requires. The meanings of many fields have
> changed, such as mode, i_mtime, i_atime, i_ctime and so on.

That's only the case in the unlink -> inode free -> create -> inode
allocation path, assuming that is what the system actually tripped
over. However, we can hit the reinit code from a simple path lookup
immediately after memory reclaim has freed the dentry and VFS inode
while the inode is still in the XFS inode cache. i.e.

->destroy_inode() -> XFS_IRECLAIMABLE -> ->lookup() -> xfs_iget() ->
	xfs_iget_recycle()

i.e. the inode reinit doesn't only get triggered by unlink/alloc
cycles, so we often reinit to the exact same inode state as before
the inode was evicted from memory.
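To make the transient window in that reinit concrete, here is a toy
model of the sequence (a userspace sketch with made-up names, not the
actual kernel code):

/* reinit_window.c - toy model of the VFS inode reinit window. */
#include <stdio.h>
#include <stddef.h>

struct inode_ops { const char *(*get_link)(void); };

static const char *fake_get_link(void) { return "link-target"; }

static const struct inode_ops symlink_ops = { .get_link = fake_get_link };
static const struct inode_ops empty_ops   = { .get_link = NULL };

struct inode { const struct inode_ops *i_op; };

int main(void)
{
	struct inode ino = { .i_op = &symlink_ops };

	/* Lookup 1 recycles the inode: a full reinit publishes empty
	 * ops first, then reinitialises everything else... */
	ino.i_op = &empty_ops;

	/* ...so an RCU pathwalk that dereferences ->get_link at this
	 * instant hits a NULL pointer: */
	if (ino.i_op->get_link == NULL)
		printf("window open: get_link is NULL\n");

	/* Lookup 1 finishes the reinit and restores the symlink ops;
	 * the identity of the inode never actually changed. */
	ino.i_op = &symlink_ops;
	printf("window closed: %s\n", ino.i_op->get_link());
	return 0;
}

A partial reinit that never touches i_op - valid because the inode's
identity isn't changing - never opens that window, which is the point
of the "subset of inode_init_always()" suggestion quoted above.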
Essentially, it is not clear to me how your system tripped over this
issue; it *may* be an inode cache recycling issue, but I can also
point to other situations that could result in a very similar crash
signature. What I'm looking for is real evidence that it was a
recycling issue that led to this problem, and evidence that it can
still occur on a current TOT kernel. A method for reproducing the
issue your kernels are seeing would be nice.

FWIW, reproducing on a current TOT kernel is important - even if
you're seeing the unlink/alloc/reinit case on your 5.4 kernel, this
path had a major architectural change in 5.14 and AFAICT that largely
invalidates all the previous analysis of this inode reinit behaviour.

In 5.14 we moved the inode freeing code we used to do in evict() into
a background thread, hence the "evict, unlink, create, reinit"
process now has an enforced context switch and delay between
->destroy_inode() and the internal inode unlink/freeing code. By
decoupling the unlink processing from the calling task context, the
task context can no longer immediately reallocate the same physical
inode, and so the mechanism that allowed applications to directly
trigger the xfs_reinit_inode() code for inodes that are changing
identity repeatedly in certain situations no longer exists.

The delay in unlink processing also affects how RCU grace periods
expire between unlink and allocation/reinit, so assumptions made on
that side of the analysis are also suspect and need to be
re-examined.

Hence before we spend any more time chasing ghosts, I'd really like
to see hard evidence for what caused the crash you reported, and a
demonstration of it occurring on current TOT kernels, too.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx