Re: [PATCH 3/5] xfs: rewrite xfs_icache_inode_is_allocated

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jul 05, 2023 at 05:37:37PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 22, 2023 at 02:04:31PM +1000, Dave Chinner wrote:
> > On Thu, May 25, 2023 at 05:51:34PM -0700, Darrick J. Wong wrote:
> > >  
> > > -	*inuse = !!(VFS_I(ip)->i_mode);
> > > -	xfs_irele(ip);
> > > -	return 0;
> > > +	/* get the perag structure and ensure that it's inode capable */
> > > +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ino));
> > > +	if (!pag) {
> > > +		/* No perag means this inode can't possibly be allocated */
> > > +		return -EINVAL;
> > > +	}
> > 
> > Probably should be xfs_perag_grab/rele in this function.
> 
> Why?  Is it because we presuppose that the caller holds the AGI buffer
> and hence we only need a passive reference?

Right, there's nothing in this function to guarantee that the perag
is valid and active, so it might be in the process of being torn
down by, say, shrink or memory reclaim. And while we are using the
perag, we want to ensure that nothing can start a teardown
operation...

> > > +	spin_lock(&ip->i_flags_lock);
> > > +	if (ip->i_ino != ino)
> > > +		goto out_skip;
> > > +
> > > +	trace_xfs_icache_inode_is_allocated(ip);
> > > +
> > > +	/*
> > > +	 * We have an incore inode that matches the inode we want, and the
> > > +	 * caller holds the AGI buffer.
> > > +	 *
> > > +	 * If the incore inode is INEW, there are several possibilities:
> > > +	 *
> > > +	 * For a file that is being created, note that we allocate the ondisk
> > > +	 * inode before allocating, initializing, and adding the incore inode
> > > +	 * to the radix tree.
> > > +	 *
> > > +	 * If the incore inode is being recycled, the inode has to be allocated
> > > +	 * because we don't allow freed inodes to be recycled.
> > > +	 *
> > > +	 * If the inode is queued for inactivation, it should still be
> > > +	 * allocated.
> > > +	 *
> > > +	 * If the incore inode is undergoing inactivation, either it is before
> > > +	 * the point where it would get freed ondisk (in which case i_mode is
> > > +	 * still nonzero), or it has already been freed, in which case i_mode
> > > +	 * is zero.  We don't take the ILOCK here, but difree and dialloc
> > > +	 * require the AGI, which we do hold.
> > > +	 *
> > > +	 * If the inode is anywhere in the reclaim mechanism, we know that it's
> > > +	 * still ok to query i_mode because we don't allow uncached inode
> > > +	 * updates.
> > 
> > Is it? We explicitly consider XFS_IRECLAIM inodes as in the process
> > of being freed, so there is no guarantee that anything in them is
> > valid anymore. Indeed, there's a transient state in recycling an
> > inode where we set XFS_IRECLAIM, then re-initialise the inode (which
> > trashes i_mode) and then restore i_mode to it's correct value before
> > clearing XFS_IRECLAIM.
> > 
> > Hence I think that if XFS_IRECLAIM is set, we can't make any safe
> > judgement of the state of i_mode here with just a rcu_read_lock()
> > being held.
> 
> I wrote this much too long ago, back when reclaim actually could write
> inode clusters to disk.
> 
> At this point the comment and the code are both wrong -- if the inode is
> in IRECLAIM, then it's either being recycled or reclaimed.  Neither of
> those things modify the ondisk buffers anymore, so we actually could
> return ENODATA here because all three callers of this function use that
> as a signal to read i_mode from the icluster buffer.

OK.

> 
> > > +	 *
> > > +	 * If the incore inode is live (i.e. referenced from the dcache), the
> > > +	 * ondisk inode had better be allocated.  This is the most trivial
> > > +	 * case.
> > > +	 */
> > > +#ifdef DEBUG
> > > +	if (ip->i_flags & XFS_INEW) {
> > > +		/* created on disk already or recycling */
> > > +		ASSERT(VFS_I(ip)->i_mode != 0);
> > > +	}
> > 
> > I don't think this is correct. In xfs_iget_cache_miss() when
> > allocating a new inode, we set XFS_INEW and we don't set i_mode
> > until we call xfs_init_new_inode() after xfs_iget() on the newly
> > allocated inode returns.  Hence there is a long period where
> > XFS_INEW can be set and i_mode is zero and the i_flags_lock is not
> > held.
> 
> I think you're referring to the situation where an icreate calls
> xfs_iget_cache_miss in the (v5 && IGET_CREATE && !ikeep) scenario,
> in which case we don't get around to setting i_mode until
> xfs_init_new_inode?

Yes, that is one example I can think of.

> The icreate transaction holds the AGI all the way to the end, so a
> different thread calling _is_allocated shouldn't find any inodes in this
> state as long as it holds the AGI buffer, right?

Yes, I think that is correct.

> > Remember, if this is a generic function (which by placing it in
> > fs/xfs/xfs_icache.c is essentially asserting that it is) then the
> > inode state is only being serialised by RCU. Hence the debug code
> > here cannot assume that it has been called with the AGI locked to
> > serialise it against create/free operations, nor that there aren't
> > other operations being performed on the inode as the lookup is done.
> 
> How about I stuff it into fs/xfs/scrub/ instead?  The only reason it's
> in xfs_icache.c is because I wanted to keep the rcu lock and radix tree
> lookup stuff in there... but yes I agree it's dangerous to let anyone
> else see this weird function.
> 
> Plus then I can require the caller pass in a valid AGI buffer to prove
> they've serialized against icreate/ifree. :)

That's fine by me. If we need it at a later date as generic
functionality, we can sort out the semantics then. :)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux