Re: [PATCH 3/5] xfs: rewrite xfs_icache_inode_is_allocated

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jul 10, 2023 at 09:05:05AM +1000, Dave Chinner wrote:
> On Wed, Jul 05, 2023 at 05:37:37PM -0700, Darrick J. Wong wrote:
> > On Thu, Jun 22, 2023 at 02:04:31PM +1000, Dave Chinner wrote:
> > > On Thu, May 25, 2023 at 05:51:34PM -0700, Darrick J. Wong wrote:
> > > >  
> > > > -	*inuse = !!(VFS_I(ip)->i_mode);
> > > > -	xfs_irele(ip);
> > > > -	return 0;
> > > > +	/* get the perag structure and ensure that it's inode capable */
> > > > +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ino));
> > > > +	if (!pag) {
> > > > +		/* No perag means this inode can't possibly be allocated */
> > > > +		return -EINVAL;
> > > > +	}
> > > 
> > > Probably should be xfs_perag_grab/rele in this function.
> > 
> > Why?  Is it because we presuppose that the caller holds the AGI buffer
> > and hence we only need a passive reference?
> 
> Right, there's nothing in this function to guarantee that the perag
> is valid and active, so it might be in the process of being torn
> down by, say, shrink or memory reclaim. And while we are using the
> perag, we want to ensure that nothing can start a teardown
> operation...
> 
> > > > +	spin_lock(&ip->i_flags_lock);
> > > > +	if (ip->i_ino != ino)
> > > > +		goto out_skip;
> > > > +
> > > > +	trace_xfs_icache_inode_is_allocated(ip);
> > > > +
> > > > +	/*
> > > > +	 * We have an incore inode that matches the inode we want, and the
> > > > +	 * caller holds the AGI buffer.
> > > > +	 *
> > > > +	 * If the incore inode is INEW, there are several possibilities:
> > > > +	 *
> > > > +	 * For a file that is being created, note that we allocate the ondisk
> > > > +	 * inode before allocating, initializing, and adding the incore inode
> > > > +	 * to the radix tree.
> > > > +	 *
> > > > +	 * If the incore inode is being recycled, the inode has to be allocated
> > > > +	 * because we don't allow freed inodes to be recycled.
> > > > +	 *
> > > > +	 * If the inode is queued for inactivation, it should still be
> > > > +	 * allocated.
> > > > +	 *
> > > > +	 * If the incore inode is undergoing inactivation, either it is before
> > > > +	 * the point where it would get freed ondisk (in which case i_mode is
> > > > +	 * still nonzero), or it has already been freed, in which case i_mode
> > > > +	 * is zero.  We don't take the ILOCK here, but difree and dialloc
> > > > +	 * require the AGI, which we do hold.
> > > > +	 *
> > > > +	 * If the inode is anywhere in the reclaim mechanism, we know that it's
> > > > +	 * still ok to query i_mode because we don't allow uncached inode
> > > > +	 * updates.
> > > 
> > > Is it? We explicitly consider XFS_IRECLAIM inodes as in the process
> > > of being freed, so there is no guarantee that anything in them is
> > > valid anymore. Indeed, there's a transient state in recycling an
> > > inode where we set XFS_IRECLAIM, then re-initialise the inode (which
> > > trashes i_mode) and then restore i_mode to it's correct value before
> > > clearing XFS_IRECLAIM.
> > > 
> > > Hence I think that if XFS_IRECLAIM is set, we can't make any safe
> > > judgement of the state of i_mode here with just a rcu_read_lock()
> > > being held.
> > 
> > I wrote this much too long ago, back when reclaim actually could write
> > inode clusters to disk.
> > 
> > At this point the comment and the code are both wrong -- if the inode is
> > in IRECLAIM, then it's either being recycled or reclaimed.  Neither of
> > those things modify the ondisk buffers anymore, so we actually could
> > return ENODATA here because all three callers of this function use that
> > as a signal to read i_mode from the icluster buffer.
> 
> OK.
> 
> > 
> > > > +	 *
> > > > +	 * If the incore inode is live (i.e. referenced from the dcache), the
> > > > +	 * ondisk inode had better be allocated.  This is the most trivial
> > > > +	 * case.
> > > > +	 */
> > > > +#ifdef DEBUG
> > > > +	if (ip->i_flags & XFS_INEW) {
> > > > +		/* created on disk already or recycling */
> > > > +		ASSERT(VFS_I(ip)->i_mode != 0);
> > > > +	}
> > > 
> > > I don't think this is correct. In xfs_iget_cache_miss() when
> > > allocating a new inode, we set XFS_INEW and we don't set i_mode
> > > until we call xfs_init_new_inode() after xfs_iget() on the newly
> > > allocated inode returns.  Hence there is a long period where
> > > XFS_INEW can be set and i_mode is zero and the i_flags_lock is not
> > > held.
> > 
> > I think you're referring to the situation where an icreate calls
> > xfs_iget_cache_miss in the (v5 && IGET_CREATE && !ikeep) scenario,
> > in which case we don't get around to setting i_mode until
> > xfs_init_new_inode?
> 
> Yes, that is one example I can think of.
> 
> > The icreate transaction holds the AGI all the way to the end, so a
> > different thread calling _is_allocated shouldn't find any inodes in this
> > state as long as it holds the AGI buffer, right?
> 
> Yes, I think that is correct.
> 
> > > Remember, if this is a generic function (which by placing it in
> > > fs/xfs/xfs_icache.c is essentially asserting that it is) then the
> > > inode state is only being serialised by RCU. Hence the debug code
> > > here cannot assume that it has been called with the AGI locked to
> > > serialise it against create/free operations, nor that there aren't
> > > other operations being performed on the inode as the lookup is done.
> > 
> > How about I stuff it into fs/xfs/scrub/ instead?  The only reason it's
> > in xfs_icache.c is because I wanted to keep the rcu lock and radix tree
> > lookup stuff in there... but yes I agree it's dangerous to let anyone
> > else see this weird function.
> > 
> > Plus then I can require the caller pass in a valid AGI buffer to prove
> > they've serialized against icreate/ifree. :)
> 
> That's fine by me. If we need it at a later date as generic
> functionality, we can sort out the semantics then. :)

Ok.  Even better, the xfs_perag_grab/get thing becomes irrelevant, since
scrub already got its own reference and we can just use it here in
xchk_inode_is_allocated. :)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux