On Mon, Jul 10, 2023 at 09:05:05AM +1000, Dave Chinner wrote: > On Wed, Jul 05, 2023 at 05:37:37PM -0700, Darrick J. Wong wrote: > > On Thu, Jun 22, 2023 at 02:04:31PM +1000, Dave Chinner wrote: > > > On Thu, May 25, 2023 at 05:51:34PM -0700, Darrick J. Wong wrote: > > > > > > > > - *inuse = !!(VFS_I(ip)->i_mode); > > > > - xfs_irele(ip); > > > > - return 0; > > > > + /* get the perag structure and ensure that it's inode capable */ > > > > + pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ino)); > > > > + if (!pag) { > > > > + /* No perag means this inode can't possibly be allocated */ > > > > + return -EINVAL; > > > > + } > > > > > > Probably should be xfs_perag_grab/rele in this function. > > > > Why? Is it because we presuppose that the caller holds the AGI buffer > > and hence we only need a passive reference? > > Right, there's nothing in this function to guarantee that the perag > is valid and active, so it might be in the process of being torn > down by, say, shrink or memory reclaim. And while we are using the > perag, we want to ensure that nothing can start a teardown > operation... > > > > > + spin_lock(&ip->i_flags_lock); > > > > + if (ip->i_ino != ino) > > > > + goto out_skip; > > > > + > > > > + trace_xfs_icache_inode_is_allocated(ip); > > > > + > > > > + /* > > > > + * We have an incore inode that matches the inode we want, and the > > > > + * caller holds the AGI buffer. > > > > + * > > > > + * If the incore inode is INEW, there are several possibilities: > > > > + * > > > > + * For a file that is being created, note that we allocate the ondisk > > > > + * inode before allocating, initializing, and adding the incore inode > > > > + * to the radix tree. > > > > + * > > > > + * If the incore inode is being recycled, the inode has to be allocated > > > > + * because we don't allow freed inodes to be recycled. > > > > + * > > > > + * If the inode is queued for inactivation, it should still be > > > > + * allocated. > > > > + * > > > > + * If the incore inode is undergoing inactivation, either it is before > > > > + * the point where it would get freed ondisk (in which case i_mode is > > > > + * still nonzero), or it has already been freed, in which case i_mode > > > > + * is zero. We don't take the ILOCK here, but difree and dialloc > > > > + * require the AGI, which we do hold. > > > > + * > > > > + * If the inode is anywhere in the reclaim mechanism, we know that it's > > > > + * still ok to query i_mode because we don't allow uncached inode > > > > + * updates. > > > > > > Is it? We explicitly consider XFS_IRECLAIM inodes as in the process > > > of being freed, so there is no guarantee that anything in them is > > > valid anymore. Indeed, there's a transient state in recycling an > > > inode where we set XFS_IRECLAIM, then re-initialise the inode (which > > > trashes i_mode) and then restore i_mode to it's correct value before > > > clearing XFS_IRECLAIM. > > > > > > Hence I think that if XFS_IRECLAIM is set, we can't make any safe > > > judgement of the state of i_mode here with just a rcu_read_lock() > > > being held. > > > > I wrote this much too long ago, back when reclaim actually could write > > inode clusters to disk. > > > > At this point the comment and the code are both wrong -- if the inode is > > in IRECLAIM, then it's either being recycled or reclaimed. Neither of > > those things modify the ondisk buffers anymore, so we actually could > > return ENODATA here because all three callers of this function use that > > as a signal to read i_mode from the icluster buffer. > > OK. > > > > > > > + * > > > > + * If the incore inode is live (i.e. referenced from the dcache), the > > > > + * ondisk inode had better be allocated. This is the most trivial > > > > + * case. > > > > + */ > > > > +#ifdef DEBUG > > > > + if (ip->i_flags & XFS_INEW) { > > > > + /* created on disk already or recycling */ > > > > + ASSERT(VFS_I(ip)->i_mode != 0); > > > > + } > > > > > > I don't think this is correct. In xfs_iget_cache_miss() when > > > allocating a new inode, we set XFS_INEW and we don't set i_mode > > > until we call xfs_init_new_inode() after xfs_iget() on the newly > > > allocated inode returns. Hence there is a long period where > > > XFS_INEW can be set and i_mode is zero and the i_flags_lock is not > > > held. > > > > I think you're referring to the situation where an icreate calls > > xfs_iget_cache_miss in the (v5 && IGET_CREATE && !ikeep) scenario, > > in which case we don't get around to setting i_mode until > > xfs_init_new_inode? > > Yes, that is one example I can think of. > > > The icreate transaction holds the AGI all the way to the end, so a > > different thread calling _is_allocated shouldn't find any inodes in this > > state as long as it holds the AGI buffer, right? > > Yes, I think that is correct. > > > > Remember, if this is a generic function (which by placing it in > > > fs/xfs/xfs_icache.c is essentially asserting that it is) then the > > > inode state is only being serialised by RCU. Hence the debug code > > > here cannot assume that it has been called with the AGI locked to > > > serialise it against create/free operations, nor that there aren't > > > other operations being performed on the inode as the lookup is done. > > > > How about I stuff it into fs/xfs/scrub/ instead? The only reason it's > > in xfs_icache.c is because I wanted to keep the rcu lock and radix tree > > lookup stuff in there... but yes I agree it's dangerous to let anyone > > else see this weird function. > > > > Plus then I can require the caller pass in a valid AGI buffer to prove > > they've serialized against icreate/ifree. :) > > That's fine by me. If we need it at a later date as generic > functionality, we can sort out the semantics then. :) Ok. Even better, the xfs_perag_grab/get thing becomes irrelevant, since scrub already got its own reference and we can just use it here in xchk_inode_is_allocated. :) --D > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx