Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Thu, 7 Feb 2019 08:33:13 -0800

On Thu, Feb 07, 2019 at 09:31:15AM -0500, Brian Foster wrote:
> On Wed, Feb 06, 2019 at 08:55:36AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > 
> > Use a rhashtable to cache the unlinked list incore.  This should speed
> > up unlinked processing considerably when there are a lot of inodes on
> > the unlinked list because iunlink_remove no longer has to traverse an
> > entire bucket list to find which inode points to the one being removed.
> > 
> > The incore list structure records "X.next_unlinked = Y" relations, with
> > the rhashtable using Y to index the records.  This makes finding the
> > inode X that points to a inode Y very quick.  If our cache fails to find
> > anything we can always fall back on the old method.
> > 
> > FWIW this drastically reduces the amount of time it takes to remove
> > inodes from the unlinked list.  I wrote a program to open a lot of
> > O_TMPFILE files and then close them in the same order, which takes
> > a very long time if we have to traverse the unlinked lists.  With the
> > ptach, I see:
> > 
> ...
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > ---
> 
> A few minor comments below. With those addressed, this otherwise looks
> pretty good to me:
> 
> Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx>
> 
> >  fs/xfs/libxfs/xfs_errortag.h |    4 -
> >  fs/xfs/xfs_error.c           |    3 
> >  fs/xfs/xfs_inode.c           |  269 +++++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_inode.h           |    3 
> >  fs/xfs/xfs_mount.c           |    5 +
> >  fs/xfs/xfs_mount.h           |    7 +
> >  fs/xfs/xfs_trace.h           |    1 
> >  7 files changed, 285 insertions(+), 7 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 53a9c8c26d18..fdfcb3a9bac7 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1906,6 +1906,194 @@ xfs_inactive(
> >  	xfs_qm_dqdetach(ip);
> >  }
> >  
> ...
> > +/*
> > + * Remember that @prev_agino.next_unlinked = @this_agino.  Fail loudly if there
> > + * already was an entry, but absorb any other runtime errors.
> > + */
> > +int
> > +xfs_iunlink_add_backref(
> > +	struct xfs_perag	*pag,
> > +	xfs_agino_t		prev_agino,
> > +	xfs_agino_t		this_agino)
> > +{
> > +	struct xfs_iunlink	*iu;
> > +	int			error;
> > +
> > +	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> > +		return 0;
> > +
> > +	iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
> > +	iu->iu_agino = prev_agino;
> > +	iu->iu_next_unlinked = this_agino;
> > +
> > +	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > +	if (error == 0 || error == -EEXIST)
> > +		return error;
> > +	return 0;
> 
> The return val looks funny at a glance: return -EEXIST or otherwise
> return 0 for success and all other errors (also the error == 0 looks
> superfluous). I see the comment above describes what this does, but it
> doesn't explain why. I'd move the comment to above the if check and
> explain why it's there (i.e., "Absorb all runtime errors besides -EEXIST
> because fallback blah blah. We care about -EEXIST because ...").

Will do.

> Also I assume we need to free the iu object on insert failure,
> regardless of whether we ultimately return the error.

Oops, yes, good catch!

> > +}
> > +
> > +/*
> > + * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
> > + * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
> > + * wasn't any such entry then we don't bother.
> > + */
> > +int
> > +xfs_iunlink_change_backref(
> > +	struct xfs_perag	*pag,
> > +	xfs_agino_t		agino,
> > +	xfs_agino_t		next_unlinked)
> > +{
> > +	struct xfs_iunlink	*iu;
> > +	int			error;
> > +
> > +	/* Look up the old entry; if there wasn't one then exit. */
> > +	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> > +			xfs_iunlink_hash_params);
> > +	if (!iu)
> > +		return 0;
> > +
> > +	/*
> > +	 * Remove the entry.  This shouldn't ever return an error, but if we
> > +	 * couldn't remove the old entry we don't want to add it again to the
> > +	 * hash table, and if the entry disappeared on us then someone's
> > +	 * violated the locking rules and we need to fail loudly.
> > +	 */
> > +	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > +	if (error)
> > +		return error;
> 
> What's interesting about remove failure is that if we otherwise found an
> iu for this inode, removing the inode from the unlinked list leaves a
> stale/incorrect iu around in the in-core table. So it's one thing for
> the in-core table to be a valid subset of the on-disk structures, but
> another thing entirely to have incoherence between the two. It might be
> worth pointing out that it's critical to return an error here so we
> don't actually remove the inode (whereas the re-add below is less
> strict).

Ok.

> > +
> > +	/* If there is no new next entry just free our item and return. */
> > +	if (next_unlinked == NULLAGINO) {
> > +		kmem_free(iu);
> > +		return 0;
> > +	}
> > +
> > +	/*
> > +	 * Update the hash table to the new entry, ignoring operational
> > +	 * errors.
> > +	 */
> > +	iu->iu_next_unlinked = next_unlinked;
> > +	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > +	if (error == 0 || error == -EEXIST)
> > +		return error;
> > +	return 0;
> 
> Similar error == 0 thing and potential memory leak here.

Refactored into a common helper to capture all the logic and
documentation.

> > +}
> > +
> ...
> > @@ -2141,6 +2341,27 @@ xfs_iunlink_map_prev(
> >  
> >  	ASSERT(head_agino != target_agino);
> >  
> > +	/* See if our backref cache can find it faster. */
> > +	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
> > +	if (*agino != NULLAGINO) {
> > +		error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
> > +		if (error)
> > +			return error;
> > +
> > +		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
> > +			return 0;
> > +
> > +		/*
> > +		 * If we get here the cache contents were corrupt, so drop the
> > +		 * buffer and fall back to walking the bucket list.
> > +		 */
> > +		xfs_trans_brelse(tp, *bpp);
> > +		*bpp = NULL;
> 
> I like the logic to ensure we don't screw around with an inode that
> doesn't actually point to our target, but I do wonder whether we should
> do a little more than silently ignore the fact we found an incoherent
> backref. Even just a WARN_ON[_ONCE]() or something here to help us
> detect this case during testing might be sufficient..?

Yeah, that's a good idea.  The cache can miss, but it shouldn't ever be
corrupt.

--D

> 
> Brian
> 
> > +	}
> > +
> > +	trace_xfs_iunlink_map_prev_fallback(mp, agno);
> > +
> > +	/* Otherwise, walk the entire bucket until we find it. */
> >  	next_agino = head_agino;
> >  	while (next_agino != target_agino) {
> >  		xfs_agino_t	unlinked_agino;
> > @@ -2186,6 +2407,7 @@ xfs_iunlink_remove(
> >  	struct xfs_buf		*agibp;
> >  	struct xfs_buf		*last_ibp = NULL;
> >  	struct xfs_dinode	*last_dip = NULL;
> > +	struct xfs_perag	*pag = NULL;
> >  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> >  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> >  	xfs_agino_t		next_agino;
> > @@ -2221,27 +2443,62 @@ xfs_iunlink_remove(
> >  	if (error)
> >  		return error;
> >  
> > +	/*
> > +	 * If there was a backref pointing from the next inode back to this
> > +	 * one, remove it because we've removed this inode from the list.
> > +	 *
> > +	 * Later, if this inode was in the middle of the list we'll update
> > +	 * this inode's backref to point from the next inode.
> > +	 */
> > +	if (next_agino != NULLAGINO) {
> > +		pag = xfs_perag_get(mp, agno);
> > +		error = xfs_iunlink_change_backref(pag, next_agino,
> > +				NULLAGINO);
> > +		if (error)
> > +			goto out;
> > +	}
> > +
> >  	if (head_agino == agino) {
> >  		/* Point the head of the list to the next unlinked inode. */
> >  		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
> >  				next_agino);
> >  		if (error)
> > -			return error;
> > +			goto out;
> >  	} else {
> >  		struct xfs_imap	imap;
> >  		xfs_agino_t	prev_agino;
> >  
> > +		if (!pag)
> > +			pag = xfs_perag_get(mp, agno);
> > +
> >  		/* We need to search the list for the inode being freed. */
> >  		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
> > -				&prev_agino, &imap, &last_dip, &last_ibp);
> > +				&prev_agino, &imap, &last_dip, &last_ibp,
> > +				pag);
> >  		if (error)
> > -			return error;
> > +			goto out;
> >  
> >  		/* Point the previous inode on the list to the next inode. */
> >  		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
> >  				last_dip, &imap, next_agino);
> > +
> > +		/*
> > +		 * Now we deal with the backref for this inode.  If this inode
> > +		 * pointed at a real inode, change the backref that pointed to
> > +		 * us to point to our old next.  If this inode was the end of
> > +		 * the list, delete the backref that pointed to us.  Note that
> > +		 * change_backref takes care of deleting the backref if
> > +		 * next_agino is NULLAGINO.
> > +		 */
> > +		error = xfs_iunlink_change_backref(pag, agino, next_agino);
> > +		if (error)
> > +			goto out;
> >  	}
> > -	return 0;
> > +
> > +out:
> > +	if (pag)
> > +		xfs_perag_put(pag);
> > +	return error;
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index be2014520155..e62074a5257c 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -500,4 +500,7 @@ extern struct kmem_zone	*xfs_inode_zone;
> >  
> >  bool xfs_inode_verify_forks(struct xfs_inode *ip);
> >  
> > +int xfs_iunlink_init(struct xfs_perag *pag);
> > +void xfs_iunlink_destroy(struct xfs_perag *pag);
> > +
> >  #endif	/* __XFS_INODE_H__ */
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index b4d8c318be3c..fd63b0b1307c 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -149,6 +149,7 @@ xfs_free_perag(
> >  		spin_unlock(&mp->m_perag_lock);
> >  		ASSERT(pag);
> >  		ASSERT(atomic_read(&pag->pag_ref) == 0);
> > +		xfs_iunlink_destroy(pag);
> >  		xfs_buf_hash_destroy(pag);
> >  		mutex_destroy(&pag->pag_ici_reclaim_lock);
> >  		call_rcu(&pag->rcu_head, __xfs_free_perag);
> > @@ -227,6 +228,9 @@ xfs_initialize_perag(
> >  		/* first new pag is fully initialized */
> >  		if (first_initialised == NULLAGNUMBER)
> >  			first_initialised = index;
> > +		error = xfs_iunlink_init(pag);
> > +		if (error)
> > +			goto out_hash_destroy;
> >  	}
> >  
> >  	index = xfs_set_inode_alloc(mp, agcount);
> > @@ -249,6 +253,7 @@ xfs_initialize_perag(
> >  		if (!pag)
> >  			break;
> >  		xfs_buf_hash_destroy(pag);
> > +		xfs_iunlink_destroy(pag);
> >  		mutex_destroy(&pag->pag_ici_reclaim_lock);
> >  		kmem_free(pag);
> >  	}
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 7daafe064af8..a33f45077867 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -396,6 +396,13 @@ typedef struct xfs_perag {
> >  
> >  	/* reference count */
> >  	uint8_t			pagf_refcount_level;
> > +
> > +	/*
> > +	 * Unlinked inode information.  This incore information reflects
> > +	 * data stored in the AGI, so callers must hold the AGI buffer lock
> > +	 * or have some other means to control concurrency.
> > +	 */
> > +	struct rhashtable	pagi_unlinked_hash;
> >  } xfs_perag_t;
> >  
> >  static inline struct xfs_ag_resv *
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index a6e384a642b1..c83ce022a355 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3447,6 +3447,7 @@ DEFINE_EVENT(xfs_ag_inode_class, name, \
> >  	TP_ARGS(ip))
> >  DEFINE_AGINODE_EVENT(xfs_iunlink);
> >  DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
> > +DEFINE_AG_EVENT(xfs_iunlink_map_prev_fallback);
> >  
> >  #endif /* _TRACE_XFS_H */
> >  
> >