Re: [PATCH 10/10] xfs: cache unlinked pointers in an rhashtable

Brian Foster <bfoster@xxxxxxxxxx> · Tue, 5 Feb 2019 09:24:59 -0500

On Mon, Feb 04, 2019 at 10:00:05AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> 
> Use a rhashtable to cache the unlinked list incore.  This should speed
> up unlinked processing considerably when there are a lot of inodes on
> the unlinked list because iunlink_remove no longer has to traverse an
> entire bucket list to find which inode points to the one being removed.
> 
> The incore list structure records "X.next_unlinked = Y" relations, with
> the rhashtable using Y to index the records.  This makes finding the
> inode X that points to a inode Y very quick.  If our cache fails to find
> anything we can always fall back on the old method.
> 
> FWIW this drastically reduces the amount of time it takes to remove
> inodes from the unlinked list.  I wrote a program to open a lot of
> O_TMPFILE files and then close them in the same order, which takes
> a very long time if we have to traverse the unlinked lists.  With the
> ptach, I see:
> 
...
> ---
>  fs/xfs/xfs_inode.c       |  207 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_inode.h       |    9 ++
>  fs/xfs/xfs_log_recover.c |   12 ++-
>  fs/xfs/xfs_mount.c       |    5 +
>  fs/xfs/xfs_mount.h       |    1 
>  5 files changed, 233 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index b9696d762c8f..baee8c894447 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1880,6 +1880,167 @@ xfs_inactive(
>  	xfs_qm_dqdetach(ip);
>  }
>  
...
> +
> +static const struct rhashtable_params xfs_iunlink_hash_params = {
> +	.min_size		= XFS_AGI_UNLINKED_BUCKETS,
> +	.nelem_hint		= 512,

Any reasoning behind the 512 value? It seems rather large to me, at
least until we get more into deferred inactivation and whatnot. It looks
like the rhashtable code will round this up to 1024 as well, FWIW.

I'm also wondering whether a kmem_zone might be worthwhile for
xfs_iunlink structures, but that's probably also more for when we expect
to drive deeper unlinked lists.

> +	.key_len		= sizeof(xfs_agino_t),
> +	.key_offset		= offsetof(struct xfs_iunlink, iu_next_unlinked),
> +	.head_offset		= offsetof(struct xfs_iunlink, iu_rhash_head),
> +	.automatic_shrinking	= true,
> +	.obj_cmpfn		= xfs_iunlink_obj_cmpfn,
> +};
> +
...
>  /*
>   * Point the AGI unlinked bucket at an inode and log the results.  The caller
>   * is responsible for validating the old value.
> @@ -2055,6 +2216,14 @@ xfs_iunlink(
>  		if (error)
>  			goto out;
>  		ASSERT(old_agino == NULLAGINO);
> +
> +		/*
> +		 * agino has been unlinked, add a backref from the next inode
> +		 * back to agino.
> +		 */
> +		error = xfs_iunlink_add_backref(pag, agino, next_agino);
> +		if (error)
> +			goto out;

At a glance, it looks like -ENOMEM/-EEXIST is possible from
rhashtable_insert_fast(). Do we really want to translate those into a
broader operational failure, or perhaps just skip the hashtable update?
The latter seems more appropriate given that we already account for the
possibility of a missing hashtable entry on lookup.

>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> @@ -2127,6 +2296,17 @@ xfs_iunlink_map_prev(
>  
>  	ASSERT(head_agino != target_agino);
>  
> +	/* See if our backref cache can find it faster. */
> +	next_agino = xfs_iunlink_lookup_backref(pag, target_agino);
> +	if (next_agino != NULLAGINO) {
> +		next_ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, next_agino);
> +		error = xfs_iunlink_map_ino(tp, next_ino, imap, &last_dip,
> +				&last_ibp);
> +		if (error)
> +			return error;

Could we assert or somehow or another verify that
last_dip->di_next_unlinked points at target_agino before we proceed?

> +		goto out;
> +	}
> +
>  	next_agino = head_agino;
>  	while (next_agino != target_agino) {
>  		xfs_agino_t	unlinked_agino;
> @@ -2156,6 +2336,7 @@ xfs_iunlink_map_prev(
>  		next_agino = unlinked_agino;
>  	}
>  
> +out:
>  	/* Should never happen, but don't return garbage. */
>  	if (next_ino == NULLFSINO)
>  		return -EFSCORRUPTED;
> @@ -2218,6 +2399,20 @@ xfs_iunlink_remove(
>  	if (error)
>  		goto out;
>  
> +	/*
> +	 * If there was a backref pointing from the next inode back to this
> +	 * one, remove it because we've removed this inode from the list.
> +	 *
> +	 * Later, if this inode was in the middle of the list we'll update
> +	 * this inode's backref to point from the next inode.
> +	 */
> +	if (next_agino != NULLAGINO) {
> +		error = xfs_iunlink_change_backref(pag, next_agino,
> +				NULLAGINO);
> +		if (error)
> +			goto out;
> +	}
> +
>  	if (head_agino == agino) {
>  		/* Point the head of the list to the next unlinked inode. */
>  		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
> @@ -2236,6 +2431,18 @@ xfs_iunlink_remove(
>  		/* Point the previous inode on the list to the next inode. */
>  		xfs_iunlink_update_dinode(tp, agno, last_ibp, last_dip, &imap,
>  				prev_ino, next_agino);
> +
> +		/*
> +		 * Now we deal with the backref for this inode.  If this inode
> +		 * pointed at a real inode, change the backref that pointed to
> +		 * us to point to our old next.  If this inode was the end of
> +		 * the list, delete the backref that pointed to us.  Note that
> +		 * change_backref takes care of deleting the backref if
> +		 * next_agino is NULLAGINO.
> +		 */

Thanks for the comment.

> +		error = xfs_iunlink_change_backref(pag, agino, next_agino);
> +		if (error)
> +			goto out;

Ok, but the whole lookup path accounts for the possibility of a missing
entry and falls back to the old on-disk walk. It doesn't look like we
handle that properly here. IOW, if the hashtable lookup ever fails and
we do fallback as such, we're guaranteed to eventually fail here.

It seems to me we need to be extra careful so as to not turn in-core
hashtable issues (whether it be external things such as -ENOENT or
internal -ENOMEM or whatever) into fatal filesystem errors.

>  	}
>  	pag->pagi_unlinked_count--;
>  out:
...
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index c634fbfea4a8..f5fb8885662f 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -5078,10 +5078,20 @@ xlog_recover_process_one_iunlink(
>  	agino = be32_to_cpu(dip->di_next_unlinked);
>  	xfs_buf_relse(ibp);
>  
> -	/* Make sure the in-core data knows about this unlinked inode. */
> +	/*
> +	 * Make sure the in-core data knows about this unlinked inode.  Since
> +	 * our iunlinks recovery basically just deletes the head of a bucket
> +	 * list until the bucket is empty, we need only to add the backref from
> +	 * the current list item to the next one, if this isn't the list tail.
> +	 */
>  	pag = xfs_perag_get(mp, agno);
>  	pag->pagi_unlinked_count++;
> +	if (agino != NULLAGINO)
> +		error = xfs_iunlink_add_backref(pag, XFS_INO_TO_AGINO(mp, ino),
> +				agino);

ISTM that fixing the iunlink_remove code to not rely on hashtable
entries could also alleviate the need for this hack?

>  	xfs_perag_put(pag);
> +	if (error)
> +		goto fail_iput;

Similar potential for error amplification here.

Brian

>  
>  	/*
>  	 * Prevent any DMAPI event from being sent when the reference on
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 6ca92a71c233..9a181f7ca1d5 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -151,6 +151,7 @@ xfs_free_perag(
>  		ASSERT(atomic_read(&pag->pag_ref) == 0);
>  		ASSERT(pag->pagi_unlinked_count == 0 ||
>  		       XFS_FORCED_SHUTDOWN(mp));
> +		xfs_iunlink_destroy(pag);
>  		xfs_buf_hash_destroy(pag);
>  		mutex_destroy(&pag->pag_ici_reclaim_lock);
>  		call_rcu(&pag->rcu_head, __xfs_free_perag);
> @@ -229,6 +230,9 @@ xfs_initialize_perag(
>  		/* first new pag is fully initialized */
>  		if (first_initialised == NULLAGNUMBER)
>  			first_initialised = index;
> +		error = xfs_iunlink_init(pag);
> +		if (error)
> +			goto out_hash_destroy;
>  	}
>  
>  	index = xfs_set_inode_alloc(mp, agcount);
> @@ -251,6 +255,7 @@ xfs_initialize_perag(
>  		if (!pag)
>  			break;
>  		xfs_buf_hash_destroy(pag);
> +		xfs_iunlink_destroy(pag);
>  		mutex_destroy(&pag->pag_ici_reclaim_lock);
>  		kmem_free(pag);
>  	}
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 169069a01f3c..dcc45b441dd6 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -391,6 +391,7 @@ typedef struct xfs_perag {
>  
>  	/* unlinked inode info; lock AGI to access */
>  	unsigned int		pagi_unlinked_count;
> +	struct rhashtable	pagi_unlinked_hash;
>  } xfs_perag_t;
>  
>  static inline struct xfs_ag_resv *
>