On Tue, 25 Oct 2022, Chuck Lever III wrote:
> 
> 
> > On Oct 24, 2022, at 7:43 PM, NeilBrown <neilb@xxxxxxx> wrote:
> > 
> > On Tue, 25 Oct 2022, Chuck Lever wrote:
> >> fh_match() is costly, especially when filehandles are large (as is
> >> the case for NFSv4). It needs to be used sparingly when searching
> >> data structures. Unfortunately, with common workloads, I see
> >> multiple thousands of objects stored in file_hashtbl[], which always
> >> has only 256 buckets, which makes the hash chains quite lengthy.
> >> 
> >> Walking long hash chains with the state_lock held blocks other
> >> activity that needs that lock.
> >> 
> >> To help mitigate the cost of searching with fh_match(), replace the
> >> nfs4_file hash table with an rhashtable, which can dynamically
> >> resize its bucket array to minimize hash chain length. The ideal for
> >> this use case is one bucket per inode.
> >> 
> >> The result is an improvement in the latency of NFSv4 operations
> >> and the reduction of nfsd CPU utilization due to the cost of
> >> fh_match() and the CPU cache misses incurred while walking long
> >> hash chains in the nfs4_file hash table.
> >> 
> >> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> >> ---
> >>  fs/nfsd/nfs4state.c |   64 +++++++++++++++++++++++++--------------------------
> >>  fs/nfsd/state.h     |    4 ---
> >>  2 files changed, 31 insertions(+), 37 deletions(-)
> >> 
> >> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> >> index 681cb2daa843..5b90e5a6a04f 100644
> >> --- a/fs/nfsd/nfs4state.c
> >> +++ b/fs/nfsd/nfs4state.c
> >> @@ -591,11 +591,8 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
> >>  void
> >>  put_nfs4_file(struct nfs4_file *fi)
> >>  {
> >> -	might_lock(&state_lock);
> >> -
> >> -	if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
> >> +	if (refcount_dec_and_test(&fi->fi_ref)) {
> >>  		remove_nfs4_file_locked(fi);
> >> -		spin_unlock(&state_lock);
> >>  		WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
> >>  		WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
> >>  		call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
> >> @@ -709,20 +706,6 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
> >>  	return ret & OWNER_HASH_MASK;
> >>  }
> >>  
> >> -/* hash table for nfs4_file */
> >> -#define FILE_HASH_BITS		8
> >> -#define FILE_HASH_SIZE		(1 << FILE_HASH_BITS)
> >> -
> >> -static unsigned int file_hashval(const struct svc_fh *fh)
> >> -{
> >> -	struct inode *inode = d_inode(fh->fh_dentry);
> >> -
> >> -	/* XXX: why not (here & in file cache) use inode? */
> >> -	return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
> >> -}
> >> -
> >> -static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
> >> -
> >>  static struct rhltable nfs4_file_rhltable ____cacheline_aligned_in_smp;
> >>  
> >>  static const struct rhashtable_params nfs4_file_rhash_params = {
> >> @@ -4683,12 +4666,13 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
> >>  static noinline_for_stack struct nfs4_file *
> >>  find_nfs4_file(const struct svc_fh *fhp)
> >>  {
> >> -	unsigned int hashval = file_hashval(fhp);
> >> +	struct rhlist_head *tmp, *list;
> >>  	struct nfs4_file *fi = NULL;
> >>  
> >>  	rcu_read_lock();
> >> -	hlist_for_each_entry_rcu(fi, &file_hashtbl[hashval], fi_hash,
> >> -				 lockdep_is_held(&state_lock)) {
> >> +	list = rhltable_lookup(&nfs4_file_rhltable, fhp,
> >> +			       nfs4_file_rhash_params);
> >> +	rhl_for_each_entry_rcu(fi, tmp, list, fi_rlist) {
> >>  		if (fh_match(&fi->fi_fhandle, &fhp->fh_handle)) {
> >>  			if (!refcount_inc_not_zero(&fi->fi_ref))
> >>  				fi = NULL;
> >> @@ -4708,33 +4692,45 @@ find_nfs4_file(const struct svc_fh *fhp)
> >>  static noinline_for_stack struct nfs4_file *
> >>  insert_nfs4_file(struct nfs4_file *new, const struct svc_fh *fhp)
> >>  {
> >> -	unsigned int hashval = file_hashval(fhp);
> >> +	struct rhlist_head *tmp, *list;
> >>  	struct nfs4_file *ret = NULL;
> >>  	bool alias_found = false;
> >>  	struct nfs4_file *fi;
> >> +	int err;
> >>  
> >> -	spin_lock(&state_lock);
> >> -	hlist_for_each_entry_rcu(fi, &file_hashtbl[hashval], fi_hash,
> >> -				 lockdep_is_held(&state_lock)) {
> >> +	rcu_read_lock();
> >> +
> > 
> > As mentioned separately, we need some sort of lock around this loop and
> > the later insert.
> 
> Yep. I just ran out of time today.
> 
> >> +	list = rhltable_lookup(&nfs4_file_rhltable, fhp,
> >> +			       nfs4_file_rhash_params);
> >> +	rhl_for_each_entry_rcu(fi, tmp, list, fi_rlist) {
> >>  		if (fh_match(&fi->fi_fhandle, &fhp->fh_handle)) {
> >>  			if (refcount_inc_not_zero(&fi->fi_ref))
> >>  				ret = fi;
> >>  		} else if (d_inode(fhp->fh_dentry) == fi->fi_inode)
> >>  			fi->fi_aliased = alias_found = true;
> > 
> > This d_inode() == fi->fi_inode test is no longer relevant.  Everything in
> > the list must have the same inode.  Best remove it.
> 
> My understanding is that more than one inode can hash into one of
> these bucket lists. I don't see how rhltable_insert() can prevent
> that from happening, and that will certainly be true if we go back
> to using a fixed-size table.

With rhltable, each bucket is a list of lists.  An rhlist_head contains
two pointers.  "rhead.next" points to the next entry in the bucket,
which will have a different key (a different inode, in this case).
"next" points to the next entry with the same key (a different
filehandle but the same inode, in this case).

The list you get back from rhltable_lookup() isn't the full bucket
list.  It is just the list, within the bucket, of all elements with the
same key.

NeilBrown

> 
> I will try to test this assumption before posting the next revision.
>