> On Oct 23, 2022, at 11:26 PM, NeilBrown <neilb@xxxxxxx> wrote:
>
> On Wed, 19 Oct 2022, Chuck Lever wrote:
>> fh_match() is costly, especially when filehandles are large (as is
>> the case for NFSv4). It needs to be used sparingly when searching
>> data structures. Unfortunately, with common workloads, I see
>> multiple thousands of objects stored in file_hashtbl[], which always
>> has only 256 buckets, which makes the hash chains quite lengthy.
>>
>> Walking long hash chains with the state_lock held blocks other
>> activity that needs that lock.
>
> Using a bit-spin-lock for each hash chain would be a simple fix if this
> was the only concern. See for example fscache_hash_cookie().
> I'm not at all against using rhltables, but thought I would mention that
> there are other options.

rhashtable uses bit-spin-locks, and as I said in the other thread,
without the state_lock this all seems to work without crashing,
lockdep or KASAN splat, or test workload failures.

>> To help mitigate the cost of searching with fh_match(), replace the
>> nfs4_file hash table with an rhashtable, which can dynamically
>> resize its bucket array to minimize hash chain length. The ideal for
>> this use case is one bucket per inode.
>>
>> The result is an improvement in the latency of NFSv4 operations
>> and the reduction of nfsd CPU utilization due to the cost of
>> fh_match() and the CPU cache misses incurred while walking long
>> hash chains in the nfs4_file hash table.
>>
>> Note that hash removal is no longer protected by the state_lock.
>> This is because insert_nfs4_file() takes the RCU read lock then the
>> state_lock. rhltable_remove() takes the RCU read lock internally;
>> if remove_nfs4_file() took the state_lock before calling
>> rhltable_remove(), that would result in a lock inversion.
>
> As mentioned separately, lock inversion is not relevant for
> rcu_read_lock().

Understood.
> Also, I cannot see any need for state_lock to be used for protecting
> additions to, or removals from, the rhash table.

Agreed.

>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>> ---
>>  fs/nfsd/nfs4state.c | 215 +++++++++++++++++++++++++++++++++++----------------
>>  fs/nfsd/state.h     |   5 -
>>  2 files changed, 147 insertions(+), 73 deletions(-)
>>
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index 2b850de288cf..a63334ad61f6 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -44,7 +44,9 @@
>>  #include <linux/jhash.h>
>>  #include <linux/string_helpers.h>
>>  #include <linux/fsnotify.h>
>> +#include <linux/rhashtable.h>
>>  #include <linux/nfs_ssc.h>
>> +
>>  #include "xdr4.h"
>>  #include "xdr4cb.h"
>>  #include "vfs.h"
>> @@ -84,6 +86,7 @@ static bool check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
>>  static void nfs4_free_ol_stateid(struct nfs4_stid *stid);
>>  void nfsd4_end_grace(struct nfsd_net *nn);
>>  static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_state *cps);
>> +static void remove_nfs4_file(struct nfs4_file *fi);
>>
>>  /* Locking: */
>>
>> @@ -577,11 +580,8 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>>  void
>>  put_nfs4_file(struct nfs4_file *fi)
>>  {
>> -	might_lock(&state_lock);
>> -
>> -	if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
>> -		hlist_del_rcu(&fi->fi_hash);
>> -		spin_unlock(&state_lock);
>> +	if (refcount_dec_and_test(&fi->fi_ref)) {
>> +		remove_nfs4_file(fi);
>>  		WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
>>  		WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
>>  		call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
>> @@ -695,19 +695,82 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
>>  	return ret & OWNER_HASH_MASK;
>>  }
>>
>> -/* hash table for nfs4_file */
>> -#define FILE_HASH_BITS		8
>> -#define FILE_HASH_SIZE		(1 << FILE_HASH_BITS)
>> +static struct rhltable nfs4_file_rhltable ____cacheline_aligned_in_smp;
>> +
>> +/*
>> + * The returned hash value is based solely on the address of an in-code
>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>> + * pointer is concentrated in its middle bits.
>> + */
>> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
>> +{
>> +	u32 k;
>> +
>> +	k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
>> +	return jhash2(&k, 1, seed);
>
> I still don't think this makes any sense at all.
>
> 	return jhash2(&inode, sizeof(inode)/4, seed);
>
> uses all of the entropy, which is best for rhashtables.

I don't really disagree, but the L1_CACHE_SHIFT was added at Al's
behest. OK, I'll replace it.

>> +}
>>
>> -static unsigned int file_hashval(struct svc_fh *fh)
>> +/**
>> + * nfs4_file_key_hashfn - Compute the hash value of a lookup key
>> + * @data: key on which to compute the hash value
>> + * @len: rhash table's key_len parameter (unused)
>> + * @seed: rhash table's random seed of the day
>> + *
>> + * Return value:
>> + *   Computed 32-bit hash value
>> + */
>> +static u32 nfs4_file_key_hashfn(const void *data, u32 len, u32 seed)
>>  {
>> -	struct inode *inode = d_inode(fh->fh_dentry);
>> +	const struct svc_fh *fhp = data;
>>
>> -	/* XXX: why not (here & in file cache) use inode? */
>> -	return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
>> +	return nfs4_file_inode_hash(d_inode(fhp->fh_dentry), seed);
>>  }
>>
>> -static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
>> +/**
>> + * nfs4_file_obj_hashfn - Compute the hash value of an nfs4_file object
>> + * @data: object on which to compute the hash value
>> + * @len: rhash table's key_len parameter (unused)
>> + * @seed: rhash table's random seed of the day
>> + *
>> + * Return value:
>> + *   Computed 32-bit hash value
>> + */
>> +static u32 nfs4_file_obj_hashfn(const void *data, u32 len, u32 seed)
>> +{
>> +	const struct nfs4_file *fi = data;
>> +
>> +	return nfs4_file_inode_hash(fi->fi_inode, seed);
>> +}
>> +
>> +/**
>> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
>> + * @arg: search criteria
>> + * @ptr: cache item to check
>> + *
>> + * Return values:
>> + *   %0 - Success; item matches search criteria
>> + */
>> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>> +			       const void *ptr)
>> +{
>> +	const struct svc_fh *fhp = arg->key;
>> +	const struct nfs4_file *fi = ptr;
>> +
>> +	return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
>> +}
>
> This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
> obscure.
>
> An rhltable is like an rhashtable except that insertion can never fail.
> Multiple entries with the same key can co-exist in a linked list.
> Lookup does not return an entry but returns a linked list of entries.
>
> For the nfs4_file table this isn't really a good match (though I
> previously thought it was). Insertion must be able to fail if an entry
> with the same filehandle exists.

I think I've arrived at the same conclusion (about nfs4_file insertion
needing to fail for duplicate filehandles). That's more or less
open-coded in my current revision of this work rather than relying on
the behavior of rhltable_insert().
> One approach that might work would be to take the inode->i_lock after a
> lookup fails then repeat the lookup and if it fails again, do the
> insert. This keeps the locking fine-grained but means we don't have to
> depend on insert failing.
>
> So the hash and cmp functions would only look at the inode pointer.
> If lookup returns something, we would walk the list calling fh_match()
> to see if we have the right entry - all under RCU and using
> refcount_inc_not_zero() when we find the matching filehandle.

That's basically what's going on now. But I tried removing obj_cmpfn()
and the rhltable_init() failed outright, so it's still there like a
vestigial organ.

If you now believe rhltable is not a good fit, I can revert all of the
rhltable changes and go back to rhashtable quite easily. Let me do that
and post what I have.

--
Chuck Lever