> On Oct 23, 2022, at 11:26 PM, NeilBrown <neilb@xxxxxxx> wrote:
>
> On Wed, 19 Oct 2022, Chuck Lever wrote:
>> fh_match() is costly, especially when filehandles are large (as is
>> the case for NFSv4). It needs to be used sparingly when searching
>> data structures. Unfortunately, with common workloads, I see
>> multiple thousands of objects stored in file_hashtbl[], which always
>> has only 256 buckets, which makes the hash chains quite lengthy.
>>
>> Walking long hash chains with the state_lock held blocks other
>> activity that needs that lock.
>
> Using a bit-spin-lock for each hash chain would be a simple fix if this
> was the only concern. See for example fscache_hash_cookie().
> I'm not at all against using rhltables, but thought I would mention that
> there are other options.

rhashtable uses bit-spin-locks, and as I said in the other thread,
without the state_lock this all seems to work without crashing,
lockdep or KASAN splat, or test workload failures.

>> To help mitigate the cost of searching with fh_match(), replace the
>> nfs4_file hash table with an rhashtable, which can dynamically
>> resize its bucket array to minimize hash chain length. The ideal for
>> this use case is one bucket per inode.
>>
>> The result is an improvement in the latency of NFSv4 operations
>> and the reduction of nfsd CPU utilization due to the cost of
>> fh_match() and the CPU cache misses incurred while walking long
>> hash chains in the nfs4_file hash table.
>>
>> Note that hash removal is no longer protected by the state_lock.
>> This is because insert_nfs4_file() takes the RCU read lock then the
>> state_lock. rhltable_remove() takes the RCU read lock internally;
>> if remove_nfs4_file() took the state_lock before calling
>> rhltable_remove(), that would result in a lock inversion.
>
> As mentioned separately, lock inversion is not relevant for
> rcu_read_lock().

Understood.
> Also, I cannot see any need for state_lock to be used for protecting
> additions to, or removals from, the rhash table.

Agreed.

>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>> ---
>>  fs/nfsd/nfs4state.c | 215 +++++++++++++++++++++++++++++++++++----------------
>>  fs/nfsd/state.h     |   5 -
>>  2 files changed, 147 insertions(+), 73 deletions(-)
>>
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index 2b850de288cf..a63334ad61f6 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -44,7 +44,9 @@
>>  #include <linux/jhash.h>
>>  #include <linux/string_helpers.h>
>>  #include <linux/fsnotify.h>
>> +#include <linux/rhashtable.h>
>>  #include <linux/nfs_ssc.h>
>> +
>>  #include "xdr4.h"
>>  #include "xdr4cb.h"
>>  #include "vfs.h"
>> @@ -84,6 +86,7 @@ static bool check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
>>  static void nfs4_free_ol_stateid(struct nfs4_stid *stid);
>>  void nfsd4_end_grace(struct nfsd_net *nn);
>>  static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_state *cps);
>> +static void remove_nfs4_file(struct nfs4_file *fi);
>>
>>  /* Locking: */
>>
>> @@ -577,11 +580,8 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>>  void
>>  put_nfs4_file(struct nfs4_file *fi)
>>  {
>> -	might_lock(&state_lock);
>> -
>> -	if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
>> -		hlist_del_rcu(&fi->fi_hash);
>> -		spin_unlock(&state_lock);
>> +	if (refcount_dec_and_test(&fi->fi_ref)) {
>> +		remove_nfs4_file(fi);
>>  		WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
>>  		WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
>>  		call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
>> @@ -695,19 +695,82 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
>>  	return ret & OWNER_HASH_MASK;
>>  }
>>
>> -/* hash table for nfs4_file */
>> -#define FILE_HASH_BITS		8
>> -#define FILE_HASH_SIZE		(1 << FILE_HASH_BITS)
>> +static struct rhltable nfs4_file_rhltable ____cacheline_aligned_in_smp;
>> +
>> +/*
>> + * The returned hash value is based solely on the address of an in-code
>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>> + * pointer is concentrated in its middle bits.
>> + */
>> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
>> +{
>> +	u32 k;
>> +
>> +	k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
>> +	return jhash2(&k, 1, seed);
>
> I still don't think this makes any sense at all.
>
> 	return jhash2(&inode, sizeof(inode)/4, seed);
>
> uses all of the entropy, which is best for rhashtables.

I don't really disagree, but the L1_CACHE_SHIFT was added at Al's
behest. OK, I'll replace it.

>> +}
>>
>> -static unsigned int file_hashval(struct svc_fh *fh)
>> +/**
>> + * nfs4_file_key_hashfn - Compute the hash value of a lookup key
>> + * @data: key on which to compute the hash value
>> + * @len: rhash table's key_len parameter (unused)
>> + * @seed: rhash table's random seed of the day
>> + *
>> + * Return value:
>> + *   Computed 32-bit hash value
>> + */
>> +static u32 nfs4_file_key_hashfn(const void *data, u32 len, u32 seed)
>>  {
>> -	struct inode *inode = d_inode(fh->fh_dentry);
>> +	const struct svc_fh *fhp = data;
>>
>> -	/* XXX: why not (here & in file cache) use inode? */
>> -	return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
>> +	return nfs4_file_inode_hash(d_inode(fhp->fh_dentry), seed);
>>  }
>>
>> -static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
>> +/**
>> + * nfs4_file_obj_hashfn - Compute the hash value of an nfs4_file object
>> + * @data: object on which to compute the hash value
>> + * @len: rhash table's key_len parameter (unused)
>> + * @seed: rhash table's random seed of the day
>> + *
>> + * Return value:
>> + *   Computed 32-bit hash value
>> + */
>> +static u32 nfs4_file_obj_hashfn(const void *data, u32 len, u32 seed)
>> +{
>> +	const struct nfs4_file *fi = data;
>> +
>> +	return nfs4_file_inode_hash(fi->fi_inode, seed);
>> +}
>> +
>> +/**
>> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
>> + * @arg: search criteria
>> + * @ptr: cache item to check
>> + *
>> + * Return values:
>> + *   %0 - Success; item matches search criteria
>> + */
>> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>> +			       const void *ptr)
>> +{
>> +	const struct svc_fh *fhp = arg->key;
>> +	const struct nfs4_file *fi = ptr;
>> +
>> +	return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
>> +}
>
> This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
> obscure.
>
> An rhltable is like an rhashtable except that insertion can never fail.
> Multiple entries with the same key can co-exist in a linked list.
> Lookup does not return an entry but returns a linked list of entries.
>
> For the nfs4_file table this isn't really a good match (though I
> previously thought it was). Insertion must be able to fail if an entry
> with the same filehandle exists.

I think I've arrived at the same conclusion (about nfs4_file insertion
needing to fail for duplicate filehandles). That's more or less
open-coded in my current revision of this work rather than relying on
the behavior of rhltable_insert().
> One approach that might work would be to take the inode->i_lock after a
> lookup fails then repeat the lookup and if it fails again, do the
> insert. This keeps the locking fine-grained but means we don't have to
> depend on insert failing.
>
> So the hash and cmp functions would only look at the inode pointer.
> If lookup returns something, we would walk the list calling fh_match()
> to see if we have the right entry - all under RCU and using
> refcount_inc_not_zero() when we find the matching filehandle.

That's basically what's going on now. But I tried removing obj_cmpfn()
and the rhltable_init() failed outright, so it's still there like a
vestigial organ.

If you now believe rhltable is not a good fit, I can revert all of the
rhltable changes and go back to rhashtable quite easily. Let me do that
and post what I have.

--
Chuck Lever