> On Oct 24, 2022, at 6:02 PM, NeilBrown <neilb@xxxxxxx> wrote:
> 
> On Tue, 25 Oct 2022, Chuck Lever III wrote:
>> 
>>> On Oct 23, 2022, at 11:26 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> 
>>> On Wed, 19 Oct 2022, Chuck Lever wrote:
>>>> +/*
>>>> + * The returned hash value is based solely on the address of an in-core
>>>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>>>> + * pointer is concentrated in its middle bits.
>>>> + */
>>>> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
>>>> +{
>>>> +	u32 k;
>>>> +
>>>> +	k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
>>>> +	return jhash2(&k, 1, seed);
>>> 
>>> I still don't think this makes any sense at all.
>>> 
>>> 	return jhash2(&inode, sizeof(inode)/4, seed);
>>> 
>>> uses all of the entropy, which is best for rhashtables.
>> 
>> I don't really disagree, but the L1_CACHE_SHIFT was added at
>> Al's behest. OK, I'll replace it.
> 
> I think that was in a different context though.
> 
> If you are using a simple fixed array of bucket-chains for a hash
> table, then shifting down L1_CACHE_SHIFT and masking off the number of
> bits to match the array size is a perfectly sensible hash function. It
> will occasionally produce longer chains, but not often.
> 
> But rhashtable does something a bit different. It mixes a seed into
> the key as part of producing the hash, and assumes that if an
> unfortunate distribution of keys results in a substantially non-uniform
> distribution into buckets, then choosing a new random seed will
> re-distribute the keys into buckets and will probably produce a more
> uniform distribution.
> 
> The purpose of this seed is (as I understand it) primarily focused on
> network-facing hash tables where an attacker might be able to choose
> keys that all hash to the same value. The random seed will defeat the
> attacker.
> 
> When hashing inodes there is no opportunity for a deliberate attack,
> and we would probably be happy to not use a seed and accept the small
> possibility of occasional long chains. However the rhashtable code
> doesn't offer us that option. It insists on rehashing if any chain
> exceeds 16. So we need to provide as much entropy as we can, and mix
> the seed in as effectively as we can, to ensure that rehashing really
> works.
> 
>>>> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>>>> +			       const void *ptr)
>>>> +{
>>>> +	const struct svc_fh *fhp = arg->key;
>>>> +	const struct nfs4_file *fi = ptr;
>>>> +
>>>> +	return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
>>>> +}
>>> 
>>> This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
>>> obscure.
>>> 
>>> An rhltable is like an rhashtable except that insertion can never fail.
>>> Multiple entries with the same key can co-exist in a linked list.
>>> Lookup does not return an entry but returns a linked list of entries.
>>> 
>>> For the nfs4_file table this isn't really a good match (though I
>>> previously thought it was). Insertion must be able to fail if an entry
>>> with the same filehandle exists.
>> 
>> I think I've arrived at the same conclusion (about nfs4_file insertion
>> needing to fail for duplicate filehandles). That's more or less
>> open-coded in my current revision of this work rather than relying on
>> the behavior of rhltable_insert().
>> 
>>> One approach that might work would be to take the inode->i_lock after
>>> a lookup fails then repeat the lookup and if it fails again, do the
>>> insert. This keeps the locking fine-grained but means we don't have
>>> to depend on insert failing.
>>> 
>>> So the hash and cmp functions would only look at the inode pointer.
>>> If lookup returns something, we would walk the list calling fh_match()
>>> to see if we have the right entry - all under RCU and using
>>> refcount_inc_not_zero() when we find the matching filehandle.
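
If I'm reading those two paragraphs correctly, the overall shape would
be something like the sketch below. To be clear, this is not final
code: the nfs4_file_rhltable, nfs4_file_rhash_params, fi_rlist, and
find_or_insert_file() names are placeholders of my own, error handling
is elided, and the initial lockless lookup that would precede this slow
path is omitted.

	/*
	 * Assumed setup (placeholders, not final):
	 *   static struct rhltable nfs4_file_rhltable;
	 *   static const struct rhashtable_params nfs4_file_rhash_params;
	 *     (keyed on the struct inode * stored in each nfs4_file)
	 *   struct rhlist_head fi_rlist linkage in struct nfs4_file
	 */
	static struct nfs4_file *find_or_insert_file(struct nfs4_file *new,
						     const struct svc_fh *fhp,
						     struct inode *inode)
	{
		struct rhlist_head *tmp, *list;
		struct nfs4_file *fi;

		rcu_read_lock();
		spin_lock(&inode->i_lock);

		/* Repeat the lookup under i_lock so a racing thread
		 * cannot insert a duplicate behind our back. */
		list = rhltable_lookup(&nfs4_file_rhltable, &inode,
				       nfs4_file_rhash_params);
		rhl_for_each_entry_rcu(fi, tmp, list, fi_rlist) {
			if (!fh_match(&fi->fi_fhandle, &fhp->fh_handle))
				continue;
			if (refcount_inc_not_zero(&fi->fi_ref)) {
				spin_unlock(&inode->i_lock);
				rcu_read_unlock();
				return fi;
			}
		}

		/* Still absent: insert @new. rhltable_insert() never
		 * fails with a "duplicate" error, which is fine here
		 * because i_lock serializes racing inserts. */
		rhltable_insert(&nfs4_file_rhltable, &new->fi_rlist,
				nfs4_file_rhash_params);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return new;
	}
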
>> 
>> That's basically what's going on now. But I tried removing obj_cmpfn()
>> and the rhltable_init() failed outright, so it's still there like a
>> vestigial organ.
> 
> You need an obj_cmpfn if you have an obj_hashfn. If you don't have
> obj_hashfn and just provide hashfn and key_len, then you don't need the
> cmpfn.
> 
> If you have a cmpfn, just compare the inode (the same thing that you
> hash) - don't compare the file handle.
> 
>> 
>> If you now believe rhltable is not a good fit, I can revert all of the
>> rhltable changes and go back to rhashtable quite easily.
> 
> I wasn't very clear, as I was still working out what I thought.
> 
> I think an rhashtable cannot work as you need two different keys, the
> inode and the filehandle.
> 
> I think rhltable can work but it needs help, particularly as it will
> never fail an insertion.
> The help we can provide is to take a lock, perform a lookup, then only
> try to insert if the lookup failed (and we are still holding an
> appropriate lock to stop another thread inserting). Thus any time we
> try an insert, we don't want it to fail.
> 
> The lock I suggest is ->i_lock in the inode.

I will give that a try next.

> The rhltable should only consider the inode as a key, and should
> provide a linked list of nfs4_files with the same inode.

The implementation I've arrived at is an rhltable keyed on the inode
only. The bucket chains are searched with fh_match(). I've gotten rid
of all of the special obj_hash/cmp functions as a result of this
simplification. If I've set up the rhashtable parameters correctly, the
rhashtable code should use jhash/jhash2 on the whole inode pointer by
default.

> We then code a linear search of this linked list (which is expected to
> have one entry) to find if there is an entry with the required file
> handle.

I'm not sure about that expectation: multiple inode pointers could hash
to the same bucket. Also, there are cases where multiple file handles
refer to the same inode (the aliasing that a0ce48375a36 ("nfsd: track
filehandle aliasing in nfs4_files") tries to deal with).

I will post what I have right now. It's not passing all tests due to
very recent changes (and probably because of a lack of insert
serialization). We can pick up from there; I know I've likely missed
some review comments still.

> An alternative is to not use a resizable table at all. We could stick
> with the table we currently have and make small changes.
> 
> 1/ change the compare function to test the inode pointer first and only
>    call fh_match when the inodes match. This should make it much
>    faster.

I had the same thought, but this doesn't reduce the number of CPU cache
misses I observed from long bucket chain walks. I'd prefer a solution
that maintains small buckets.

> 2/ Instead of using state_lock for locking, use a bit-spin-lock on the
>    lsb of the bucket pointers in the array to lock each bucket. This
>    will substantially reduce locking conflicts.
> 
> I don't see a big performance difference between this approach and the
> rhltable approach. It really depends which you would feel more
> comfortable working with. Would you rather work with common code which
> isn't quite a perfect fit, or with private code that you have complete
> control over?
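
Just so we're comparing the same things: I picture the fixed-table
alternative roughly as below, using the existing hlist_bl helpers to
get the bit-spin-lock on the low bit of each bucket head pointer that
you describe. Again a sketch only; the fi_hashbl and fi_inode members
and the hlist_bl conversion of file_hashtbl are assumptions of mine.

	#include <linux/list_bl.h>

	/* The current fixed-size table, but with bit-lockable heads. */
	static struct hlist_bl_head file_hashtbl[FILE_HASH_SIZE];

	static struct nfs4_file *insert_file(struct nfs4_file *new,
					     const struct svc_fh *fhp,
					     unsigned int hashval)
	{
		struct hlist_bl_head *bucket = &file_hashtbl[hashval];
		struct hlist_bl_node *node;
		struct nfs4_file *fi;

		/* Serializes only this bucket, not the whole table. */
		hlist_bl_lock(bucket);
		hlist_bl_for_each_entry(fi, node, bucket, fi_hashbl) {
			/* Cheap pointer test first; call fh_match()
			 * only when the inodes actually match. */
			if (fi->fi_inode != new->fi_inode)
				continue;
			if (!fh_match(&fi->fi_fhandle, &fhp->fh_handle))
				continue;
			if (refcount_inc_not_zero(&fi->fi_ref)) {
				hlist_bl_unlock(bucket);
				return fi;
			}
		}
		hlist_bl_add_head(&new->fi_hashbl, bucket);
		hlist_bl_unlock(bucket);
		return new;
	}

That addresses the lock contention, but it still leaves the bucket
chains as long as they are today, which is the part I'm trying to fix.
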
I think rhltable still has some legs, so I will keep chipping away at
that.

--
Chuck Lever