> On Oct 24, 2022, at 6:02 PM, NeilBrown <neilb@xxxxxxx> wrote:
> 
> On Tue, 25 Oct 2022, Chuck Lever III wrote:
>> 
>>> On Oct 23, 2022, at 11:26 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> 
>>> On Wed, 19 Oct 2022, Chuck Lever wrote:
>>>> +/*
>>>> + * The returned hash value is based solely on the address of an in-core
>>>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>>>> + * pointer is concentrated in its middle bits.
>>>> + */
>>>> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
>>>> +{
>>>> +	u32 k;
>>>> +
>>>> +	k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
>>>> +	return jhash2(&k, 1, seed);
>>> 
>>> I still don't think this makes any sense at all.
>>> 
>>> 	return jhash2(&inode, sizeof(inode)/4, seed);
>>> 
>>> uses all of the entropy, which is best for rhashtables.
>> 
>> I don't really disagree, but the L1_CACHE_SHIFT was added at
>> Al's behest. OK, I'll replace it.
> 
> I think that was in a different context though.
> 
> If you are using a simple fixed array of bucket-chains for a hash
> table, then shifting down L1_CACHE_SHIFT and masking off the number of
> bits to match the array size is a perfectly sensible hash function. It
> will occasionally produce longer chains, but not often.
> 
> But rhashtable does something a bit different. It mixes a seed into
> the key as part of producing the hash, and assumes that if an
> unfortunate distribution of keys results in a substantially non-uniform
> distribution into buckets, then choosing a new random seed will
> re-distribute the keys into buckets and will probably produce a more
> uniform distribution.
> 
> The purpose of this seed is (as I understand it) primarily focused on
> network-facing hash tables where an attacker might be able to choose
> keys that all hash to the same value. The random seed will defeat the
> attacker.
> 
> When hashing inodes there is no opportunity for a deliberate attack,
> and we would probably be happy to not use a seed and accept the small
> possibility of occasional long chains. However the rhashtable code
> doesn't offer us that option. It insists on rehashing if any chain
> exceeds 16. So we need to provide as much entropy as we can, and mix
> the seed in as effectively as we can, to ensure that rehashing really
> works.
> 
>>>> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>>>> +			       const void *ptr)
>>>> +{
>>>> +	const struct svc_fh *fhp = arg->key;
>>>> +	const struct nfs4_file *fi = ptr;
>>>> +
>>>> +	return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
>>>> +}
>>> 
>>> This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
>>> obscure.
>>> 
>>> An rhltable is like an rhashtable except that insertion can never fail.
>>> Multiple entries with the same key can co-exist in a linked list.
>>> Lookup does not return an entry but returns a linked list of entries.
>>> 
>>> For the nfs4_file table this isn't really a good match (though I
>>> previously thought it was). Insertion must be able to fail if an entry
>>> with the same filehandle exists.
>> 
>> I think I've arrived at the same conclusion (about nfs4_file insertion
>> needing to fail for duplicate filehandles). That's more or less
>> open-coded in my current revision of this work rather than relying on
>> the behavior of rhltable_insert().
>> 
>>> One approach that might work would be to take the inode->i_lock after
>>> a lookup fails then repeat the lookup and if it fails again, do the
>>> insert. This keeps the locking fine-grained but means we don't have
>>> to depend on insert failing.
>>> 
>>> So the hash and cmp functions would only look at the inode pointer.
>>> If lookup returns something, we would walk the list calling fh_match()
>>> to see if we have the right entry - all under RCU and using
>>> refcount_inc_not_zero() when we find the matching filehandle.
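
If I'm reading those two paragraphs correctly, the overall shape would
be something like the sketch below. To be clear, this is not final
code: the nfs4_file_rhltable, nfs4_file_rhash_params, fi_rlist, and
find_or_insert_file() names are placeholders of my own, error handling
is elided, and the initial lockless lookup that would precede this slow
path is omitted.

	/*
	 * Assumed setup (placeholders, not final):
	 *   static struct rhltable nfs4_file_rhltable;
	 *   static const struct rhashtable_params nfs4_file_rhash_params;
	 *     (keyed on the struct inode * stored in each nfs4_file)
	 *   struct rhlist_head fi_rlist linkage in struct nfs4_file
	 */
	static struct nfs4_file *find_or_insert_file(struct nfs4_file *new,
						     const struct svc_fh *fhp,
						     struct inode *inode)
	{
		struct rhlist_head *tmp, *list;
		struct nfs4_file *fi;

		rcu_read_lock();
		spin_lock(&inode->i_lock);

		/* Repeat the lookup under i_lock so a racing thread
		 * cannot insert a duplicate behind our back. */
		list = rhltable_lookup(&nfs4_file_rhltable, &inode,
				       nfs4_file_rhash_params);
		rhl_for_each_entry_rcu(fi, tmp, list, fi_rlist) {
			if (!fh_match(&fi->fi_fhandle, &fhp->fh_handle))
				continue;
			if (refcount_inc_not_zero(&fi->fi_ref)) {
				spin_unlock(&inode->i_lock);
				rcu_read_unlock();
				return fi;
			}
		}

		/* Still absent: insert @new. rhltable_insert() never
		 * fails with a "duplicate" error, which is fine here
		 * because i_lock serializes racing inserts. */
		rhltable_insert(&nfs4_file_rhltable, &new->fi_rlist,
				nfs4_file_rhash_params);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return new;
	}
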
>> 
>> That's basically what's going on now. But I tried removing obj_cmpfn()
>> and the rhltable_init() failed outright, so it's still there like a
>> vestigial organ.
> 
> You need an obj_cmpfn if you have an obj_hashfn. If you don't have
> obj_hashfn and just provide hashfn and key_len, then you don't need the
> cmpfn.
> 
> If you have a cmpfn, just compare the inode (the same thing that you
> hash) - don't compare the file handle.
> 
>> 
>> If you now believe rhltable is not a good fit, I can revert all of the
>> rhltable changes and go back to rhashtable quite easily.
> 
> I wasn't very clear, as I was still working out what I thought.
> 
> I think an rhashtable cannot work as you need two different keys, the
> inode and the filehandle.
> 
> I think rhltable can work but it needs help, particularly as it will
> never fail an insertion.
> The help we can provide is to take a lock, perform a lookup, then only
> try to insert if the lookup failed (and we are still holding an
> appropriate lock to stop another thread inserting). Thus any time we
> try an insert, we don't want it to fail.
> 
> The lock I suggest is ->i_lock in the inode.

I will give that a try next.

> The rhltable should only consider the inode as a key, and should
> provide a linked list of nfs4_files with the same inode.

The implementation I've arrived at is an rhltable keyed on the inode
only. The bucket chains are searched with fh_match(). I've gotten rid
of all of the special obj_hash/cmp functions as a result of this
simplification. If I've set up the rhashtable parameters correctly, the
rhashtable code should use jhash/jhash2 on the whole inode pointer by
default.

> We then code a linear search of this linked list (which is expected to
> have one entry) to find if there is an entry with the required file
> handle.

I'm not sure about that expectation: multiple inode pointers could hash
to the same bucket. Also, there are cases where multiple file handles
refer to the same inode (the aliasing that a0ce48375a36 ("nfsd: track
filehandle aliasing in nfs4_files") tries to deal with).

I will post what I have right now. It's not passing all tests due to
very recent changes (and probably because of a lack of insert
serialization). We can pick up from there; I know I've likely missed
some review comments still.

> An alternative is to not use a resizable table at all. We could stick
> with the table we currently have and make small changes.
> 
> 1/ change the compare function to test the inode pointer first and only
>    call fh_match when the inodes match. This should make it much
>    faster.

I had the same thought, but this doesn't reduce the number of CPU cache
misses I observed from long bucket chain walks. I'd prefer a solution
that maintains small buckets.

> 2/ Instead of using state_lock for locking, use a bit-spin-lock on the
>    lsb of the bucket pointers in the array to lock each bucket. This
>    will substantially reduce locking conflicts.
> 
> I don't see a big performance difference between this approach and the
> rhltable approach. It really depends which you would feel more
> comfortable working with. Would you rather work with common code which
> isn't quite a perfect fit, or with private code that you have complete
> control over?
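
Just so we're comparing the same things: I picture the fixed-table
alternative roughly as below, using the existing hlist_bl helpers to
get the bit-spin-lock on the low bit of each bucket head pointer that
you describe. Again a sketch only; the fi_hashbl and fi_inode members
and the hlist_bl conversion of file_hashtbl are assumptions of mine.

	#include <linux/list_bl.h>

	/* The current fixed-size table, but with bit-lockable heads. */
	static struct hlist_bl_head file_hashtbl[FILE_HASH_SIZE];

	static struct nfs4_file *insert_file(struct nfs4_file *new,
					     const struct svc_fh *fhp,
					     unsigned int hashval)
	{
		struct hlist_bl_head *bucket = &file_hashtbl[hashval];
		struct hlist_bl_node *node;
		struct nfs4_file *fi;

		/* Serializes only this bucket, not the whole table. */
		hlist_bl_lock(bucket);
		hlist_bl_for_each_entry(fi, node, bucket, fi_hashbl) {
			/* Cheap pointer test first; call fh_match()
			 * only when the inodes actually match. */
			if (fi->fi_inode != new->fi_inode)
				continue;
			if (!fh_match(&fi->fi_fhandle, &fhp->fh_handle))
				continue;
			if (refcount_inc_not_zero(&fi->fi_ref)) {
				hlist_bl_unlock(bucket);
				return fi;
			}
		}
		hlist_bl_add_head(&new->fi_hashbl, bucket);
		hlist_bl_unlock(bucket);
		return new;
	}

That addresses the lock contention, but it still leaves the bucket
chains as long as they are today, which is the part I'm trying to fix.
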
I think rhltable still has some legs, so I will keep chipping away at
that.

--
Chuck Lever