> On Oct 25, 2022, at 1:18 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Mon, 2022-10-24 at 22:30 +0000, Chuck Lever III wrote:
>>
>>> On Oct 24, 2022, at 6:02 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>>
>>> On Tue, 25 Oct 2022, Chuck Lever III wrote:
>>>>
>>>>> On Oct 23, 2022, at 11:26 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>>>>
>>>>> On Wed, 19 Oct 2022, Chuck Lever wrote:
>>>>>> +/*
>>>>>> + * The returned hash value is based solely on the address of an in-core
>>>>>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>>>>>> + * pointer is concentrated in its middle bits.
>>>>>> + */
>>>>>> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
>>>>>> +{
>>>>>> +	u32 k;
>>>>>> +
>>>>>> +	k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
>>>>>> +	return jhash2(&k, 1, seed);
>>>>>
>>>>> I still don't think this makes any sense at all.
>>>>>
>>>>>	return jhash2(&inode, sizeof(inode)/4, seed);
>>>>>
>>>>> uses all of the entropy, which is best for rhashtables.
>>>>
>>>> I don't really disagree, but the L1_CACHE_SHIFT was added at
>>>> Al's behest. OK, I'll replace it.
>>>
>>> I think that was in a different context, though.
>>>
>>> If you are using a simple fixed array of bucket chains for a hash
>>> table, then shifting down L1_CACHE_SHIFT and masking off the number of
>>> bits to match the array size is a perfectly sensible hash function. It
>>> will occasionally produce longer chains, but not often.
>>>
>>> But rhashtable does something a bit different. It mixes a seed into
>>> the key as part of producing the hash, and assumes that if an
>>> unfortunate distribution of keys results in a substantially non-uniform
>>> distribution into buckets, then choosing a new random seed will
>>> re-distribute the keys into buckets and will probably produce a more
>>> uniform distribution.
>>>
>>> The purpose of this seed is (as I understand it) primarily focused on
>>> network-facing hash tables where an attacker might be able to choose
>>> keys that all hash to the same value. The random seed will defeat the
>>> attacker.
>>>
>>> When hashing inodes there is no opportunity for a deliberate attack, and
>>> we would probably be happy to not use a seed and accept the small
>>> possibility of occasional long chains. However the rhashtable code
>>> doesn't offer us that option. It insists on rehashing if any chain
>>> exceeds 16. So we need to provide as much entropy as we can, and mix the
>>> seed in as effectively as we can, to ensure that rehashing really works.
>>>
>>>>>> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>>>>>> +			       const void *ptr)
>>>>>> +{
>>>>>> +	const struct svc_fh *fhp = arg->key;
>>>>>> +	const struct nfs4_file *fi = ptr;
>>>>>> +
>>>>>> +	return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
>>>>>> +}
>>>>>
>>>>> This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
>>>>> obscure.
>>>>>
>>>>> An rhltable is like an rhashtable except that insertion can never fail.
>>>>> Multiple entries with the same key can co-exist in a linked list.
>>>>> Lookup does not return an entry but returns a linked list of entries.
>>>>>
>>>>> For the nfs4_file table this isn't really a good match (though I
>>>>> previously thought it was). Insertion must be able to fail if an entry
>>>>> with the same filehandle exists.
>>>>
>>>> I think I've arrived at the same conclusion (about nfs4_file insertion
>>>> needing to fail for duplicate filehandles). That's more or less
>>>> open-coded in my current revision of this work rather than relying on
>>>> the behavior of rhltable_insert().
>>>>
>>>>> One approach that might work would be to take the inode->i_lock after a
>>>>> lookup fails, then repeat the lookup, and if it fails again, do the
>>>>> insert. This keeps the locking fine-grained but means we don't have to
>>>>> depend on insert failing.
>>>>>
>>>>> So the hash and cmp functions would only look at the inode pointer.
>>>>> If lookup returns something, we would walk the list calling fh_match()
>>>>> to see if we have the right entry - all under RCU and using
>>>>> refcount_inc_not_zero() when we find the matching filehandle.
>>>>
>>>> That's basically what's going on now. But I tried removing obj_cmpfn()
>>>> and the rhltable_init() failed outright, so it's still there like a
>>>> vestigial organ.
>>>
>>> You need an obj_cmpfn if you have an obj_hashfn. If you don't have
>>> obj_hashfn and just provide hashfn and key_len, then you don't need the
>>> cmpfn.
>>>
>>> If you have a cmpfn, just compare the inode (the same thing that you
>>> hash) - don't compare the file handle.
>>>
>>>> If you now believe rhltable is not a good fit, I can revert all of the
>>>> rhltable changes and go back to rhashtable quite easily.
>>>
>>> I wasn't very clear, as I was still working out what I thought.
>>>
>>> I think an rhashtable cannot work, as you need two different keys: the
>>> inode and the filehandle.
>>>
>>> I think rhltable can work, but it needs help, particularly as it will
>>> never fail an insertion.
>>> The help we can provide is to take a lock, perform a lookup, then only
>>> try to insert if the lookup failed (and we are still holding an
>>> appropriate lock to stop another thread inserting). Thus any time we do
>>> try an insert, it should not fail.
>>>
>>> The lock I suggest is ->i_lock in the inode.
>>
>> I will give that a try next.
>>
>>> The rhltable should only consider the inode as a key, and should provide
>>> a linked list of nfs4_files with the same inode.
>>
>> The implementation I've arrived at is rhltable keyed on inode only.
>> The bucket chains are searched with fh_match().
>>
>> I've gotten rid of all of the special obj_hash/cmp functions as a
>> result of this simplification. If I've set up the rhashtable
>> parameters correctly, the rhashtable code should use jhash/jhash2
>> on the whole inode pointer by default.
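
To make that concrete, here is roughly what my current revision sets up.
Treat this as a sketch of work in progress rather than final code: the
nfs4_file field names (fi_inode for the cached inode pointer, fi_rlist
for the rhlist linkage) and the min_size choice are simply what my tree
happens to use today.

	static const struct rhashtable_params nfs4_file_rhash_params = {
		.key_len		= sizeof_field(struct nfs4_file, fi_inode),
		.key_offset		= offsetof(struct nfs4_file, fi_inode),
		.head_offset		= offsetof(struct nfs4_file, fi_rlist),

		/* Reduce resizing churn on light workloads */
		.min_size		= 256,
		.automatic_shrinking	= true,
	};

My reading of rht_key_get_hash() is that, with no hashfn, obj_hashfn,
or obj_cmpfn supplied, the key_len bytes at key_offset - here, exactly
the inode pointer - are hashed with jhash, or jhash2 when key_len is a
multiple of four, and the per-table seed is mixed in as the initval.
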
>>> We then code a linear search of this linked list (which is expected to
>>> have one entry) to find if there is an entry with the required file
>>> handle.
>>
>> I'm not sure about that expectation: multiple inode pointers could
>> hash to the same bucket. Also, there are cases where multiple file
>> handles refer to the same inode (the aliasing that a0ce48375a36
>> ("nfsd: track filehandle aliasing in nfs4_files") tries to deal
>> with).
>>
>> I will post what I have right now. It's not passing all tests due
>> to very recent changes (and probably because of lack of insert
>> serialization). We can pick up from there; I know I've likely missed
>> some review comments still.
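
For the insert serialization, what I have in mind is something along
these lines, combining Neil's ->i_lock suggestion with the rhltable
keyed on the inode pointer. An untested sketch: the function name, the
nfs4_file_rhltable global, and the fi_rlist field are placeholders,
not code from the posted series.

	/*
	 * Under inode->i_lock, search the inode's bucket list for an
	 * nfs4_file with a matching file handle; insert @new only when
	 * no match is found. Holding i_lock across both steps prevents
	 * a racing thread from inserting a duplicate in between.
	 */
	static struct nfs4_file *
	nfs4_file_find_or_insert(struct nfs4_file *new, struct svc_fh *fhp)
	{
		struct inode *inode = d_inode(fhp->fh_dentry);
		struct rhlist_head *tmp, *list;
		struct nfs4_file *fi;

		rcu_read_lock();
		spin_lock(&inode->i_lock);

		list = rhltable_lookup(&nfs4_file_rhltable, &inode,
				       nfs4_file_rhash_params);
		rhl_for_each_entry_rcu(fi, tmp, list, fi_rlist) {
			if (fh_match(&fi->fi_fhandle, &fhp->fh_handle) &&
			    refcount_inc_not_zero(&fi->fi_ref))
				goto out_unlock;
		}

		fi = NULL;
		if (rhltable_insert(&nfs4_file_rhltable, &new->fi_rlist,
				    nfs4_file_rhash_params) == 0)
			fi = new;
	out_unlock:
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return fi;
	}

A caller that gets back an existing file would drop its preallocated
@new; a NULL return would mean rhltable_insert() failed, which should
not happen once inserts are serialized this way.
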
> Thanks. I've started looking over this and some other changes, and found
> at least a couple of problems with the existing code (not necessarily
> due to your patches, fwiw):
>
> 1/ the nf_lru is the linkage to the LRU, but it's also used when we move
> an object to a dispose list. I don't see anything that prevents you from
> trying to add an entry back onto the LRU while it's still on a dispose
> list (which could cause corruption). I think we need some clear rules
> around how that field gets used.

I chased this when converting the filecache to rhashtable. There don't
seem to be any logic errors in the current code, and it seems to be a
common trope in other consumers of list_lru, so I left it alone. Yeah,
it makes me nervous too.

> 2/ nfsd_file_close_inode_sync can end up putting an nf onto the dispose
> list without ensuring that nf_ref has gone to zero. I think there are
> some situations where the file can end up being freed out from under a
> valid reference due to that.
>
> I'm working on some patches to clean this up along the lines of the
> scheme Neil is advocating for. For now, I'm basing this on top of your
> series.

I'll try to refresh kernel.org:cel for-next later today.

>>> An alternative is to not use a resizable table at all. We could stick
>>> with the table we currently have and make small changes.
>>>
>>> 1/ change the compare function to test the inode pointer first and only
>>>    call fh_match when the inodes match. This should make it much
>>>    faster.
>>
>> I had the same thought, but this doesn't reduce the number of CPU
>> cache misses I observed from long bucket chain walks. I'd prefer a
>> solution that maintains small buckets.
>>
>>> 2/ Instead of using state_lock for locking, use a bit-spin-lock on the
>>>    lsb of the bucket pointers in the array to lock each bucket. This
>>>    will substantially reduce locking conflicts.
>>>
>>> I don't see a big performance difference between this approach and the
>>> rhltable approach. It really depends which you would feel more
>>> comfortable working with. Would you rather work with common code which
>>> isn't quite a perfect fit, or with private code that you have complete
>>> control over?
>>
>> I think rhltable still has some legs, will keep chipping away at that.
>>
>> --
>> Chuck Lever
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>

--
Chuck Lever