Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Wed, 12 Oct 2022 15:01:53 +0000

> On Oct 11, 2022, at 7:37 PM, NeilBrown <neilb@xxxxxxx> wrote:
> 
> On Tue, 11 Oct 2022, Chuck Lever III wrote:
>>> On Oct 10, 2022, at 8:16 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> 
>>> On Fri, 07 Oct 2022, Chuck Lever wrote:
>>>> 
>>>> -static unsigned int file_hashval(struct svc_fh *fh)
>>>> +/*
>>>> + * The returned hash value is based solely on the address of an in-code
>>>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>>>> + * pointer is concentrated in its middle bits.
>>> 
>>> I think you need more justification than that for throwing away some of
>>> the entropy, even if you don't think it is much.
>> 
>> We might have that justification:
>> 
>> https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/
>> 
>> Actually I believe we are not discarding /any/ entropy in
>> this function. The bits we discard are invariant.
>> 
> 
> Ok, I can accept that this:
> 
> +	k = ptr >> L1_CACHE_SHIFT;

> I searched for ">> *L1_CACHE_SHIFT".  Apart from the nfsd
> filecache you mentioned I find two.  One in quota and one in reiserfs.
> Both work with traditional hash tables which are more forgiving of
> longer chains.
> Do you have other evidence of this being a common trope?

This approach is based on the hash function in fs/inode.c,
which uses integer division instead of a shift.

 509 static unsigned long hash(struct super_block *sb, unsigned long hashval)
 510 {
 511         unsigned long tmp;
 512 
 513         tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
 514                         L1_CACHE_BYTES;
 515         tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
 516         return tmp & i_hash_mask;
 517 }

> only discards invariant bits, but how can you justify this:
> 
> +	k &= 0x00ffffff;
> 
> ??

After shifting an address, the top byte generally contains
invariant bits as well.

> And given that you pass it all to jhash anyway, why not just pass all of
> it?

I don't think this is a big deal, but these functions are
basically the same as what was recently merged without
complaint. It's not a high priority to revisit those.

It might be worth a clean-up to share this hash function
between the two hash tables... at that point we might
consider removing the extra mask.

>> And, note that this is exactly the same situation we just merged
>> in the filecache overhaul, and is a common trope amongst other
>> hash tables that base their function on the inode's address.
>> 
>> 
>>> Presumably you think hashing 32 bits is faster than hashing 64 bits.
>>> Maybe it is, but is it worth it?
>>> 
>>> rhashtable depends heavily on having a strong hash function.  In
>>> particular if any bucket ends up with more than 16 elements it will
>>> choose a new seed and rehash.  If you deliberately remove some bits that
>>> it might have been used to spread those 16 out, then you are asking for
>>> trouble.
>>> 
>>> We know that two different file handles can refer to the same inode
>>> ("rarely"), and you deliberately place them in the same hash bucket.
>>> So if an attacker arranges to access 17 files with the same inode but
>>> different file handles, then the hashtable will be continuously
>>> rehashed.
>>> 
>>> The preferred approach when you require things to share a hash chain is
>>> to use an rhl table.
>> 
>> Again, this is the same situation for the filecache. Do you
>> believe it is worth reworking that? I'm guessing "yes".
> 
> As a matter of principle: yes.
> rhashtable is designed to assume that hash collisions are bad and can be
> changed by choosing a different seed.
> rhl_tables was explicitly added for cases when people wanted multiple
> elements to hash to the same value.
> 
> The chance of it causing a problem without an attack are admittedly
> tiny.  Attacks are only possible with subtree_check enabled, or if the
> underlying filesystem does something "clever" with file handles, so
> there wouldn't be many situations where an attack would even be
> possible.  But if it were possible, it would likely be easy.
> The cost of the attack would be a minor-to-modest performance impact.
> 
> So it is hard to argue "this code is dangerous and must be changed", but
> we have different tools for a reason, and I believe that rhl-tables is
> the right tool for this job.

Agreed. I wasn't suggesting it's an emergency situation, but it's
something that should get fixed at some point if there's a problem
with it, even a minor one.

I think I stopped at the non-list variant of rhashtable because
using rhl was more complex, and the non-list variant seemed to
work fine. There's no architectural reason either file_hashtbl
or the filecache must use the non-list variant.

In any event, it's worth taking the trouble now to change the
nfs4_file implementation proposed here as you suggest.

--
Chuck Lever