> On Oct 25, 2022, at 1:18 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Mon, 2022-10-24 at 22:30 +0000, Chuck Lever III wrote:
>>
>>> On Oct 24, 2022, at 6:02 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>>
>>> On Tue, 25 Oct 2022, Chuck Lever III wrote:
>>>>
>>>>> On Oct 23, 2022, at 11:26 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>>>>
>>>>> On Wed, 19 Oct 2022, Chuck Lever wrote:
>>>>>> +/*
>>>>>> + * The returned hash value is based solely on the address of an in-core
>>>>>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>>>>>> + * pointer is concentrated in its middle bits.
>>>>>> + */
>>>>>> +static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
>>>>>> +{
>>>>>> +	u32 k;
>>>>>> +
>>>>>> +	k = ((unsigned long)inode) >> L1_CACHE_SHIFT;
>>>>>> +	return jhash2(&k, 1, seed);
>>>>>
>>>>> I still don't think this makes any sense at all.
>>>>>
>>>>>	return jhash2(&inode, sizeof(inode)/4, seed);
>>>>>
>>>>> uses all of the entropy, which is best for rhashtables.
>>>>
>>>> I don't really disagree, but the L1_CACHE_SHIFT was added at
>>>> Al's behest. OK, I'll replace it.
>>>
>>> I think that was in a different context, though.
>>>
>>> If you are using a simple fixed array of bucket chains for a hash
>>> table, then shifting down L1_CACHE_SHIFT and masking off the number of
>>> bits to match the array size is a perfectly sensible hash function. It
>>> will occasionally produce longer chains, but not often.
>>>
>>> But rhashtable does something a bit different. It mixes a seed into
>>> the key as part of producing the hash, and assumes that if an
>>> unfortunate distribution of keys results in a substantially non-uniform
>>> distribution into buckets, then choosing a new random seed will
>>> re-distribute the keys into buckets and will probably produce a more
>>> uniform distribution.
>>>
>>> The purpose of this seed is (as I understand it) primarily focused on
>>> network-facing hash tables where an attacker might be able to choose
>>> keys that all hash to the same value. The random seed will defeat the
>>> attacker.
>>>
>>> When hashing inodes there is no opportunity for a deliberate attack, and
>>> we would probably be happy to not use a seed and accept the small
>>> possibility of occasional long chains. However the rhashtable code
>>> doesn't offer us that option. It insists on rehashing if any chain
>>> exceeds 16. So we need to provide as much entropy as we can, and mix the
>>> seed in as effectively as we can, to ensure that rehashing really works.
>>>
>>>>>> +static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
>>>>>> +			       const void *ptr)
>>>>>> +{
>>>>>> +	const struct svc_fh *fhp = arg->key;
>>>>>> +	const struct nfs4_file *fi = ptr;
>>>>>> +
>>>>>> +	return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
>>>>>> +}
>>>>>
>>>>> This doesn't make sense. Maybe the subtleties of rhl-tables are a bit
>>>>> obscure.
>>>>>
>>>>> An rhltable is like an rhashtable except that insertion can never fail.
>>>>> Multiple entries with the same key can co-exist in a linked list.
>>>>> Lookup does not return an entry but returns a linked list of entries.
>>>>>
>>>>> For the nfs4_file table this isn't really a good match (though I
>>>>> previously thought it was). Insertion must be able to fail if an entry
>>>>> with the same filehandle exists.
>>>>
>>>> I think I've arrived at the same conclusion (about nfs4_file insertion
>>>> needing to fail for duplicate filehandles). That's more or less
>>>> open-coded in my current revision of this work rather than relying on
>>>> the behavior of rhltable_insert().
>>>>
>>>>> One approach that might work would be to take the inode->i_lock after a
>>>>> lookup fails, then repeat the lookup, and if it fails again, do the
>>>>> insert. This keeps the locking fine-grained but means we don't have to
>>>>> depend on insert failing.
>>>>>
>>>>> So the hash and cmp functions would only look at the inode pointer.
>>>>> If lookup returns something, we would walk the list calling fh_match()
>>>>> to see if we have the right entry - all under RCU and using
>>>>> refcount_inc_not_zero() when we find the matching filehandle.
>>>>
>>>> That's basically what's going on now. But I tried removing obj_cmpfn()
>>>> and the rhltable_init() failed outright, so it's still there like a
>>>> vestigial organ.
>>>
>>> You need an obj_cmpfn if you have an obj_hashfn. If you don't have
>>> obj_hashfn and just provide hashfn and key_len, then you don't need the
>>> cmpfn.
>>>
>>> If you have a cmpfn, just compare the inode (the same thing that you
>>> hash) - don't compare the file handle.
>>>
>>>> If you now believe rhltable is not a good fit, I can revert all of the
>>>> rhltable changes and go back to rhashtable quite easily.
>>>
>>> I wasn't very clear, as I was still working out what I thought.
>>>
>>> I think an rhashtable cannot work, as you need two different keys: the
>>> inode and the filehandle.
>>>
>>> I think rhltable can work, but it needs help, particularly as it will
>>> never fail an insertion.
>>> The help we can provide is to take a lock, perform a lookup, then only
>>> try to insert if the lookup failed (and we are still holding an
>>> appropriate lock to stop another thread inserting). Thus any time we do
>>> try an insert, it should not fail.
>>>
>>> The lock I suggest is ->i_lock in the inode.
>>
>> I will give that a try next.
>>
>>> The rhltable should only consider the inode as a key, and should provide
>>> a linked list of nfs4_files with the same inode.
>>
>> The implementation I've arrived at is rhltable keyed on inode only.
>> The bucket chains are searched with fh_match().
>>
>> I've gotten rid of all of the special obj_hash/cmp functions as a
>> result of this simplification. If I've set up the rhashtable
>> parameters correctly, the rhashtable code should use jhash/jhash2
>> on the whole inode pointer by default.
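
To make that concrete, here is roughly what my current revision sets up.
Treat this as a sketch of work in progress rather than final code: the
nfs4_file field names (fi_inode for the cached inode pointer, fi_rlist
for the rhlist linkage) and the min_size choice are simply what my tree
happens to use today.

	static const struct rhashtable_params nfs4_file_rhash_params = {
		.key_len		= sizeof_field(struct nfs4_file, fi_inode),
		.key_offset		= offsetof(struct nfs4_file, fi_inode),
		.head_offset		= offsetof(struct nfs4_file, fi_rlist),

		/* Reduce resizing churn on light workloads */
		.min_size		= 256,
		.automatic_shrinking	= true,
	};

My reading of rht_key_get_hash() is that, with no hashfn, obj_hashfn,
or obj_cmpfn supplied, the key_len bytes at key_offset - here, exactly
the inode pointer - are hashed with jhash, or jhash2 when key_len is a
multiple of four, and the per-table seed is mixed in as the initval.
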
>>> We then code a linear search of this linked list (which is expected to
>>> have one entry) to find if there is an entry with the required file
>>> handle.
>>
>> I'm not sure about that expectation: multiple inode pointers could
>> hash to the same bucket. Also, there are cases where multiple file
>> handles refer to the same inode (the aliasing that a0ce48375a36
>> ("nfsd: track filehandle aliasing in nfs4_files") tries to deal
>> with).
>>
>> I will post what I have right now. It's not passing all tests due
>> to very recent changes (and probably because of lack of insert
>> serialization). We can pick up from there; I know I've likely missed
>> some review comments still.
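
For the insert serialization, what I have in mind is something along
these lines, combining Neil's ->i_lock suggestion with the rhltable
keyed on the inode pointer. An untested sketch: the function name, the
nfs4_file_rhltable global, and the fi_rlist field are placeholders,
not code from the posted series.

	/*
	 * Under inode->i_lock, search the inode's bucket list for an
	 * nfs4_file with a matching file handle; insert @new only when
	 * no match is found. Holding i_lock across both steps prevents
	 * a racing thread from inserting a duplicate in between.
	 */
	static struct nfs4_file *
	nfs4_file_find_or_insert(struct nfs4_file *new, struct svc_fh *fhp)
	{
		struct inode *inode = d_inode(fhp->fh_dentry);
		struct rhlist_head *tmp, *list;
		struct nfs4_file *fi;

		rcu_read_lock();
		spin_lock(&inode->i_lock);

		list = rhltable_lookup(&nfs4_file_rhltable, &inode,
				       nfs4_file_rhash_params);
		rhl_for_each_entry_rcu(fi, tmp, list, fi_rlist) {
			if (fh_match(&fi->fi_fhandle, &fhp->fh_handle) &&
			    refcount_inc_not_zero(&fi->fi_ref))
				goto out_unlock;
		}

		fi = NULL;
		if (rhltable_insert(&nfs4_file_rhltable, &new->fi_rlist,
				    nfs4_file_rhash_params) == 0)
			fi = new;
	out_unlock:
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return fi;
	}

A caller that gets back an existing file would drop its preallocated
@new; a NULL return would mean rhltable_insert() failed, which should
not happen once inserts are serialized this way.
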
> Thanks. I've started looking over this and some other changes, and found
> at least a couple of problems with the existing code (not necessarily
> due to your patches, fwiw):
>
> 1/ the nf_lru is the linkage to the LRU, but it's also used when we move
> an object to a dispose list. I don't see anything that prevents you from
> trying to add an entry back onto the LRU while it's still on a dispose
> list (which could cause corruption). I think we need some clear rules
> around how that field gets used.

I chased this when converting the filecache to rhashtable. There don't
seem to be any logic errors in the current code, and it seems to be a
common trope in other consumers of list_lru, so I left it alone. Yeah,
it makes me nervous too.

> 2/ nfsd_file_close_inode_sync can end up putting an nf onto the dispose
> list without ensuring that nf_ref has gone to zero. I think there are
> some situations where the file can end up being freed out from under a
> valid reference due to that.
>
> I'm working on some patches to clean this up along the lines of the
> scheme Neil is advocating for. For now, I'm basing this on top of your
> series.

I'll try to refresh kernel.org:cel for-next later today.

>>> An alternative is to not use a resizable table at all. We could stick
>>> with the table we currently have and make small changes.
>>>
>>> 1/ change the compare function to test the inode pointer first and only
>>>    call fh_match when the inodes match. This should make it much
>>>    faster.
>>
>> I had the same thought, but this doesn't reduce the number of CPU
>> cache misses I observed from long bucket chain walks. I'd prefer a
>> solution that maintains small buckets.
>>
>>> 2/ Instead of using state_lock for locking, use a bit-spin-lock on the
>>>    lsb of the bucket pointers in the array to lock each bucket. This
>>>    will substantially reduce locking conflicts.
>>>
>>> I don't see a big performance difference between this approach and the
>>> rhltable approach. It really depends which you would feel more
>>> comfortable working with. Would you rather work with common code which
>>> isn't quite a perfect fit, or with private code that you have complete
>>> control over?
>>
>> I think rhltable still has some legs, will keep chipping away at that.
>>
>> --
>> Chuck Lever
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>

--
Chuck Lever