Re: [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()

On Mon, 10 Feb 2025, Al Viro wrote:
> On Mon, Feb 10, 2025 at 03:58:02PM +1100, NeilBrown wrote:
> > On Sat, 08 Feb 2025, Al Viro wrote:
> > > 1) what's wrong with using middle bits of dentry as index?  What the hell
> > > is that thing about pid for?
> > 
> > What does "hell" have to do with it?
> > 
> > All we need here is a random number.  Preferably a cheap random number.
> > pid is cheap and quite random.
> > The dentry pointer would be even cheaper (no mem access) providing it
> > doesn't cost much to get the randomness out.  I considered hash_ptr()
> > but thought that was more code than it was worth.
> > 
> > Do you have a formula for selecting the "middle" bits in a way that is
> > expected to still give good randomness?
> 
> ((unsigned long) dentry / L1_CACHE_BYTES) % <table size>
> 
> Bits just over the cacheline size should have uniform distribution...

I tested this, doing the calculation on each allocation and counting the
number of times each bucket was hit.
On my test kernel with lockdep enabled the dentry is 328 bytes and
L1_CACHE_BYTES is 64, so each dentry spans 6 cache lines and 10 dentries
fit in a 4K slab.  The indices produced by the above formula covered
only about 1 in 6 of the available values.
The 256 possibilities divide into 4 groups of 64, and within each group
only 10 values ever occur: 0 6 12 18 24 30 36 42 48 54

Without lockdep making the dentry extra large, struct dentry is 192
bytes, exactly 3 cache lines, and there are 16 objects per 4K slab.
Now exactly 1/4 of the possible indices are used: within every group of
16, only 0, 4, 8 and 12 appear.
slabinfo says the object size is 256 bytes, which explains the stride of
4 cache lines rather than 3.  But ultimately the problem is that
addresses are not evenly distributed inside a single slab.
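
For what it's worth, the pattern is easy to reproduce outside the
kernel.  Here is a rough userspace sketch of that counting - the
256-byte object size, 4K slabs with no metadata, and 256-entry table
are all just assumptions for illustration:

/*
 * Walk the object addresses in a run of 4K slabs and record which
 * table indices (addr / L1_CACHE_BYTES) % TABLE_SIZE can ever be hit.
 */
#include <stdio.h>

#define L1_CACHE_BYTES	64
#define OBJ_SIZE	256
#define TABLE_SIZE	256

int main(void)
{
	int used[TABLE_SIZE] = {0};

	for (unsigned long page = 0; page < 64 * 4096UL; page += 4096)
		for (unsigned long obj = page; obj + OBJ_SIZE <= page + 4096;
		     obj += OBJ_SIZE)
			used[(obj / L1_CACHE_BYTES) % TABLE_SIZE] = 1;

	for (int i = 0; i < TABLE_SIZE; i++)
		if (used[i])
			printf("%d ", i);	/* only multiples of 4 appear */
	printf("\n");
	return 0;
}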

If I divide by PAGE_SIZE instead of L1_CACHE_BYTES I get every possible
value used but it is far from uniform.
With 40000 allocations we would want about 160 in each slot.
The median I measured is 155 (good) but the range is from 16 to 330,
which is nearly +/- 100% of the median.
So that isn't uniform - but then you weren't suggesting it would be, exactly.

I don't think there is a good case here for selecting bits from the
middle of the dentry address.

If I use hash_ptr(dentry, 8) I get a more uniform distribution.  With
64000 allocations we would hope for 250 per bucket.  The median is 248
and the range is 186 to 324, so roughly +/- 25%.

Maybe that is the better choice.
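
If we went that way, the bucket lookup might end up something like the
sketch below - d_wait_table, D_WAIT_TABLE_BITS and d_wait_head() are
names I've made up here purely for illustration:

#include <linux/dcache.h>
#include <linux/hash.h>
#include <linux/wait.h>

#define D_WAIT_TABLE_BITS	8

/* each entry still needs init_waitqueue_head() at boot */
static wait_queue_head_t d_wait_table[1 << D_WAIT_TABLE_BITS];

/* pick one of the 256 shared wait queue heads for this dentry */
static inline wait_queue_head_t *d_wait_head(const struct dentry *dentry)
{
	return &d_wait_table[hash_ptr(dentry, D_WAIT_TABLE_BITS)];
}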


> 
> > > 2) part in d_add_ci() might be worth a comment re d_lookup_done() coming
> > > for the original dentry, no matter what.
> > 
> > I think the previous code deserved explanation more than the new, but
> > maybe I missed something.
> > In each case, d_wait_lookup() will wait for the given dentry to no
> > longer be d_in_lookup() which means waiting for DCACHE_PAR_LOOKUP to be
> > cleared.  The only place which clears DCACHE_PAR_LOOKUP is
> > __d_lookup_unhash_wake(), which always wakes the target.
> > In the previous code it would wake both the non-case-exact dentry and
> > the case-exact dentry waiters but they would go back to sleep if their
> > DCACHE_PAR_LOOKUP hadn't been cleared, so no interesting behaviour.
> > Reusing the wq from one to the other is a sensible simplification, but
> > not something we need any reminder of once it is no longer needed.
> 
> It's not just about the wakeups; any in-lookup dentry should be taken
> out of in-lookup hash before it gets dropped.
>  
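
(For reference, d_in_lookup() is just a test of that flag - from memory
of current mainline it is something like:

static inline bool d_in_lookup(const struct dentry *dentry)
{
	return dentry->d_flags & DCACHE_PAR_LOOKUP;
}

so "no longer d_in_lookup()" and "DCACHE_PAR_LOOKUP cleared" really are
the same condition.)
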
> > > 3) the dance with conditional __wake_up() is worth a helper, IMO.
> 
> I mean an inlined helper function.

Yes.. Of course...

Maybe we should put

static inline void wake_up_key(struct wait_queue_head *wq, void *key)
{
	/* wake all waiters on @wq, passing @key through to their wake functions */
	__wake_up(wq, TASK_NORMAL, 0, key);
}

in include/linux/wait.h to avoid the __wake_up() "internal" name, and
then use
	wake_up_key(d_wait, dentry);
in the two places in dcache.c, or did you want something
dcache-specific?
I'm not good at guessing what other people are thinking.
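
The existing wrappers in wait.h are all thin macros over __wake_up()
anyway - from memory (worth double-checking) the first few look like:

#define wake_up(x)		__wake_up(x, TASK_NORMAL, 1, NULL)
#define wake_up_nr(x, nr)	__wake_up(x, TASK_NORMAL, nr, NULL)
#define wake_up_all(x)		__wake_up(x, TASK_NORMAL, 0, NULL)

so an inline wake_up_key() sitting next to them seems reasonably
consistent to me.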

Thanks,
NeilBrown




