On 12/5/19 7:52 PM, Frank Rowand wrote:
> On 12/3/19 10:56 AM, Rob Herring wrote:
>> On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand <frowand.list@xxxxxxxxx> wrote:
>>>
>>> On 12/2/19 10:12 PM, Michael Ellerman wrote:
>>>> Frank Rowand <frowand.list@xxxxxxxxx> writes:
>>>>> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
>>>>>> I've been looking at phandle_cache and noticed the following: The raw
>>>>>> phandle value as generated by dtc starts at zero and is incremented by
>>>>>> one for each phandle entry. The qemu pSeries model is using Slof (which
>>>>>> is probably the same thing as used on real hardware) and this looks like
>>>>>> a pointer value for the phandle.
>>>>>> With
>>>>>> qemu-system-ppc64le -m 16G -machine pseries -smp 8
>>>>>>
>>>>>> I got the following output:
>>>>>> | entries: 64
>>>>>> | phandle 7e732468 slot 28 hash c
>>>>>> | phandle 7e732ad0 slot 10 hash 27
>>>>>> | phandle 7e732ee8 slot 28 hash 3a
>>>>>> | phandle 7e734160 slot 20 hash 36
>>>>>> | phandle 7e734318 slot 18 hash 3a
>>>>>> | phandle 7e734428 slot 28 hash 33
>>>>>> | phandle 7e734538 slot 38 hash 2c
>>>>>> | phandle 7e734850 slot 10 hash e
>>>>>> | phandle 7e735220 slot 20 hash 2d
>>>>>> | phandle 7e735bf0 slot 30 hash d
>>>>>> | phandle 7e7365c0 slot 0 hash 2d
>>>>>> | phandle 7e736f90 slot 10 hash d
>>>>>> | phandle 7e737960 slot 20 hash 2d
>>>>>> | phandle 7e738330 slot 30 hash d
>>>>>> | phandle 7e738d00 slot 0 hash 2d
>>>>>> | phandle 7e739730 slot 30 hash 38
>>>>>> | phandle 7e73bd08 slot 8 hash 17
>>>>>> | phandle 7e73c2e0 slot 20 hash 32
>>>>>> | phandle 7e73c7f8 slot 38 hash 37
>>>>>> | phandle 7e782420 slot 20 hash 13
>>>>>> | phandle 7e782ed8 slot 18 hash 1b
>>>>>> | phandle 7e73ce28 slot 28 hash 39
>>>>>> | phandle 7e73d390 slot 10 hash 22
>>>>>> | phandle 7e73d9a8 slot 28 hash 1a
>>>>>> | phandle 7e73dc28 slot 28 hash 37
>>>>>> | phandle 7e73de00 slot 0 hash a
>>>>>> | phandle 7e73e028 slot 28 hash 0
>>>>>> | phandle 7e7621a8 slot 28 hash 36
>>>>>> | phandle 7e73e458 slot 18 hash 1e
>>>>>> | phandle 7e73e608 slot 8 hash 1e
>>>>>> | phandle 7e740078 slot 38 hash 28
>>>>>> | phandle 7e740180 slot 0 hash 1d
>>>>>> | phandle 7e740240 slot 0 hash 33
>>>>>> | phandle 7e740348 slot 8 hash 29
>>>>>> | phandle 7e740410 slot 10 hash 2
>>>>>> | phandle 7e740eb0 slot 30 hash 3e
>>>>>> | phandle 7e745390 slot 10 hash 33
>>>>>> | phandle 7e747b08 slot 8 hash c
>>>>>> | phandle 7e748528 slot 28 hash f
>>>>>> | phandle 7e74a6e0 slot 20 hash 18
>>>>>> | phandle 7e74aab0 slot 30 hash b
>>>>>> | phandle 7e74f788 slot 8 hash d
>>>>>> | Used entries: 8, hashed: 29
>>>>>>
>>>>>> So the hash array has 64 entries, out of which only 8 are populated. Using
>>>>>> hash_32() populates 29 entries.
>>>>>> Could someone with real hardware verify this?
>>>>>> I'm not sure how important this is performance wise; it just looks like a
>>>>>> waste using only 1/8 of the array.
>>>>>
>>>>> The hash used is based on the assumptions you noted, and as stated in the
>>>>> code, that phandle property values are in a contiguous range of 1..n
>>>>> (not starting from zero), which is what dtc generates.
>>>>
>>>>> We knew that for systems that do not match the assumptions, the hash
>>>>> will not be optimal.
>>>>
>>>> If we're going to have the phandle cache it should at least make some
>>>> attempt to work across the systems that Linux supports.
>>>>
>>>>> Unless there is a serious performance problem for
>>>>> such systems, I do not want to make the phandle hash code more complicated
>>>>> to optimize for these cases. And pseries systems have been performing OK,
>>>>> without phandle related performance issues that I remember hearing of, since
>>>>> before the cache was added, and the cache could only have helped performance.
>>>>> Yes, if your observations are correct, some memory is being wasted, but
>>>>> a 64 entry cache is not very large on a pseries.
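
As an aside, the slot numbers in the dump above are easy to explain: the SLOF
phandle values look like 8-byte aligned addresses (every "slot" in the dump is
a multiple of 8), so with the mask-based slot calculation,
phandle & (entries - 1) as I read the current code, only every 8th slot of a
64-entry cache can ever be used, while hash_32() mixes in the upper bits. The
small userspace sketch below illustrates this with synthetic, evenly strided
values and a stand-in for hash_32() whose constant I believe matches the
kernel's (treat both as assumptions), so the counts will only roughly echo the
8-used / 29-hashed numbers above:

#include <stdint.h>
#include <stdio.h>

#define CACHE_BITS	6
#define CACHE_ENTRIES	(1u << CACHE_BITS)

/* stand-in for the kernel's hash_32(); the multiplicative constant is assumed */
static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	return (val * 0x61C88647u) >> (32 - bits);
}

int main(void)
{
	uint64_t mask_slots = 0, hash_slots = 0;
	uint32_t i;

	/* 42 synthetic, 8-byte aligned, pointer-like phandles (not real SLOF data) */
	for (i = 0; i < 42; i++) {
		uint32_t ph = 0x7e732468u + i * 0x418;

		mask_slots |= 1ull << (ph & (CACHE_ENTRIES - 1));
		hash_slots |= 1ull << hash_32(ph, CACHE_BITS);
	}

	printf("mask: %d of %u slots used\n",
	       __builtin_popcountll(mask_slots), CACHE_ENTRIES);
	printf("hash: %d of %u slots used\n",
	       __builtin_popcountll(hash_slots), CACHE_ENTRIES);
	return 0;
}
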
>>>>
>>>> A single line change to use an actual hash function is hardly
>>>> complicating it, compared to the amount of code already there.
>>>
>>> With a dtc generated devicetree, the hash is perfect, with no
>>> misses. That single line change then makes the hash bad for
>>> dtc generated devicetrees.
>>
>> To quantify that, I did some tests with the hash algo and it's about a
>> 10% collision rate using phandle values of 1-$cache_size. There's
>> never more than 2 phandles colliding in a slot. The actual impact
>> would be less given 100% thrashing seems unlikely.

I also tested a few cache sizes. For a cache size of 64 entries, I get a
14% collision rate (and also 14% of entries unused). For 1024 entries,
I get a 12% collision rate. And I can confirm no more than two phandles
hashing to the same slot. So essentially confirming your data.

-Frank

>
> Thank you for doing the tests. Actual data helps a lot.
>
> If there is only a 10% collision rate for this case, that does not
> sound bad to me. There is the possibility of current or future
> code resulting in ping-ponging between two phandle values which
> collide in the cache, but that should not be the common case.
>
> However, given a choice between two algorithms, one of which
> guarantees no thrashing (the current cache algorithm) and one
> which allows a pathological use case that results in thrashing,
> I prefer the first algorithm. This may seem theoretical, but
> I have seen enough pathological code paths in my years of
> performance analysis and tuning to be sensitive to this issue.
>
>>
>>> The cache was added to address a problem with a system with a
>>> dtc generated devicetree.
>>
>> The cache was added to address a problem with a system with a large
>> number of nodes and phandles (814 phandles in the reported case). dtc
>> happens to be used in that case.
>
> Yes, you stated that in a more general way than I did. Agreed.
>
>
>>
>>> I had not heard of any phandle
>>> lookup performance issues on ppc systems. An imperfect hash
>>> for ppc should not make the ppc performance worse (unless
>>> every single phandle value hashes to a single bucket). So
>>> the ppc performance is likely better than it was before
>>> the hash was added, even without an optimal hash algorithm.
>>>
>>> So the change would not be a single line change. It would
>>> be a change to use different hash algorithms for dtc
>>> generated device trees versus other device trees.
>>>
>>> Another possibility would be to make the cache depend upon
>>> CONFIG_PPC not being set. It might be possible to disable the
>>> cache with a minimal code change.
>>
>> I'd rather not do that.
>
> I also agree with that. I threw that out as kind of a nuclear option.
> If ppc did not benefit from the cache having been implemented and
> perceived a negative impact from the cache, this is simply a way
> to remove that negative impact. Then ppc would be in the same
> state as before we implemented the cache. Again, I also do not
> like such special casing.
>
>>
>> And yes, as mentioned earlier I don't like the complexity. I didn't
>> from the start and I'm still of the opinion we should have a
>> fixed or one time sized true cache (i.e. smaller than the total # of
>> phandles). That would solve the RT memory allocation and locking issue
>> too.
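
Just to make sure we are picturing the same thing for "fixed or one time
sized true cache": something along the lines below is what I have in mind, a
direct-mapped table indexed by a hash, with the slot simply replaced on a
miss. This is only a sketch to anchor the discussion, not a proposal; the
names and the 64-entry size are made up, and locking (devtree_lock), the
handle == 0 check, and refcounting of cached entries are all ignored here:

#include <linux/of.h>
#include <linux/hash.h>

#define OF_PHANDLE_CACHE_BITS	6	/* example size: 64 entries */
#define OF_PHANDLE_CACHE_SZ	(1u << OF_PHANDLE_CACHE_BITS)

static struct device_node *phandle_cache_sketch[OF_PHANDLE_CACHE_SZ];

static struct device_node *of_find_node_by_phandle_sketch(phandle handle)
{
	u32 slot = hash_32(handle, OF_PHANDLE_CACHE_BITS);
	struct device_node *np;

	/* hit: this slot already holds the node for this phandle */
	np = phandle_cache_sketch[slot];
	if (np && np->phandle == handle)
		return of_node_get(np);

	/* miss: fall back to a full walk, then replace whatever was cached */
	for_each_of_allnodes(np)
		if (np->phandle == handle) {
			phandle_cache_sketch[slot] = np;
			break;
		}

	return of_node_get(np);
}
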
>
> I could be convinced to go to a one time sized cache, especially now
> that the RT issue exists. If we change to that, it would be documented
> as a possible overhead impact that devicetree overlays would have to
> accept.
>
> -Frank
>
>>
>> For reference, the performance improvement with the current
>> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
>> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
>> IMO, the 340ms improvement was good enough.
>>
>> Rob
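
P.S. For anyone who wants to reproduce the collision numbers, a quick
userspace check along the lines below is enough (a sketch for illustration,
not the exact test either of us ran, and the hash_32() stand-in assumes the
kernel's multiplicative constant). It feeds phandles 1..N into an N-entry
table and counts how many land in an already occupied slot; with N distinct
values in N slots, the number of unused slots equals the number of
collisions, which is why the 14% collision rate and 14% unused entries match.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* stand-in for the kernel's hash_32(); the multiplicative constant is assumed */
static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	return (val * 0x61C88647u) >> (32 - bits);
}

static void collision_rate(unsigned int bits)
{
	uint32_t entries = 1u << bits;
	unsigned char *used = calloc(entries, 1);
	uint32_t i, collisions = 0;

	if (!used)
		return;

	/* phandles 1..entries, as dtc would assign them */
	for (i = 1; i <= entries; i++) {
		uint32_t slot = hash_32(i, bits);

		collisions += used[slot];
		used[slot] = 1;
	}

	printf("%5u entries: %u collisions (%.1f%%)\n",
	       entries, collisions, 100.0 * collisions / entries);
	free(used);
}

int main(void)
{
	collision_rate(6);	/* 64-entry cache */
	collision_rate(10);	/* 1024-entry cache */
	return 0;
}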