Hi Florian,

On Sun, 23 Feb 2020 23:14:35 +0100
Florian Westphal <fw@xxxxxxxxx> wrote:

> Stefano Brivio <sbrivio@xxxxxxxxxx> wrote:
> >  struct nft_pipapo_field {
> > @@ -439,6 +456,9 @@ struct nft_pipapo_field {
> >  	unsigned long rules;
> >  	size_t bsize;
> >  	int bb;
> > +#ifdef NFT_PIPAPO_ALIGN
> > +	unsigned long *lt_aligned;
> > +#endif
> >  	unsigned long *lt;
> >  	union nft_pipapo_map_bucket *mt;
> >  };
>
> I wonder if these structs can be compressed.
> AFAICS bsize is in sizes of longs, so when this number is
> large then we also need to kvmalloc a large blob of memory.
>
> I think u32 would be enough?
>
> nft_pipapo_field is probably the most relevant one wrt. to size.

...so I tried rearranging that struct. The results (on both x86_64 and
aarch64) are rather disappointing: the hole (that we get on 64-bit
architectures) seems to actually be beneficial.

If I turn the 'unsigned long' and 'size_t' members to u32, matching
rates drop very slightly (1-3% depending on the case in the usual
kselftest).

I then tried to shrink it more aggressively ('bb' and 'groups' can be
u8, 'bsize' can probably even be u16), and there the performance hit
is much more apparent (> 10%) -- but this is something I can easily
explain with word masks and shifts.

I'm not sure exactly what happens with the pair of u32's. The assembly
looks clean. I would probably need some micro-benchmarking to clearly
relate this to execution pipeline "features" and to, perhaps, find a
way to shuffle accesses to fields to actually speed this up while
fitting two fields in the same word. However, I'm not so sure it's
worth it at this stage.

> >  struct nft_pipapo_match {
> >  	int field_count;
> > +#ifdef NFT_PIPAPO_ALIGN
> > +	unsigned long * __percpu *scratch_aligned;
> > +#endif
> > 	unsigned long * __percpu *scratch;
> > 	size_t bsize_max;
>
> Same here (bsize_max -- could fit with hole after field_count)?
> Also, since you know the size of nft_pipapo_match (including the
> dynamically allocated array at the end), you could store the
> original memory (*scratch) and the rcu_head at the end, since
> they are not needed at lookup time and a little overhead to calculate
> their storage offset is fine.
>
> Not sure its worth it, just an idea.

I actually bothered with this: even without the trick you explained,
struct field f[0] can safely become f[16] (NFT_REG32_COUNT), and I can
move it to the top, and then push the rcu_head down.

This, again, hits lookup rates quite badly. With f[4], lookup rates
are the same as in the current case.

So, well, I wouldn't touch this either -- maybe some
micro-benchmarking might suggest a better way, but I doubt it's worth
it right now.

--
Stefano