Re: [PATCH nf-next 3/5] nft_set_pipapo: Prepare for vectorised implementation: alignment

Hi Florian,

On Sun, 23 Feb 2020 23:14:35 +0100
Florian Westphal <fw@xxxxxxxxx> wrote:

> Stefano Brivio <sbrivio@xxxxxxxxxx> wrote:
> >  struct nft_pipapo_field {
> > @@ -439,6 +456,9 @@ struct nft_pipapo_field {
> >  	unsigned long rules;
> >  	size_t bsize;
> >  	int bb;
> > +#ifdef NFT_PIPAPO_ALIGN
> > +	unsigned long *lt_aligned;
> > +#endif
> >  	unsigned long *lt;
> >  	union nft_pipapo_map_bucket *mt;
> >  };  
> 
> I wonder if these structs can be compressed.
> AFAICS bsize is in sizes of longs, so when this number is
> large then we also need to kvmalloc a large blob of memory.
> 
> I think u32 would be enough?
> 
> nft_pipapo_field is probably the most relevant one wrt. size.

...so I tried rearranging that struct. The results (on both x86_64 and
aarch64) are rather disappointing: the hole we get on 64-bit
architectures actually seems to be beneficial.

If I turn the 'unsigned long' and 'size_t' members into u32, matching
rates drop very slightly (1-3%, depending on the case, in the usual
kselftest).

I then tried to shrink it more aggressively ('bb' and 'groups' can be
u8, 'bsize' can probably even be u16), and there the performance hit is
much more apparent (> 10%) -- but that one I can easily explain with
the word masks and shifts needed for sub-word accesses.
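
For reference, the shrunk layout would look roughly like this (an
illustrative sketch only: member order follows the current struct, and
the more aggressive variant is just described in the comment):

/* u32 variant: only the 'unsigned long' and 'size_t' members change;
 * the aggressive variant additionally turns 'groups' and 'bb' into u8
 * and 'bsize' into u16.
 */
struct nft_pipapo_field {
	int groups;
	u32 rules;	/* was unsigned long */
	u32 bsize;	/* was size_t; counted in longs, so u32 fits */
	int bb;
#ifdef NFT_PIPAPO_ALIGN
	unsigned long *lt_aligned;
#endif
	unsigned long *lt;
	union nft_pipapo_map_bucket *mt;
};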

I'm not sure exactly what happens with the pair of u32's. The assembly
looks clean. I would probably need some micro-benchmarking to clearly
relate this to execution pipeline "features" and, perhaps, to find a
way to shuffle accesses to the fields so that packing two of them into
the same word actually speeds things up.

However, I'm not so sure it's worth it at this stage.

> >  struct nft_pipapo_match {
> >  	int field_count;
> > +#ifdef NFT_PIPAPO_ALIGN
> > +	unsigned long * __percpu *scratch_aligned;
> > +#endif
> >  	unsigned long * __percpu *scratch;
> >  	size_t bsize_max;  
> 
> Same here (bsize_max -- could fit with hole after field_count)?
> 
> Also, since you know the size of nft_pipapo_match (including the
> dynamically allocated array at the end), you could store the
> original memory (*scratch) and the rcu_head at the end, since
> they are not needed at lookup time and a little overhead to calculate
> their storage offset is fine.
> 
> Not sure it's worth it, just an idea.

I actually bothered with this: even without the trick you explained,
struct field f[0] can safely become f[16] (NFT_REG32_COUNT), and I can
move it to the top and then push the rcu_head down. This, again, hits
lookup rates quite badly. With f[4], lookup rates are the same as in
the current case.
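
Roughly, the rearrangement looks like this (sketch only; the rcu_head
member name and the exact ordering of the remaining members below are
assumptions):

/* f[] moved to the top with a fixed size, rcu_head pushed out of the
 * lookup path to the bottom. This is the variant that hurts lookup
 * rates; with f[4] instead of f[NFT_REG32_COUNT] it's a wash.
 */
struct nft_pipapo_match {
	struct nft_pipapo_field f[NFT_REG32_COUNT];	/* was f[0] at the end */
	int field_count;
#ifdef NFT_PIPAPO_ALIGN
	unsigned long * __percpu *scratch_aligned;
#endif
	unsigned long * __percpu *scratch;
	size_t bsize_max;
	struct rcu_head rcu;	/* member name assumed */
};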

So, well, I wouldn't touch this either -- maybe some micro-benchmarking
might suggest a better way, but I doubt it's worth it right now.

-- 
Stefano
