Hi Andy,

On Tue, 18 May 2021 09:48:41 -0700
Andy Lutomirski <luto@xxxxxxxxxx> wrote:

> On 5/18/21 9:01 AM, Pablo Neira Ayuso wrote:
> > On Mon, May 10, 2021 at 07:58:52AM +0200, Stefano Brivio wrote:
> >> We don't need a valid MXCSR state for the lookup routines: none of
> >> the instructions we use rely on or affect any bit in the MXCSR
> >> register.
> >>
> >> Instead of calling kernel_fpu_begin(), we can pass 0 as mask to
> >> kernel_fpu_begin_mask() and spare one LDMXCSR instruction.
> >>
> >> Commit 49200d17d27d ("x86/fpu/64: Don't FNINIT in kernel_fpu_begin()")
> >> already speeds up lookups considerably, and by dropping the MXCSR
> >> initialisation we can now get a much smaller, but measurable,
> >> increase in matching rates.
> >>
> >> The table below reports matching rates and a wild approximation of
> >> clock cycles needed for a match in a "port,net" test with 10 entries
> >> from selftests/netfilter/nft_concat_range.sh, limited to the first
> >> field, i.e. the port (with nft_set_rbtree initialisation skipped),
> >> run on a single AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
> 
> Please consider reverting this patch. You have papered over the
> actual problem, which is that the kernel does not get the AVX
> pipeline stalls right. LDMXCSR merely exacerbates the problem, but
> your patch won't really fix it.

I don't follow your comment: all this patch does is skip the initial
LDMXCSR in kernel_fpu_begin(), because the instructions we use in the
lookup functions don't touch or rely on MXCSR, so a valid initial
state doesn't look relevant to me.

In the lookup functions themselves, I already ordered the AVX
instructions manually to minimise stalls.

What am I missing? Do you mean that LDMXCSR would come at a lower or
no cost if ordered differently?
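For reference, this is roughly what sits behind it -- a sketch of the
relevant bits of arch/x86/include/asm/fpu/api.h, from memory (as of
5.11 or so, exact comments and context might differ):

	/* Selectively initialise FPU state on kernel_fpu_begin(): */
	#define KFPU_387	_BITUL(0)	/* FNINIT: clean x87 state   */
	#define KFPU_MXCSR	_BITUL(1)	/* LDMXCSR: sane MXCSR state */

	extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
	extern void kernel_fpu_end(void);

	static inline void kernel_fpu_begin(void)
	{
	#ifdef CONFIG_X86_64
		/*
		 * Any 64-bit code that uses 387 instructions must
		 * explicitly request KFPU_387.
		 */
		kernel_fpu_begin_mask(KFPU_MXCSR);
	#else
		/*
		 * 32-bit kernel code may use 387 operations as well as
		 * SSE2 on capable hardware, so initialise everything.
		 */
		kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
	#endif
	}

and, on the pipapo side, the patch amounts to (again, just a sketch):

	/* No AVX2 instruction in the lookup reads or writes MXCSR, so
	 * request no state initialisation at all and spare the LDMXCSR
	 * that kernel_fpu_begin() would otherwise issue:
	 */
	kernel_fpu_begin_mask(0);

	/* ... AVX2 lookup ... */

	kernel_fpu_end();

-- 
Stefano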