Hi Andy,

On Tue, 18 May 2021 09:48:41 -0700
Andy Lutomirski <luto@xxxxxxxxxx> wrote:

> On 5/18/21 9:01 AM, Pablo Neira Ayuso wrote:
> > On Mon, May 10, 2021 at 07:58:52AM +0200, Stefano Brivio wrote:
> >> We don't need a valid MXCSR state for the lookup routines: none of
> >> the instructions we use rely on or affect any bit in the MXCSR
> >> register.
> >>
> >> Instead of calling kernel_fpu_begin(), we can pass 0 as mask to
> >> kernel_fpu_begin_mask() and spare one LDMXCSR instruction.
> >>
> >> Commit 49200d17d27d ("x86/fpu/64: Don't FNINIT in kernel_fpu_begin()")
> >> already speeds up lookups considerably, and by dropping the MXCSR
> >> initialisation we can now get a much smaller, but measurable,
> >> increase in matching rates.
> >>
> >> The table below reports matching rates and a wild approximation of
> >> clock cycles needed for a match in a "port,net" test with 10 entries
> >> from selftests/netfilter/nft_concat_range.sh, limited to the first
> >> field, i.e. the port (with nft_set_rbtree initialisation skipped),
> >> run on a single AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
> 
> Please consider reverting this patch. You have papered over the
> actual problem, which is that the kernel does not get the AVX
> pipeline stalls right. LDMXCSR merely exacerbates the problem, but
> your patch won't really fix it.

I don't follow your comment: all this patch does is skip the initial
LDMXCSR in kernel_fpu_begin(), because the instructions we use in the
lookup functions don't touch or rely on MXCSR, so a valid initial
state doesn't look relevant to me.

In the lookup functions themselves, I already ordered the AVX
instructions manually to minimise stalls.

What am I missing? Do you mean that LDMXCSR would come at a lower or
no cost if ordered differently?
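For reference, this is roughly what sits behind it -- a sketch of the
relevant bits of arch/x86/include/asm/fpu/api.h, from memory (as of
5.11 or so, exact comments and context might differ):

	/* Selectively initialise FPU state on kernel_fpu_begin(): */
	#define KFPU_387	_BITUL(0)	/* FNINIT: clean x87 state   */
	#define KFPU_MXCSR	_BITUL(1)	/* LDMXCSR: sane MXCSR state */

	extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
	extern void kernel_fpu_end(void);

	static inline void kernel_fpu_begin(void)
	{
	#ifdef CONFIG_X86_64
		/*
		 * Any 64-bit code that uses 387 instructions must
		 * explicitly request KFPU_387.
		 */
		kernel_fpu_begin_mask(KFPU_MXCSR);
	#else
		/*
		 * 32-bit kernel code may use 387 operations as well as
		 * SSE2 on capable hardware, so initialise everything.
		 */
		kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
	#endif
	}

and, on the pipapo side, the patch amounts to (again, just a sketch):

	/* No AVX2 instruction in the lookup reads or writes MXCSR, so
	 * request no state initialisation at all and spare the LDMXCSR
	 * that kernel_fpu_begin() would otherwise issue:
	 */
	kernel_fpu_begin_mask(0);

	/* ... AVX2 lookup ... */

	kernel_fpu_end();

-- 
Stefano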