On 5/18/21 9:01 AM, Pablo Neira Ayuso wrote: > On Mon, May 10, 2021 at 07:58:52AM +0200, Stefano Brivio wrote: >> We don't need a valid MXCSR state for the lookup routines, none of >> the instructions we use rely on or affect any bit in the MXCSR >> register. >> >> Instead of calling kernel_fpu_begin(), we can pass 0 as mask to >> kernel_fpu_begin_mask() and spare one LDMXCSR instruction. >> >> Commit 49200d17d27d ("x86/fpu/64: Don't FNINIT in kernel_fpu_begin()") >> already speeds up lookups considerably, and by dropping the MCXSR >> initialisation we can now get a much smaller, but measurable, increase >> in matching rates. >> >> The table below reports matching rates and a wild approximation of >> clock cycles needed for a match in a "port,net" test with 10 entries >> from selftests/netfilter/nft_concat_range.sh, limited to the first >> field, i.e. the port (with nft_set_rbtree initialisation skipped), run >> on a single AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB Please consider reverting this patch. You have papered over the actual problem, which is that the kernel does not get the AVX pipeline stalls right. LDMXCSR merely exacerbates the problem, but your patch won't really fix it. A real fix is on my radar. If you end up applying this patch, I'll probably revert it later.