On Mon, 3 Aug 2020 at 21:11, Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> This helps a bit...now download sw-crypt performance is about 150Mbps,
> but still not as good as with my patch on 5.4 kernel, and fpu is still
> high in perf top:
>
>     13.89%  libc-2.29.so  [.] __memset_sse2_unaligned_erms
>      6.62%  [kernel]      [k] kernel_fpu_begin
>      4.14%  [kernel]      [k] _aesni_enc1
>      2.06%  [kernel]      [k] __crypto_xor
>      1.95%  [kernel]      [k] copy_user_generic_string
>      1.93%  libjvm.so     [.] SpinPause
>      1.01%  [kernel]      [k] aesni_encrypt
>      0.98%  [kernel]      [k] crypto_ctr_crypt
>      0.93%  [kernel]      [k] udp_sendmsg
>      0.78%  [kernel]      [k] crypto_inc
>      0.74%  [kernel]      [k] __ip_append_data.isra.53
>      0.65%  [kernel]      [k] aesni_cbc_enc
>      0.64%  [kernel]      [k] __dev_queue_xmit
>      0.62%  [kernel]      [k] ipt_do_table
>      0.62%  [kernel]      [k] igb_xmit_frame_ring
>      0.59%  [kernel]      [k] ip_route_output_key_hash_rcu
>      0.57%  [kernel]      [k] memcpy
>      0.57%  libjvm.so     [.] InstanceKlass::oop_follow_contents
>      0.56%  [kernel]      [k] irq_fpu_usable
>      0.56%  [kernel]      [k] mac_do_update
>
> If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
> then I can help. Possibly hwsim would also be a good test case, but I have not tried
> that.
>

I don't think this is likely to be reproducible on other
micro-architectures, so setting up a test rig is unlikely to help.

I'll send out a v2 which implements an ahash instead of a shash (and
implements some other tweaks) so that kernel_fpu_begin() is only
called twice for each packet on the cbcmac path.

Do you have any numbers for the old kernel without your patch? This
pathological FPU preserve/restore behavior could be caused by the
optimizations, or by other changes that landed in the meantime, so I
would like to know if kernel_fpu_begin() is as prominent in those
traces as well.