Re: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

Ben Greear <greearb@xxxxxxxxxxxxxxx> · Tue, 4 Aug 2020 06:22:39 -0700

On 8/4/20 6:08 AM, Ard Biesheuvel wrote:
On Tue, 4 Aug 2020 at 15:01, Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote:

On 8/4/20 5:55 AM, Ard Biesheuvel wrote:
On Mon, 3 Aug 2020 at 21:11, Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote:

Hello,

This helps a bit...now download sw-crypt performance is about 150Mbps,
but still not as good as with my patch on 5.4 kernel, and fpu is still
high in perf top:

       13.89%  libc-2.29.so   [.] __memset_sse2_unaligned_erms
        6.62%  [kernel]       [k] kernel_fpu_begin
        4.14%  [kernel]       [k] _aesni_enc1
        2.06%  [kernel]       [k] __crypto_xor
        1.95%  [kernel]       [k] copy_user_generic_string
        1.93%  libjvm.so      [.] SpinPause
        1.01%  [kernel]       [k] aesni_encrypt
        0.98%  [kernel]       [k] crypto_ctr_crypt
        0.93%  [kernel]       [k] udp_sendmsg
        0.78%  [kernel]       [k] crypto_inc
        0.74%  [kernel]       [k] __ip_append_data.isra.53
        0.65%  [kernel]       [k] aesni_cbc_enc
        0.64%  [kernel]       [k] __dev_queue_xmit
        0.62%  [kernel]       [k] ipt_do_table
        0.62%  [kernel]       [k] igb_xmit_frame_ring
        0.59%  [kernel]       [k] ip_route_output_key_hash_rcu
        0.57%  [kernel]       [k] memcpy
        0.57%  libjvm.so      [.] InstanceKlass::oop_follow_contents
        0.56%  [kernel]       [k] irq_fpu_usable
        0.56%  [kernel]       [k] mac_do_update

If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
then I can help.  Possibly hwsim would also be a good test case, but I have not tried
that.

I don't think this is likely to be reproducible on other
micro-architectures, so setting up a test rig is unlikely to help.

I'll send out a v2 which implements an ahash instead of a shash (and
implements some other tweaks) so that kernel_fpu_begin() is only
called twice for each packet on the cbcmac path.
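
To illustrate why the number of kernel_fpu_begin() calls per packet matters, here is a rough userspace model, not kernel code. It assumes that the per-block path opens one FPU section per 16-byte AES block, and that the batched path opens one per pass over the packet. The function names and the 1500-byte payload are hypothetical and chosen only for illustration.

```python
AES_BLOCK_SIZE = 16  # bytes per AES block


def blocks_in(nbytes):
    """Number of AES blocks needed to cover nbytes (ceiling division)."""
    return -(-nbytes // AES_BLOCK_SIZE)


def fpu_sections_per_block(nbytes):
    """Model: every block encryption opens and closes its own FPU section."""
    return blocks_in(nbytes)


def fpu_sections_batched(nbytes, passes=2):
    """Model: one FPU section per pass over the packet (assumed: 2 passes)."""
    return passes if nbytes else 0


payload = 1500  # assumed payload size in bytes, not a figure from this thread
print("per-block FPU sections:", fpu_sections_per_block(payload))
print("batched FPU sections:  ", fpu_sections_batched(payload))
```

Under these assumptions the per-block count grows linearly with packet size while the batched count stays constant, which is the effect the v2 change is aiming for.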

Do you have any numbers for the old kernel without your patch? This
pathological FPU preserve/restore behavior could be caused by the
optimizations, or by other changes that landed in the meantime, so I
would like to know if kernel_fpu_begin() is as prominent in those
traces as well.

This same patch makes i7 mobile processors able to handle 1Gbps+ software
decrypt rates, where without the patch, the rate was badly constrained and CPU
load was much higher, so it is definitely noticeable on other processors too.

OK

The weak processor on the current test rig is convenient because the problem
is so noticeable even at slower wifi speeds.

We can do some tests on 5.4 with our patch reverted.

The issue with your CCM patch is that it keeps the FPU enabled for the
entire input, which also means that preemption is disabled, which
makes the -rt people grumpy. (Of course, it also uses APIs that no
longer exist, but that should be easy to fix.)

So, if there is no other way to get back the performance, can it be a compile
or runtime option (disabled by default for -RT type folks) to re-enable the feature
that helps our CPU usage?

Or, can you do an add-on patch to enable keeping fpu enabled so that I can test
how that affects our performance?

Do you happen to have any ballpark figures for the packet sizes and
the time spent doing encryption?

This test was using MTU UDP frames I think, and mostly it is just sending
and receiving frames.  perf top output gives you as much detail as I have about
what the kernel is spending time doing.
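
For a rough sense of scale, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate: the 150 Mbps figure is the rate mentioned earlier in this thread, while the 1500-byte frame size is an assumption (a typical Ethernet MTU), and the per-packet budget is the total wall-clock time available per packet at that rate, not a measured encryption time.

```python
LINK_RATE_BPS = 150e6  # ~150 Mbps, the rate reported earlier in this thread
FRAME_BYTES = 1500     # assumed frame size (typical Ethernet MTU), not confirmed
AES_BLOCK_SIZE = 16    # bytes per AES block


def packets_per_second(rate_bps, frame_bytes):
    """Packets per second needed to sustain rate_bps with frame_bytes frames."""
    return rate_bps / (frame_bytes * 8)


def per_packet_budget_us(rate_bps, frame_bytes):
    """Wall-clock microseconds available per packet at this rate."""
    return 1e6 / packets_per_second(rate_bps, frame_bytes)


def aes_blocks(frame_bytes):
    """AES blocks needed to cover one frame (ceiling division)."""
    return -(-frame_bytes // AES_BLOCK_SIZE)


print("packets/s:          ", packets_per_second(LINK_RATE_BPS, FRAME_BYTES))
print("budget per packet us:", per_packet_budget_us(LINK_RATE_BPS, FRAME_BYTES))
print("AES blocks per frame:", aes_blocks(FRAME_BYTES))
```

The budget covers everything done per packet (network stack, driver, crypto), so the time actually spent in encryption would be some fraction of it.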

Thanks,
Ben

--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc  http://www.candelatech.com