On Thu, 10 Sep 2020 12:02:48 +0100
Lorenz Bauer <lmb@xxxxxxxxxxxxxx> wrote:

> As Alexei points out [1], struct bpf_sk_lookup_kern has two 4-byte
> holes. This leads to suboptimal instructions being generated
> (IPv4, x86):
>
>   1372                struct bpf_sk_lookup_kern ctx = {
>      0xffffffff81b87f30 <+624>:  xor    %eax,%eax
>      0xffffffff81b87f32 <+626>:  mov    $0x6,%ecx
>      0xffffffff81b87f37 <+631>:  lea    0x90(%rsp),%rdi
>      0xffffffff81b87f3f <+639>:  movl   $0x110002,0x88(%rsp)
>      0xffffffff81b87f4a <+650>:  rep stos %rax,%es:(%rdi)
>      0xffffffff81b87f4d <+653>:  mov    0x8(%rsp),%eax
>      0xffffffff81b87f51 <+657>:  mov    %r13d,0x90(%rsp)
>      0xffffffff81b87f59 <+665>:  incl   %gs:0x7e4970a0(%rip)
>      0xffffffff81b87f60 <+672>:  mov    %eax,0x8c(%rsp)
>      0xffffffff81b87f67 <+679>:  movzwl 0x10(%rsp),%eax
>      0xffffffff81b87f6c <+684>:  mov    %ax,0xa8(%rsp)
>      0xffffffff81b87f74 <+692>:  movzwl 0x38(%rsp),%eax
>      0xffffffff81b87f79 <+697>:  mov    %ax,0xaa(%rsp)
>
> Fix this by moving sport and dport up, next to family and protocol.
> pahole confirms there are no more holes:
>
>   struct bpf_sk_lookup_kern {
>           u16                        family;        /*     0     2 */
>           u16                        protocol;      /*     2     2 */
>           __be16                     sport;         /*     4     2 */
>           u16                        dport;         /*     6     2 */
>           struct {
>                   __be32             saddr;         /*     8     4 */
>                   __be32             daddr;         /*    12     4 */
>           } v4;                                     /*     8     8 */
>           struct {
>                   const struct in6_addr  *saddr;    /*    16     8 */
>                   const struct in6_addr  *daddr;    /*    24     8 */
>           } v6;                                     /*    16    16 */
>           struct sock *              selected_sk;   /*    32     8 */
>           bool                       no_reuseport;  /*    40     1 */
>
>           /* size: 48, cachelines: 1, members: 8 */
>           /* padding: 7 */
>           /* last cacheline: 48 bytes */
>   };
>
> The assembly also no longer contains the pesky rep stos:
>
>   1372                struct bpf_sk_lookup_kern ctx = {
>      0xffffffff81b87f60 <+624>:  movzwl 0x10(%rsp),%eax
>      0xffffffff81b87f65 <+629>:  movq   $0x0,0xa8(%rsp)
>      0xffffffff81b87f71 <+641>:  movq   $0x0,0xb0(%rsp)
>      0xffffffff81b87f7d <+653>:  mov    %ax,0x9c(%rsp)
>      0xffffffff81b87f85 <+661>:  movzwl 0x38(%rsp),%eax
>      0xffffffff81b87f8a <+666>:  movq   $0x0,0xb8(%rsp)
>      0xffffffff81b87f96 <+678>:  mov    %ax,0x9e(%rsp)
>      0xffffffff81b87f9e <+686>:  mov    0x8(%rsp),%eax
>      0xffffffff81b87fa2 <+690>:  movq   $0x0,0xc0(%rsp)
>      0xffffffff81b87fae <+702>:  movl   $0x110002,0x98(%rsp)
>      0xffffffff81b87fb9 <+713>:  mov    %eax,0xa0(%rsp)
>      0xffffffff81b87fc0 <+720>:  mov    %r13d,0xa4(%rsp)
>
> 1: https://lore.kernel.org/bpf/CAADnVQKE6y9h2fwX6OS837v-Uf+aBXnT_JXiN_bbo2gitZQ3tA@xxxxxxxxxxxxxx/
>
> Fixes: e9ddbb7707ff ("bpf: Introduce SK_LOOKUP program type with a dedicated attach point")
> Suggested-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> Signed-off-by: Lorenz Bauer <lmb@xxxxxxxxxxxxxx>

Acked-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>

I'm very happy to see that others have also discovered the slowdown
caused by 'rep stos', as I've been hunting these for years.

My understanding is that the 'rep stos' slowdown comes from the CPU
instruction saving the CPU flags so that it can be interrupted. That
makes sense when memset is zeroing a large area, but for small structs
it is slower than clearing the memory in other ways. I have a
micro-benchmark as a kernel module here [2], where I explore different
methods of memset.

[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c

As you have discovered, GCC will generate a 'rep stos' to clear a
struct when not all of its members are initialized. If you want to fix
more of these, I remember there were some in the
net/core/flow_dissector.c code.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
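For readers without a kernel tree at hand, the layout problem can be
reproduced with a minimal standalone sketch. The structs below are
illustrative stand-ins, not the kernel definitions: the "after" layout
mirrors the pahole output quoted above, the "before" layout is inferred
from the two 4-byte holes the commit message describes, and the offsets
assume a typical LP64 ABI (e.g. x86-64 Linux).

/* layout.c - stand-ins for the pre- and post-fix layouts.
 * Build and inspect with: gcc -O2 -g -c layout.c && pahole layout.o
 */
#include <stdint.h>

struct lookup_before {
	uint16_t family;			/* offset  0 */
	uint16_t protocol;			/* offset  2 */
	struct { uint32_t saddr, daddr; } v4;	/* offset  4 */
	/* 4-byte hole: v6 holds pointers, so it needs 8-byte alignment */
	struct { const void *saddr, *daddr; } v6;	/* offset 16 */
	uint16_t sport;				/* offset 32 */
	uint16_t dport;				/* offset 34 */
	/* 4-byte hole: selected_sk needs 8-byte alignment */
	void *selected_sk;			/* offset 40 */
	_Bool no_reuseport;			/* offset 48, + 7 bytes tail pad */
};

struct lookup_after {	/* sport/dport moved up into the first hole */
	uint16_t family;			/* offset  0 */
	uint16_t protocol;			/* offset  2 */
	uint16_t sport;				/* offset  4 */
	uint16_t dport;				/* offset  6 */
	struct { uint32_t saddr, daddr; } v4;	/* offset  8 */
	struct { const void *saddr, *daddr; } v6;	/* offset 16 */
	void *selected_sk;			/* offset 32 */
	_Bool no_reuseport;			/* offset 40, + 7 bytes tail pad */
};

_Static_assert(sizeof(struct lookup_before) == 56, "two 4-byte holes");
_Static_assert(sizeof(struct lookup_after) == 48, "no holes left");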
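The codegen behaviour itself can be observed with a sketch like the one
below, assuming GCC at -O2 on x86-64. Whether 'rep stos' actually shows
up depends on compiler version and struct size, so treat it as an
experiment rather than a guaranteed outcome; consume() is a hypothetical
opaque sink that only exists to keep the initializer alive.

/* init.c - how GCC zeroes a partially-initialized struct.
 * Compile with: gcc -O2 -S init.c  and search init.s for 'rep stos'.
 */
struct holey_ctx {	/* same shape as struct lookup_before above */
	unsigned short family, protocol;
	struct { unsigned int saddr, daddr; } v4;
	/* 4-byte hole */
	struct { const void *saddr, *daddr; } v6;
	unsigned short sport, dport;
	/* 4-byte hole */
	void *selected_sk;
	_Bool no_reuseport;
};

void consume(struct holey_ctx *ctx);	/* defined elsewhere */

void partial_init(const void *s6, const void *d6, unsigned short sport)
{
	/* Not every member is named, so the compiler must zero the
	 * rest of the struct, including the padding holes. */
	struct holey_ctx ctx = {
		.v6.saddr = s6,
		.v6.daddr = d6,
		.sport	  = sport,
	};
	consume(&ctx);
}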
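In the spirit of the kernel-module micro-benchmark linked as [2], here
is a very rough userspace analogue that compares the compiler's small
memset against an explicit 'rep stosq' for a struct-sized buffer. It is
x86-64 only and uses plain clock_gettime() instead of the module's
time_bench framework, so take the numbers as indicative at best.

/* bench.c - compare memset vs. 'rep stosq' for a 56-byte clear.
 * Build with: gcc -O2 bench.c -o bench   (x86-64 only)
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ROUNDS 100000000UL

static unsigned char buf[56];	/* sized like the pre-fix struct */

static uint64_t nsecs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Zero 'qwords' 8-byte words at 'p' using 'rep stosq'. */
static void zero_rep_stos(void *p, unsigned long qwords)
{
	__asm__ __volatile__("rep stosq"
			     : "+D" (p), "+c" (qwords)
			     : "a" (0UL)
			     : "memory");
}

static void clear_memset(void)
{
	memset(buf, 0, sizeof(buf));
	/* Barrier so the stores are not optimized away. */
	__asm__ __volatile__("" : : "r" (buf) : "memory");
}

static void clear_repstos(void)
{
	zero_rep_stos(buf, sizeof(buf) / 8);
}

static double bench(void (*clear)(void))
{
	uint64_t t0 = nsecs();

	for (unsigned long i = 0; i < ROUNDS; i++)
		clear();
	return (double)(nsecs() - t0) / ROUNDS;
}

int main(void)
{
	printf("memset(56 bytes): %.2f ns/op\n", bench(clear_memset));
	printf("rep stosq (x7)  : %.2f ns/op\n", bench(clear_repstos));
	return 0;
}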