We've been introducing bpf_tail_call()s into our XDP programs and have run into packet loss and latency increases when performing load tests. After profiling our code, we've concluded that the problem area is this line:

`int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`

This is the first time we read from the packet in the first XDP program; we have yet to make a tail call at this point. However, we do write into the metadata section prior to this line (a simplified sketch of the prologue is at the end of this post).

How We Profiled Our Code:

To profile our code, we used https://github.com/iovisor/bpftrace. We ran this command while sending traffic to our machine:

`sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > /tmp/stack_samples.out`

From there we got kernel stack traces, with the most frequently counted ones at the bottom of the output file. The most commonly hit spot, aside from CPU idle, looks like:

```
@[
    bpf_prog_986b0b3beb6f0873_some_program+290
    i40e_napi_poll+1897
    net_rx_action+309
    __softirqentry_text_start+202
    run_ksoftirqd+38
    smpboot_thread_fn+197
    kthread+283
    ret_from_fork+34
]: 8748
```

We then took the program tag from the symbol name and ran this command to retrieve the JITed code:

`sudo bpftool prog dump jited tag 986b0b3beb6f0873`

By converting the decimal offset from the stack trace (290) to hex (0x122), we found the instructions it refers to in the JITed code:

```
 11d: movzbq 0xc(%r15),%rsi
 122: movzbq 0xd(%r15),%rdi
 127: shl    $0x8,%rdi
 12b: or     %rsi,%rdi
 12e: ror    $0x8,%di
 132: movzwl %di,%edi
```

We've mapped this portion back to the line mentioned earlier:

`int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`

1) Are we correctly profiling our XDP programs?

2) Is there a reason why our first read into the packet would cause this issue, and what would be the best way to solve it? We've theorized it may have to do with cache or TLB misses, as we've added a lot more instructions to our programs.

Thanks for your time!
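
For context, here is a minimal sketch of what the first program's prologue roughly looks like. The struct, field, function, and section names are illustrative rather than our exact code; the point is just the ordering of the metadata write and the first packet read:

```c
/* Simplified sketch of the first XDP program (names are illustrative). */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical data we stash in the xdp_md metadata area. */
struct meta {
	__u32 flags;
};

SEC("xdp")
int first_program(struct xdp_md *ctx)
{
	/* Reserve room in front of the packet for our metadata. */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct meta)))
		return XDP_ABORTED;

	void *data      = (void *)(long)ctx->data;
	void *data_end  = (void *)(long)ctx->data_end;
	void *data_meta = (void *)(long)ctx->data_meta;

	/* Write into the metadata section before touching the packet. */
	struct meta *m = data_meta;
	if ((void *)(m + 1) > data)
		return XDP_ABORTED;
	m->flags = 0;

	/* First read from the packet itself -- the line that shows up hot. */
	struct ethhdr *ethernet_header = data;
	if ((void *)(ethernet_header + 1) > data_end)
		return XDP_PASS;

	int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);

	/* ... dispatch via bpf_tail_call() based on layer3_protocol ... */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

So the only access before the hot load is the bpf_xdp_adjust_meta() call and the store into the metadata area; the `h_proto` load is the first time we dereference packet data itself.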