Neal Shukla <nshukla@xxxxxxxxxxxxx> writes:

> We’ve been introducing bpf_tail_call’s into our XDP programs and have run into
> packet loss and latency increases when performing load tests. After profiling
> our code we’ve come to the conclusion that this is the problem area in our code:
> `int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`
>
> This is the first time we read from the packet in the first XDP program. We have
> yet to make a tail call at this point. However, we do write into the metadata
> section prior to this line.
>
> How We Profiled Our Code:
> To profile our code, we used https://github.com/iovisor/bpftrace. We ran this
> command while sending traffic to our machine:
> `sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > /tmp/stack_samples.out`
>
> From there we got a kernel stack trace with the most frequently counted spots at
> the bottom of the output file. The most commonly hit spot aside from the CPU
> idle looks like:
> ```
> @[
>     bpf_prog_986b0b3beb6f0873_some_program+290
>     i40e_napi_poll+1897
>     net_rx_action+309
>     __softirqentry_text_start+202
>     run_ksoftirqd+38
>     smpboot_thread_fn+197
>     kthread+283
>     ret_from_fork+34
> ]: 8748
> ```
>
> We then took the program tag and ran this command to retrieve the jited code:
> `sudo bpftool prog dump jited tag 986b0b3beb6f0873`
>
> By converting the decimal offset (290) from the stack trace to hex format (122)
> we found the line which it’s referring to in the jited code:
> ```
> 11d:   movzbq 0xc(%r15),%rsi
> 122:   movzbq 0xd(%r15),%rdi
> 127:   shl    $0x8,%rdi
> 12b:   or     %rsi,%rdi
> 12e:   ror    $0x8,%di
> 132:   movzwl %di,%edi
> ```
> We've mapped this portion to refer to the line mentioned earlier:
> `int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`
>
> 1) Are we correctly profiling our XDP programs?
>
> 2) Is there a reason why our first read into the packet would cause this issue?
> And what would be the best way to solve the issue?
> We've theorized it may have to do with cache or TLB misses as we've added a lot
> more instructions to our programs.

Yeah, this sounds like a caching issue. What system are you running this
on? Intel's DDIO feature that DMAs packets directly to L3 cache tends to
help with these sorts of things, but maybe your system doesn't have
that, or it's not being used for some reason?

Adding a few other people who have a better grasp of these details than
me, in the hope that they can be more helpful :)

-Toke
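
For anyone following along, the mapping between the JIT dump and the quoted C line can be sanity-checked in user space. What follows is only a rough sketch, not the original XDP program: it replays the six dumped instructions on a plain byte buffer, assuming that %r15 holds the packet data pointer and that offset 0xc is h_proto in struct ethhdr (6 bytes destination MAC plus 6 bytes source MAC). The names load_h_proto_like_jit, check_example, H_PROTO_OFFSET and ETH_HEADER_LEN are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: h_proto follows two 6-byte MAC addresses (0xc in the dump). */
#define H_PROTO_OFFSET 12
#define ETH_HEADER_LEN 14

/*
 * Hypothetical user-space replay of the dumped instructions.
 * Returns h_proto in host byte order, or -1 if the frame is too short
 * (the equivalent of the data_end bounds check an XDP program needs).
 */
int load_h_proto_like_jit(const uint8_t *frame, size_t len)
{
    if (len < ETH_HEADER_LEN)
        return -1;

    uint64_t rsi = frame[H_PROTO_OFFSET];      /* 11d: movzbq 0xc(%r15),%rsi */
    uint64_t rdi = frame[H_PROTO_OFFSET + 1];  /* 122: movzbq 0xd(%r15),%rdi */
    rdi <<= 8;                                 /* 127: shl $0x8,%rdi         */
    rdi |= rsi;                                /* 12b: or %rsi,%rdi          */

    /* 12e: ror $0x8,%di rotates the low 16 bits by 8, i.e. a byte swap. */
    uint16_t di = (uint16_t)rdi;
    di = (uint16_t)((di >> 8) | (di << 8));

    return (int)di;                            /* 132: movzwl %di,%edi       */
}

/* Hypothetical self-check; call it from main() to run the example. */
void check_example(void)
{
    uint8_t frame[ETH_HEADER_LEN] = {0};

    frame[12] = 0x08;  /* EtherType 0x0800 (IPv4) in network byte order */
    frame[13] = 0x00;

    assert(load_h_proto_like_jit(frame, sizeof(frame)) == 0x0800);
    assert(290 == 0x122);  /* stack-trace offset +290 is 0x122 in the dump */
}
```

The sample at +290 (0x122) sits on the second byte load. Being sampled inside this sequence is consistent with the first touch of packet data stalling on a cache miss, though the sketch itself cannot show that.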