We've been introducing bpf_tail_call()s into our XDP programs and have run into packet loss and latency increases when performing load tests. After profiling our code, we've concluded that the problem area is this line:

`int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`

This is the first time we read from the packet in the first XDP program; we have yet to make a tail call at this point. However, we do write into the metadata section prior to this line (a simplified sketch of the prologue is at the end of this post).

How We Profiled Our Code:

To profile our code, we used https://github.com/iovisor/bpftrace. We ran this command while sending traffic to our machine:

`sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > /tmp/stack_samples.out`

From there we got kernel stack traces, with the most frequently counted ones at the bottom of the output file. The most commonly hit spot, aside from CPU idle, looks like:

```
@[
    bpf_prog_986b0b3beb6f0873_some_program+290
    i40e_napi_poll+1897
    net_rx_action+309
    __softirqentry_text_start+202
    run_ksoftirqd+38
    smpboot_thread_fn+197
    kthread+283
    ret_from_fork+34
]: 8748
```

We then took the program tag from the symbol name and ran this command to retrieve the JITed code:

`sudo bpftool prog dump jited tag 986b0b3beb6f0873`

By converting the decimal offset from the stack trace (290) to hex (0x122), we found the instructions it refers to in the JITed code:

```
 11d: movzbq 0xc(%r15),%rsi
 122: movzbq 0xd(%r15),%rdi
 127: shl    $0x8,%rdi
 12b: or     %rsi,%rdi
 12e: ror    $0x8,%di
 132: movzwl %di,%edi
```

We've mapped this portion back to the line mentioned earlier:

`int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);`

1) Are we correctly profiling our XDP programs?

2) Is there a reason why our first read into the packet would cause this issue, and what would be the best way to solve it? We've theorized it may have to do with cache or TLB misses, as we've added a lot more instructions to our programs.

Thanks for your time!
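
For context, here is a minimal sketch of what the first program's prologue roughly looks like. The struct, field, function, and section names are illustrative rather than our exact code; the point is just the ordering of the metadata write and the first packet read:

```c
/* Simplified sketch of the first XDP program (names are illustrative). */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical data we stash in the xdp_md metadata area. */
struct meta {
	__u32 flags;
};

SEC("xdp")
int first_program(struct xdp_md *ctx)
{
	/* Reserve room in front of the packet for our metadata. */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct meta)))
		return XDP_ABORTED;

	void *data      = (void *)(long)ctx->data;
	void *data_end  = (void *)(long)ctx->data_end;
	void *data_meta = (void *)(long)ctx->data_meta;

	/* Write into the metadata section before touching the packet. */
	struct meta *m = data_meta;
	if ((void *)(m + 1) > data)
		return XDP_ABORTED;
	m->flags = 0;

	/* First read from the packet itself -- the line that shows up hot. */
	struct ethhdr *ethernet_header = data;
	if ((void *)(ethernet_header + 1) > data_end)
		return XDP_PASS;

	int layer3_protocol = bpf_ntohs(ethernet_header->h_proto);

	/* ... dispatch via bpf_tail_call() based on layer3_protocol ... */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

So the only access before the hot load is the bpf_xdp_adjust_meta() call and the store into the metadata area; the `h_proto` load is the first time we dereference packet data itself.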