First of all, thank you for this VERY excellent bug-report! My comments inlined below. Cc. Stephen as this might be related to: commit 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") (Author: Stephen Hemminger). On Wed, 31 Jul 2019 21:12:23 +0000 Brandon Cazander <brandon.cazander@xxxxxxxxxxxxxxx> wrote: > I am having an issue with xdpgeneric specifically when using XDP_PASS after > bpf_xdp_adjust_head to pop some headers off. My test environment is qemu using > virtio_net specifically, but it also happens with e1000 in qemu/physical devices. > > On a real NIC (ixgbe), the same program is successfully passing decapsulated > traffic, but fails in the same way when forcing xdpgeneric mode. Are you sure you are using "xdpgeneric" mode? As virtio_net do have "native" XDP mode. > Here is the packet outside of my VM, with the IP+UDP+tun header. > > # sudo tcpdump -ni ens5-outside -e udp -vvvXc1 > tcpdump: listening on ens5-outside, link-type EN10MB (Ethernet), capture size 262144 bytes > 15:54:14.263306 06:54:00:00:00:01 > 06:54:01:00:00:01, ethertype IPv4 (0x0800), length 127: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto UDP (17), length 113) > 172.64.0.108.39999 > 172.64.0.101.7803: [no cksum] UDP, length 85 > 0x0000: 4500 0071 0000 4000 ff11 222a ac40 006c E..q..@..."*.@.l ^^^^ iph->tot_len $ echo $((0x71)) 113 $ echo $((0x71+14)) 127 > 0x0010: ac40 0065 9c3f 1e7b 005d 0000 0145 0000 .@.e.?.{.]...E.. ^^ ^^ Encapsulated packet starts at a funny offset, making it harder to follow. > 0x0020: 54e3 6b40 003f 01d5 e8c0 a800 02c0 a801 T.k@.?.......... ^^ ^^ Encapsulated iph->tot_len and ipproto=ICMP > 0x0030: 0208 002c b1b0 5eb8 f196 ca40 5d00 0000 ...,..^....@]... > 0x0040: 00c8 0304 0000 0000 0010 1112 1314 1516 ................ > 0x0050: 1718 191a 1b1c 1d1e 1f20 2122 2324 2526 ..........!"#$%& > 0x0060: 2728 292a 2b2c 2d2e 2f30 3132 3334 3536 '()*+,-./0123456 > 0x0070: 37 7 > > My XDP program copies the original ethhdr to the new offset and adjusts the head > forward, and I can see the resulting ICMP packet is valid with tcpdump inside > the VM: > > # tcpdump -niens5 icmp -c1 -vXe > tcpdump: listening on ens5, link-type EN10MB (Ethernet), capture size 262144 bytes > 15:53:25.602109 06:54:00:00:00:01 > 06:54:01:00:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 35986, offset 0, flags [DF], proto ICMP (1), length 84) > 192.168.0.2 > 192.168.1.2: ICMP echo request, id 45150, seq 43683, length 64 > 0x0000: 4500 0054 8c92 4000 3f01 2cc2 c0a8 0002 E..T..@.?.,..... ^^ ^^ iph->tot_len and ipproto=ICMP $ echo $((0x54)) 84 $ echo $((0x54+14)) 98 > 0x0010: c0a8 0102 0800 ed79 b05e aaa3 6dca 405d .......y.^..m.@] > 0x0020: 0000 0000 3589 0d00 0000 0000 1011 1213 ....5........... > 0x0030: 1415 1617 1819 1a1b 1c1d 1e1f 2021 2223 .............!"# > 0x0040: 2425 2627 2829 2a2b 2c2d 2e2f 3031 3233 $%&'()*+,-./0123 > 0x0050: 3435 3637 4567 > > Unfortunately, at this point, the packet is dropped in ip_rcv_core. I added a > perf probe on the specific drop line that I'm hitting, and printing the skb->len > and len (from ntohs(iph->tot_len)) variables. You can see the obviously wrong > len value below, while a packet capture in the VM does show the correct value > for tot_len in the IP header. The len from ntohs(iph->tot_len) is obviously wrong (len=4294931717), but given it comes the packet data, it can easily be a mistake in your adjustment of headers. But from above packet dump, the iph->tot_len is correct. Thus, this indicate that how we find the IP-header offset is likely broken... > # perf probe -L ip_rcv_core:64+4 | cat > <ip_rcv_core@/usr/src/debug/kernel-debug-5.2.2-1.2.x86_64/linux-5.2/linux-obj/../net/ipv4/ip_input.c:64> > 64 if (pskb_trim_rcsum(skb, len)) { > 65 __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS); > 66 goto drop; > } > Looking at the code, I think the packet will be dropped earlier than the pskb_trim_rcsum() call, in the check if (skb->len < len). iph = ip_hdr(skb); [...] len = ntohs(iph->tot_len); if (skb->len < len) { __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS); goto drop; } else if (len < (iph->ihl*4)) goto inhdr_error; /* Our transport medium may have padded the buffer out. Now we know it * is IP we can trim to the true length of the frame. * Note this now means skb->len holds ntohs(iph->tot_len). */ if (pskb_trim_rcsum(skb, len)) { __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS); goto drop; } The bug might hide in iph=ip_hdr(skb) which is found via skb->network_header (via helper ip_hdr => skb_network_header(skb)). > # perf probe -a 'ip_rcv_core:66 skb=skb->len:u32 len=len:u32' > swapper 0 [001] 84794.954487: probe:ip_rcv_core: (ffffffffbcbd5a7d) skb=84 len=4294931717 > swapper 0 [001] 84794.965473: probe:ip_rcv_core: (ffffffffbcbd5a7d) skb=84 len=4294936833 > > In contrast, here's what it looks like in XDP native mode (different line number > but looking): > > # perf probe -a 'ip_rcv_core:57 skb=skb->len:u32 len=len:u32' > swapper 0 [000] 353.187439: probe:ip_rcv_core: (ffffffffac9dcca7) skb=84 len=84 > swapper 0 [003] 353.187577: probe:ip_rcv_core: (ffffffffac9dcca7) skb=84 len=84 > > Here's the relevant portion of my program where I decapsulate: > > static __always_inline int handle_peer_data_ipv4(struct xdp_md *ctx) > { > void *data, *data_end; > struct ethhdr *eth, *orig_eth; > __u32 csum = 0; > > data = (void *)(unsigned long)ctx->data; > data_end = (void *)(unsigned long)ctx->data_end; > > if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) + sizeof(struct udphdr) + sizeof(struct tunnel_header) > data_end) { > return XDP_DROP; > } > > orig_eth = data; > eth = data + sizeof(struct iphdr) + sizeof(struct udphdr) + sizeof(struct tunnel_header); > memcpy(ð->h_source, &orig_eth->h_source, ETH_ALEN); > memcpy(ð->h_dest, &orig_eth->h_dest, ETH_ALEN); > eth->h_proto = __constant_htons(ETH_P_IP); You only support IPv4 encapsulated packets? Do you handle ARP requests separately? > /* Decapsulate by removing IP + UDP + tunnel headers */ > if (bpf_xdp_adjust_head(ctx, (int)(sizeof(struct iphdr) + sizeof(struct udphdr) + > sizeof(struct tunnel_header)))) { > return XDP_DROP; > } > > return XDP_PASS; > } > > struct tunnel_header { > __u8 flags; > }; > > Kernel is 5.2.2-1-debug, OS is openSUSE Tumbleweed 20190724. Can you test an earlier kernel, specifically before: commit 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") (Author: Stephen Hemminger) $ git describe --contains 458bf2f224f04 v5.2-rc3~26^2~11^2 I fear that this commit, which moved generic-XDP to a later call point, might cause this. Because it could be that the SKB network_header update, is now done before calling XDP program (... still looking at code details). > For qemu I run with these arguments for the NICs: > > # qemu (...) -netdev tap,br=vm-bridge,id=hostnet1,ifname=ens5-outside,queues=4,vhost=on \ > -device virtio-net-pci,mq=on,vectors=9,guest_tso4=off,guest_tso6=off,netdev=hostnet1,id=net1,mac=06:54:01:00:00:01 > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer