On Wed, 2024-06-26 at 14:55 +0800, Fred Li wrote: > When using calico bpf based NAT, hits a kernel BUG_ON at function > skb_segment(), line 4560. Performing NAT translation when accessing > a Service IP across subnets, the calico will encap vxlan and calls the > bpf_skb_adjust_room to decrease the gso_size, and then call bpf_redirect > send packets out. > > 4438 struct sk_buff *skb_segment(struct sk_buff *head_skb, > 4439 netdev_features_t features) > 4440 { > 4441 struct sk_buff *segs = NULL; > 4442 struct sk_buff *tail = NULL; > ... > 4558 if (hsize <= 0 && i >= nfrags && skb_headlen(list_skb) && > 4559 (skb_headlen(list_skb) == len || sg)) { > 4560 BUG_ON(skb_headlen(list_skb) > len); > 4561 > 4562 nskb = skb_clone(list_skb, GFP_ATOMIC); > 4563 if (unlikely(!nskb)) > 4564 goto err; > > call stack: > ... > [exception RIP: skb_segment+3016] > RIP: ffffffffb97df2a8 RSP: ffffa3f2cce08728 RFLAGS: 00010293 > RAX: 000000000000007d RBX: 00000000fffff7b3 RCX: 0000000000000011 > RDX: 0000000000000000 RSI: ffff895ea32c76c0 RDI: 00000000000008c1 > RBP: ffffa3f2cce087f8 R8: 000000000000088f R9: 0000000000000011 > R10: 000000000000090c R11: ffff895e47e68000 R12: ffff895eb2022f00 > R13: 000000000000004b R14: ffff895ecdaf2000 R15: ffff895eb2023f00 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #9 [ffffa3f2cce08720] skb_segment at ffffffffb97ded63 > ... > > The skb has the following properties: > doffset = 66 > list_skb = skb_shinfo(skb)->frag_list > list_skb->head_frag = true > skb->len = 2441 && skb->data_len = 2250 > skb_shinfo(skb)->nr_frags = 17 > skb_shinfo(skb)->gso_size = 75 > skb_shinfo(skb)->frags[0...16].bv_len = 125 > list_skb->len = 125 > list_skb->data_len = 0 > > When slicing the frag_list skb, there three cases: > 1. Only *non* head_frag > sg will be false, only when skb_headlen(list_skb)==len is satisfied, > it will enter the branch at line 4560, and there will be no crash. > 2. Mixed head_frag > sg will be false, Only when skb_headlen(list_skb)==len is satisfied, > it will enter the branch at line 4560, and there will be no crash. > 3. Only frag_list with head_frag=true > sg is true, three cases below: > (1) skb_headlen(list_skb)==len is satisfied, it will enter the branch > at line 4560, and there will be no crash. > (2) skb_headlen(list_skb)<len is satisfied, it will enter the branch > at line 4560, and there will be no crash. > (3) skb_headlen(list_skb)>len is satisfied, it will be crash. > > Applying this patch, three cases will be: > 1. Only *non* head_frag > sg will be false. No difference with before. > 2. Mixed head_frag > sg will be false. No difference with before. > 3. Only frag_list with head_frag=true > sg is true, there also three cases: > (1) skb_headlen(list_skb)==len is satisfied, no difference with before. > (2) skb_headlen(list_skb)<len is satisfied, will be revert to copying > in this case. > (3) skb_headlen(list_skb)>len is satisfied, will be revert to copying > in this case. > > Since commit 13acc94eff122("net: permit skb_segment on head_frag frag_list > skb"), it is allowed to segment the head_frag frag_list skb. > > Signed-off-by: Fred Li <dracodingfly@xxxxxxxxx> > --- > net/core/skbuff.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index f0a9ef1aeaa2..b1dab1b071fc 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -4556,7 +4556,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > hsize = skb_headlen(head_skb) - offset; > > if (hsize <= 0 && i >= nfrags && skb_headlen(list_skb) && > - (skb_headlen(list_skb) == len || sg)) { > + (skb_headlen(list_skb) == len)) { > BUG_ON(skb_headlen(list_skb) > len); > > nskb = skb_clone(list_skb, GFP_ATOMIC); I must admit I more than a bit lost in the many turns of skb_segment(), but the above does not look like the correct solution, as it will make the later BUG_ON() unreachable/meaningless. Do I read correctly that when the BUG_ON() triggers: list_skb->len is 125 len is 75 list_skb->frag_head is true It looks like skb_segment() is becoming even and ever more complex to cope with unexpected skb layouts, only possibly when skb_segment() is called by some BPF helpers. I think it should be up to the helper itself to adjust the skb and/or the device features to respect skb_segment() constraints, instead of making the common code increasingly not maintainable. Thanks, Paolo