Hi, On Wed, 2024-03-20 at 15:14 +0100, Eric Dumazet wrote: > On Tue, Mar 19, 2024 at 12:36 PM Florian Westphal <fw@xxxxxxxxx> wrote: > > > > ip_local_out() and other functions can pass skb->sk as function argument. > > > > If the skb is a fragment and reassembly happens before such function call > > returns, the sk must not be released. > > > > This affects skb fragments reassembled via netfilter or similar > > modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline. > > > > Eric Dumazet made an initial analysis of this bug. Quoting Eric: > > Calling ip_defrag() in output path is also implying skb_orphan(), > > which is buggy because output path relies on sk not disappearing. > > > > A relevant old patch about the issue was : > > 8282f27449bf ("inet: frag: Always orphan skbs inside ip_defrag()") > > > > [..] > > > > net/ipv4/ip_output.c depends on skb->sk being set, and probably to an > > inet socket, not an arbitrary one. > > > > If we orphan the packet in ipvlan, then downstream things like FQ > > packet scheduler will not work properly. > > > > We need to change ip_defrag() to only use skb_orphan() when really > > needed, ie whenever frag_list is going to be used. > > > > Eric suggested to stash sk in fragment queue and made an initial patch. > > However there is a problem with this: > > > > If skb is refragmented again right after, ip_do_fragment() will copy > > head->sk to the new fragments, and sets up destructor to sock_wfree. > > IOW, we have no choice but to fix up sk_wmem accouting to reflect the > > fully reassembled skb, else wmem will underflow. > > > > This change moves the orphan down into the core, to last possible moment. > > As ip_defrag_offset is aliased with sk_buff->sk member, we must move the > > offset into the FRAG_CB, else skb->sk gets clobbered. > > > > This allows to delay the orphaning long enough to learn if the skb has > > to be queued or if the skb is completing the reasm queue. > > > > In the former case, things work as before, skb is orphaned. This is > > safe because skb gets queued/stolen and won't continue past reasm engine. > > > > In the latter case, we will steal the skb->sk reference, reattach it to > > the head skb, and fix up wmem accouting when inet_frag inflates truesize. > > > > Diagnosed-by: Eric Dumazet <edumazet@xxxxxxxxxx> > > Reported-by: xingwei lee <xrivendell7@xxxxxxxxx> > > Reported-by: yue sun <samsun1006219@xxxxxxxxx> > > Reported-by: syzbot+e5167d7144a62715044c@xxxxxxxxxxxxxxxxxxxxxxxxx Possibly: Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") it's not very accurate but should be a reasonable oldest affected version. > > Signed-off-by: Florian Westphal <fw@xxxxxxxxx> > > --- > > include/linux/skbuff.h | 7 +-- > > net/ipv4/inet_fragment.c | 71 ++++++++++++++++++++----- > > net/ipv4/ip_fragment.c | 2 +- > > net/ipv6/netfilter/nf_conntrack_reasm.c | 2 +- > > 4 files changed, 61 insertions(+), 21 deletions(-) > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > index 7d56ce195120..6d08ff8a9357 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -753,8 +753,6 @@ typedef unsigned char *sk_buff_data_t; > > * @list: queue head > > * @ll_node: anchor in an llist (eg socket defer_list) > > * @sk: Socket we are owned by > > - * @ip_defrag_offset: (aka @sk) alternate use of @sk, used in > > - * fragmentation management > > * @dev: Device we arrived on/are leaving by > > * @dev_scratch: (aka @dev) alternate use of @dev when @dev would be %NULL > > * @cb: Control buffer. Free for use by every layer. Put private vars here > > @@ -875,10 +873,7 @@ struct sk_buff { > > struct llist_node ll_node; > > }; > > > > - union { > > - struct sock *sk; > > - int ip_defrag_offset; > > - }; > > + struct sock *sk; > > > > union { > > ktime_t tstamp; > > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c > > index 7072fc0783ef..7254b640ba06 100644 > > --- a/net/ipv4/inet_fragment.c > > +++ b/net/ipv4/inet_fragment.c > > @@ -24,6 +24,8 @@ > > #include <net/ip.h> > > #include <net/ipv6.h> > > > > +#include "../core/sock_destructor.h" > > + > > /* Use skb->cb to track consecutive/adjacent fragments coming at > > * the end of the queue. Nodes in the rb-tree queue will > > * contain "runs" of one or more adjacent fragments. > > @@ -39,6 +41,7 @@ struct ipfrag_skb_cb { > > }; > > struct sk_buff *next_frag; > > int frag_run_len; > > + int ip_defrag_offset; > > }; > > > > #define FRAG_CB(skb) ((struct ipfrag_skb_cb *)((skb)->cb)) > > @@ -396,12 +399,12 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb, > > */ > > if (!last) > > fragrun_create(q, skb); /* First fragment. */ > > - else if (last->ip_defrag_offset + last->len < end) { > > + else if (FRAG_CB(last)->ip_defrag_offset + last->len < end) { > > /* This is the common case: skb goes to the end. */ > > /* Detect and discard overlaps. */ > > - if (offset < last->ip_defrag_offset + last->len) > > + if (offset < FRAG_CB(last)->ip_defrag_offset + last->len) > > return IPFRAG_OVERLAP; > > - if (offset == last->ip_defrag_offset + last->len) > > + if (offset == FRAG_CB(last)->ip_defrag_offset + last->len) > > fragrun_append_to_last(q, skb); > > else > > fragrun_create(q, skb); > > @@ -418,13 +421,13 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb, > > > > parent = *rbn; > > curr = rb_to_skb(parent); > > - curr_run_end = curr->ip_defrag_offset + > > + curr_run_end = FRAG_CB(curr)->ip_defrag_offset + > > FRAG_CB(curr)->frag_run_len; > > - if (end <= curr->ip_defrag_offset) > > + if (end <= FRAG_CB(curr)->ip_defrag_offset) > > rbn = &parent->rb_left; > > else if (offset >= curr_run_end) > > rbn = &parent->rb_right; > > - else if (offset >= curr->ip_defrag_offset && > > + else if (offset >= FRAG_CB(curr)->ip_defrag_offset && > > end <= curr_run_end) > > return IPFRAG_DUP; > > else > > @@ -438,23 +441,39 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb, > > rb_insert_color(&skb->rbnode, &q->rb_fragments); > > } > > > > - skb->ip_defrag_offset = offset; > > + FRAG_CB(skb)->ip_defrag_offset = offset; > > > > return IPFRAG_OK; > > } > > EXPORT_SYMBOL(inet_frag_queue_insert); > > > > +void tcp_wfree(struct sk_buff *skb); > > Thanks a lot Florian for looking at this ! > > Since you had : #include "../core/sock_destructor.h", perhaps the line > can be removed, > because it includes <net/tcp.h> I think Florian will not able to reply for a few days. Since the issue looks ancient and we are early in the cycle, I guess there are no problems with that. Cheers, Paolo