On Tue, Mar 19, 2024 at 12:36 PM Florian Westphal <fw@xxxxxxxxx> wrote: > > ip_local_out() and other functions can pass skb->sk as function argument. > > If the skb is a fragment and reassembly happens before such function call > returns, the sk must not be released. > > This affects skb fragments reassembled via netfilter or similar > modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline. > > Eric Dumazet made an initial analysis of this bug. Quoting Eric: > Calling ip_defrag() in output path is also implying skb_orphan(), > which is buggy because output path relies on sk not disappearing. > > A relevant old patch about the issue was : > 8282f27449bf ("inet: frag: Always orphan skbs inside ip_defrag()") > > [..] > > net/ipv4/ip_output.c depends on skb->sk being set, and probably to an > inet socket, not an arbitrary one. > > If we orphan the packet in ipvlan, then downstream things like FQ > packet scheduler will not work properly. > > We need to change ip_defrag() to only use skb_orphan() when really > needed, ie whenever frag_list is going to be used. > > Eric suggested to stash sk in fragment queue and made an initial patch. > However there is a problem with this: > > If skb is refragmented again right after, ip_do_fragment() will copy > head->sk to the new fragments, and sets up destructor to sock_wfree. > IOW, we have no choice but to fix up sk_wmem accouting to reflect the > fully reassembled skb, else wmem will underflow. > > This change moves the orphan down into the core, to last possible moment. > As ip_defrag_offset is aliased with sk_buff->sk member, we must move the > offset into the FRAG_CB, else skb->sk gets clobbered. > > This allows to delay the orphaning long enough to learn if the skb has > to be queued or if the skb is completing the reasm queue. > > In the former case, things work as before, skb is orphaned. This is > safe because skb gets queued/stolen and won't continue past reasm engine. > > In the latter case, we will steal the skb->sk reference, reattach it to > the head skb, and fix up wmem accouting when inet_frag inflates truesize. > > Diagnosed-by: Eric Dumazet <edumazet@xxxxxxxxxx> > Reported-by: xingwei lee <xrivendell7@xxxxxxxxx> > Reported-by: yue sun <samsun1006219@xxxxxxxxx> > Reported-by: syzbot+e5167d7144a62715044c@xxxxxxxxxxxxxxxxxxxxxxxxx > Signed-off-by: Florian Westphal <fw@xxxxxxxxx> > --- > include/linux/skbuff.h | 7 +-- > net/ipv4/inet_fragment.c | 71 ++++++++++++++++++++----- > net/ipv4/ip_fragment.c | 2 +- > net/ipv6/netfilter/nf_conntrack_reasm.c | 2 +- > 4 files changed, 61 insertions(+), 21 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 7d56ce195120..6d08ff8a9357 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -753,8 +753,6 @@ typedef unsigned char *sk_buff_data_t; > * @list: queue head > * @ll_node: anchor in an llist (eg socket defer_list) > * @sk: Socket we are owned by > - * @ip_defrag_offset: (aka @sk) alternate use of @sk, used in > - * fragmentation management > * @dev: Device we arrived on/are leaving by > * @dev_scratch: (aka @dev) alternate use of @dev when @dev would be %NULL > * @cb: Control buffer. Free for use by every layer. Put private vars here > @@ -875,10 +873,7 @@ struct sk_buff { > struct llist_node ll_node; > }; > > - union { > - struct sock *sk; > - int ip_defrag_offset; > - }; > + struct sock *sk; > > union { > ktime_t tstamp; > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c > index 7072fc0783ef..7254b640ba06 100644 > --- a/net/ipv4/inet_fragment.c > +++ b/net/ipv4/inet_fragment.c > @@ -24,6 +24,8 @@ > #include <net/ip.h> > #include <net/ipv6.h> > > +#include "../core/sock_destructor.h" > + > /* Use skb->cb to track consecutive/adjacent fragments coming at > * the end of the queue. Nodes in the rb-tree queue will > * contain "runs" of one or more adjacent fragments. > @@ -39,6 +41,7 @@ struct ipfrag_skb_cb { > }; > struct sk_buff *next_frag; > int frag_run_len; > + int ip_defrag_offset; > }; > > #define FRAG_CB(skb) ((struct ipfrag_skb_cb *)((skb)->cb)) > @@ -396,12 +399,12 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb, > */ > if (!last) > fragrun_create(q, skb); /* First fragment. */ > - else if (last->ip_defrag_offset + last->len < end) { > + else if (FRAG_CB(last)->ip_defrag_offset + last->len < end) { > /* This is the common case: skb goes to the end. */ > /* Detect and discard overlaps. */ > - if (offset < last->ip_defrag_offset + last->len) > + if (offset < FRAG_CB(last)->ip_defrag_offset + last->len) > return IPFRAG_OVERLAP; > - if (offset == last->ip_defrag_offset + last->len) > + if (offset == FRAG_CB(last)->ip_defrag_offset + last->len) > fragrun_append_to_last(q, skb); > else > fragrun_create(q, skb); > @@ -418,13 +421,13 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb, > > parent = *rbn; > curr = rb_to_skb(parent); > - curr_run_end = curr->ip_defrag_offset + > + curr_run_end = FRAG_CB(curr)->ip_defrag_offset + > FRAG_CB(curr)->frag_run_len; > - if (end <= curr->ip_defrag_offset) > + if (end <= FRAG_CB(curr)->ip_defrag_offset) > rbn = &parent->rb_left; > else if (offset >= curr_run_end) > rbn = &parent->rb_right; > - else if (offset >= curr->ip_defrag_offset && > + else if (offset >= FRAG_CB(curr)->ip_defrag_offset && > end <= curr_run_end) > return IPFRAG_DUP; > else > @@ -438,23 +441,39 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb, > rb_insert_color(&skb->rbnode, &q->rb_fragments); > } > > - skb->ip_defrag_offset = offset; > + FRAG_CB(skb)->ip_defrag_offset = offset; > > return IPFRAG_OK; > } > EXPORT_SYMBOL(inet_frag_queue_insert); > > +void tcp_wfree(struct sk_buff *skb); Thanks a lot Florian for looking at this ! Since you had : #include "../core/sock_destructor.h", perhaps the line can be removed, because it includes <net/tcp.h>