Re: [PATCH net] inet: inet_defrag: prevent sk release while still in use

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On Wed, 2024-03-20 at 15:14 +0100, Eric Dumazet wrote:
> On Tue, Mar 19, 2024 at 12:36 PM Florian Westphal <fw@xxxxxxxxx> wrote:
> > 
> > ip_local_out() and other functions can pass skb->sk as function argument.
> > 
> > If the skb is a fragment and reassembly happens before such function call
> > returns, the sk must not be released.
> > 
> > This affects skb fragments reassembled via netfilter or similar
> > modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline.
> > 
> > Eric Dumazet made an initial analysis of this bug.  Quoting Eric:
> >   Calling ip_defrag() in output path is also implying skb_orphan(),
> >   which is buggy because output path relies on sk not disappearing.
> > 
> >   A relevant old patch about the issue was :
> >   8282f27449bf ("inet: frag: Always orphan skbs inside ip_defrag()")
> > 
> >   [..]
> > 
> >   net/ipv4/ip_output.c depends on skb->sk being set, and probably to an
> >   inet socket, not an arbitrary one.
> > 
> >   If we orphan the packet in ipvlan, then downstream things like FQ
> >   packet scheduler will not work properly.
> > 
> >   We need to change ip_defrag() to only use skb_orphan() when really
> >   needed, ie whenever frag_list is going to be used.
> > 
> > Eric suggested to stash sk in fragment queue and made an initial patch.
> > However there is a problem with this:
> > 
> > If skb is refragmented again right after, ip_do_fragment() will copy
> > head->sk to the new fragments, and sets up destructor to sock_wfree.
> > IOW, we have no choice but to fix up sk_wmem accouting to reflect the
> > fully reassembled skb, else wmem will underflow.
> > 
> > This change moves the orphan down into the core, to last possible moment.
> > As ip_defrag_offset is aliased with sk_buff->sk member, we must move the
> > offset into the FRAG_CB, else skb->sk gets clobbered.
> > 
> > This allows to delay the orphaning long enough to learn if the skb has
> > to be queued or if the skb is completing the reasm queue.
> > 
> > In the former case, things work as before, skb is orphaned.  This is
> > safe because skb gets queued/stolen and won't continue past reasm engine.
> > 
> > In the latter case, we will steal the skb->sk reference, reattach it to
> > the head skb, and fix up wmem accouting when inet_frag inflates truesize.
> > 
> > Diagnosed-by: Eric Dumazet <edumazet@xxxxxxxxxx>
> > Reported-by: xingwei lee <xrivendell7@xxxxxxxxx>
> > Reported-by: yue sun <samsun1006219@xxxxxxxxx>
> > Reported-by: syzbot+e5167d7144a62715044c@xxxxxxxxxxxxxxxxxxxxxxxxx

Possibly:

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")

it's not very accurate but should be a reasonable oldest affected
version.

> > Signed-off-by: Florian Westphal <fw@xxxxxxxxx>
> > ---
> >  include/linux/skbuff.h                  |  7 +--
> >  net/ipv4/inet_fragment.c                | 71 ++++++++++++++++++++-----
> >  net/ipv4/ip_fragment.c                  |  2 +-
> >  net/ipv6/netfilter/nf_conntrack_reasm.c |  2 +-
> >  4 files changed, 61 insertions(+), 21 deletions(-)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 7d56ce195120..6d08ff8a9357 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -753,8 +753,6 @@ typedef unsigned char *sk_buff_data_t;
> >   *     @list: queue head
> >   *     @ll_node: anchor in an llist (eg socket defer_list)
> >   *     @sk: Socket we are owned by
> > - *     @ip_defrag_offset: (aka @sk) alternate use of @sk, used in
> > - *             fragmentation management
> >   *     @dev: Device we arrived on/are leaving by
> >   *     @dev_scratch: (aka @dev) alternate use of @dev when @dev would be %NULL
> >   *     @cb: Control buffer. Free for use by every layer. Put private vars here
> > @@ -875,10 +873,7 @@ struct sk_buff {
> >                 struct llist_node       ll_node;
> >         };
> > 
> > -       union {
> > -               struct sock             *sk;
> > -               int                     ip_defrag_offset;
> > -       };
> > +       struct sock             *sk;
> > 
> >         union {
> >                 ktime_t         tstamp;
> > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> > index 7072fc0783ef..7254b640ba06 100644
> > --- a/net/ipv4/inet_fragment.c
> > +++ b/net/ipv4/inet_fragment.c
> > @@ -24,6 +24,8 @@
> >  #include <net/ip.h>
> >  #include <net/ipv6.h>
> > 
> > +#include "../core/sock_destructor.h"
> > +
> >  /* Use skb->cb to track consecutive/adjacent fragments coming at
> >   * the end of the queue. Nodes in the rb-tree queue will
> >   * contain "runs" of one or more adjacent fragments.
> > @@ -39,6 +41,7 @@ struct ipfrag_skb_cb {
> >         };
> >         struct sk_buff          *next_frag;
> >         int                     frag_run_len;
> > +       int                     ip_defrag_offset;
> >  };
> > 
> >  #define FRAG_CB(skb)           ((struct ipfrag_skb_cb *)((skb)->cb))
> > @@ -396,12 +399,12 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
> >          */
> >         if (!last)
> >                 fragrun_create(q, skb);  /* First fragment. */
> > -       else if (last->ip_defrag_offset + last->len < end) {
> > +       else if (FRAG_CB(last)->ip_defrag_offset + last->len < end) {
> >                 /* This is the common case: skb goes to the end. */
> >                 /* Detect and discard overlaps. */
> > -               if (offset < last->ip_defrag_offset + last->len)
> > +               if (offset < FRAG_CB(last)->ip_defrag_offset + last->len)
> >                         return IPFRAG_OVERLAP;
> > -               if (offset == last->ip_defrag_offset + last->len)
> > +               if (offset == FRAG_CB(last)->ip_defrag_offset + last->len)
> >                         fragrun_append_to_last(q, skb);
> >                 else
> >                         fragrun_create(q, skb);
> > @@ -418,13 +421,13 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
> > 
> >                         parent = *rbn;
> >                         curr = rb_to_skb(parent);
> > -                       curr_run_end = curr->ip_defrag_offset +
> > +                       curr_run_end = FRAG_CB(curr)->ip_defrag_offset +
> >                                         FRAG_CB(curr)->frag_run_len;
> > -                       if (end <= curr->ip_defrag_offset)
> > +                       if (end <= FRAG_CB(curr)->ip_defrag_offset)
> >                                 rbn = &parent->rb_left;
> >                         else if (offset >= curr_run_end)
> >                                 rbn = &parent->rb_right;
> > -                       else if (offset >= curr->ip_defrag_offset &&
> > +                       else if (offset >= FRAG_CB(curr)->ip_defrag_offset &&
> >                                  end <= curr_run_end)
> >                                 return IPFRAG_DUP;
> >                         else
> > @@ -438,23 +441,39 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
> >                 rb_insert_color(&skb->rbnode, &q->rb_fragments);
> >         }
> > 
> > -       skb->ip_defrag_offset = offset;
> > +       FRAG_CB(skb)->ip_defrag_offset = offset;
> > 
> >         return IPFRAG_OK;
> >  }
> >  EXPORT_SYMBOL(inet_frag_queue_insert);
> > 
> > +void tcp_wfree(struct sk_buff *skb);
> 
> Thanks a lot Florian for looking at this !
> 
> Since you had : #include "../core/sock_destructor.h", perhaps the line
> can be removed,
> because it includes <net/tcp.h>

I think Florian will not able to reply for a few days.

Since the issue looks ancient and we are early in the cycle, I guess
there are no problems with that.

Cheers,

Paolo






[Index of Archives]     [Netfitler Users]     [Berkeley Packet Filter]     [LARTC]     [Bugtraq]     [Yosemite Forum]

  Powered by Linux