Re: [PATCH net] inet: inet_defrag: prevent sk release while still in use

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Mar 19, 2024 at 12:36 PM Florian Westphal <fw@xxxxxxxxx> wrote:
>
> ip_local_out() and other functions can pass skb->sk as function argument.
>
> If the skb is a fragment and reassembly happens before such function call
> returns, the sk must not be released.
>
> This affects skb fragments reassembled via netfilter or similar
> modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline.
>
> Eric Dumazet made an initial analysis of this bug.  Quoting Eric:
>   Calling ip_defrag() in output path is also implying skb_orphan(),
>   which is buggy because output path relies on sk not disappearing.
>
>   A relevant old patch about the issue was :
>   8282f27449bf ("inet: frag: Always orphan skbs inside ip_defrag()")
>
>   [..]
>
>   net/ipv4/ip_output.c depends on skb->sk being set, and probably to an
>   inet socket, not an arbitrary one.
>
>   If we orphan the packet in ipvlan, then downstream things like FQ
>   packet scheduler will not work properly.
>
>   We need to change ip_defrag() to only use skb_orphan() when really
>   needed, ie whenever frag_list is going to be used.
>
> Eric suggested to stash sk in fragment queue and made an initial patch.
> However there is a problem with this:
>
> If skb is refragmented again right after, ip_do_fragment() will copy
> head->sk to the new fragments, and sets up destructor to sock_wfree.
> IOW, we have no choice but to fix up sk_wmem accouting to reflect the
> fully reassembled skb, else wmem will underflow.
>
> This change moves the orphan down into the core, to last possible moment.
> As ip_defrag_offset is aliased with sk_buff->sk member, we must move the
> offset into the FRAG_CB, else skb->sk gets clobbered.
>
> This allows to delay the orphaning long enough to learn if the skb has
> to be queued or if the skb is completing the reasm queue.
>
> In the former case, things work as before, skb is orphaned.  This is
> safe because skb gets queued/stolen and won't continue past reasm engine.
>
> In the latter case, we will steal the skb->sk reference, reattach it to
> the head skb, and fix up wmem accouting when inet_frag inflates truesize.
>
> Diagnosed-by: Eric Dumazet <edumazet@xxxxxxxxxx>
> Reported-by: xingwei lee <xrivendell7@xxxxxxxxx>
> Reported-by: yue sun <samsun1006219@xxxxxxxxx>
> Reported-by: syzbot+e5167d7144a62715044c@xxxxxxxxxxxxxxxxxxxxxxxxx
> Signed-off-by: Florian Westphal <fw@xxxxxxxxx>
> ---
>  include/linux/skbuff.h                  |  7 +--
>  net/ipv4/inet_fragment.c                | 71 ++++++++++++++++++++-----
>  net/ipv4/ip_fragment.c                  |  2 +-
>  net/ipv6/netfilter/nf_conntrack_reasm.c |  2 +-
>  4 files changed, 61 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 7d56ce195120..6d08ff8a9357 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -753,8 +753,6 @@ typedef unsigned char *sk_buff_data_t;
>   *     @list: queue head
>   *     @ll_node: anchor in an llist (eg socket defer_list)
>   *     @sk: Socket we are owned by
> - *     @ip_defrag_offset: (aka @sk) alternate use of @sk, used in
> - *             fragmentation management
>   *     @dev: Device we arrived on/are leaving by
>   *     @dev_scratch: (aka @dev) alternate use of @dev when @dev would be %NULL
>   *     @cb: Control buffer. Free for use by every layer. Put private vars here
> @@ -875,10 +873,7 @@ struct sk_buff {
>                 struct llist_node       ll_node;
>         };
>
> -       union {
> -               struct sock             *sk;
> -               int                     ip_defrag_offset;
> -       };
> +       struct sock             *sk;
>
>         union {
>                 ktime_t         tstamp;
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 7072fc0783ef..7254b640ba06 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -24,6 +24,8 @@
>  #include <net/ip.h>
>  #include <net/ipv6.h>
>
> +#include "../core/sock_destructor.h"
> +
>  /* Use skb->cb to track consecutive/adjacent fragments coming at
>   * the end of the queue. Nodes in the rb-tree queue will
>   * contain "runs" of one or more adjacent fragments.
> @@ -39,6 +41,7 @@ struct ipfrag_skb_cb {
>         };
>         struct sk_buff          *next_frag;
>         int                     frag_run_len;
> +       int                     ip_defrag_offset;
>  };
>
>  #define FRAG_CB(skb)           ((struct ipfrag_skb_cb *)((skb)->cb))
> @@ -396,12 +399,12 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
>          */
>         if (!last)
>                 fragrun_create(q, skb);  /* First fragment. */
> -       else if (last->ip_defrag_offset + last->len < end) {
> +       else if (FRAG_CB(last)->ip_defrag_offset + last->len < end) {
>                 /* This is the common case: skb goes to the end. */
>                 /* Detect and discard overlaps. */
> -               if (offset < last->ip_defrag_offset + last->len)
> +               if (offset < FRAG_CB(last)->ip_defrag_offset + last->len)
>                         return IPFRAG_OVERLAP;
> -               if (offset == last->ip_defrag_offset + last->len)
> +               if (offset == FRAG_CB(last)->ip_defrag_offset + last->len)
>                         fragrun_append_to_last(q, skb);
>                 else
>                         fragrun_create(q, skb);
> @@ -418,13 +421,13 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
>
>                         parent = *rbn;
>                         curr = rb_to_skb(parent);
> -                       curr_run_end = curr->ip_defrag_offset +
> +                       curr_run_end = FRAG_CB(curr)->ip_defrag_offset +
>                                         FRAG_CB(curr)->frag_run_len;
> -                       if (end <= curr->ip_defrag_offset)
> +                       if (end <= FRAG_CB(curr)->ip_defrag_offset)
>                                 rbn = &parent->rb_left;
>                         else if (offset >= curr_run_end)
>                                 rbn = &parent->rb_right;
> -                       else if (offset >= curr->ip_defrag_offset &&
> +                       else if (offset >= FRAG_CB(curr)->ip_defrag_offset &&
>                                  end <= curr_run_end)
>                                 return IPFRAG_DUP;
>                         else
> @@ -438,23 +441,39 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
>                 rb_insert_color(&skb->rbnode, &q->rb_fragments);
>         }
>
> -       skb->ip_defrag_offset = offset;
> +       FRAG_CB(skb)->ip_defrag_offset = offset;
>
>         return IPFRAG_OK;
>  }
>  EXPORT_SYMBOL(inet_frag_queue_insert);
>
> +void tcp_wfree(struct sk_buff *skb);

Thanks a lot Florian for looking at this !

Since you had : #include "../core/sock_destructor.h", perhaps the line
can be removed,
because it includes <net/tcp.h>





[Index of Archives]     [Netfitler Users]     [Berkeley Packet Filter]     [LARTC]     [Bugtraq]     [Yosemite Forum]

  Powered by Linux