Re: [PATCH net v2] ipv4, ipv6: Fix handling of transhdrlen in __ip{,6}_append_data()

Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> · Wed, 20 Sep 2023 21:41:38 -0400




On Wed, Sep 20, 2023 at 9:54 AM Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:
>
> David Howells wrote:
> > Including the transhdrlen in length is a problem when the packet is
> > partially filled (e.g. something like send(MSG_MORE) happened previously)
> > when appending to an IPv4 or IPv6 packet as we don't want to repeat the
> > transport header or account for it twice.  This can happen under some
> > circumstances, such as splicing into an L2TP socket.
> >
> > The symptom observed is a warning in __ip6_append_data():
> >
> >     WARNING: CPU: 1 PID: 5042 at net/ipv6/ip6_output.c:1800 __ip6_append_data.isra.0+0x1be8/0x47f0 net/ipv6/ip6_output.c:1800
> >
> > that occurs when MSG_SPLICE_PAGES is used to append more data to an already
> > partially occupied skbuff.  The warning occurs when 'copy' is larger than
> > the amount of data in the message iterator.  This is because the requested
> > length includes the transport header length when it shouldn't.  This can be
> > triggered by, for example:
> >
> >         sfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
> >         bind(sfd, ...); // ::1
> >         connect(sfd, ...); // ::1 port 7
> >         send(sfd, buffer, 4100, MSG_MORE);
> >         sendfile(sfd, dfd, NULL, 1024);
> >
> > Fix this by deducting transhdrlen from length in ip{,6}_append_data() right
> > before we clear transhdrlen if there is already a packet that we're going
> > to try appending to.
> >
> > Reported-by: syzbot+62cbf263225ae13ff153@xxxxxxxxxxxxxxxxxxxxxxxxx
> > Link: https://lore.kernel.org/r/0000000000001c12b30605378ce8@xxxxxxxxxx/
> > Signed-off-by: David Howells <dhowells@xxxxxxxxxx>
> > cc: Eric Dumazet <edumazet@xxxxxxxxxx>
> > cc: Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx>
> > cc: "David S. Miller" <davem@xxxxxxxxxxxxx>
> > cc: David Ahern <dsahern@xxxxxxxxxx>
> > cc: Paolo Abeni <pabeni@xxxxxxxxxx>
> > cc: Jakub Kicinski <kuba@xxxxxxxxxx>
> > cc: netdev@xxxxxxxxxxxxxxx
> > cc: bpf@xxxxxxxxxxxxxxx
> > cc: syzkaller-bugs@xxxxxxxxxxxxxxxx
> > Link: https://lore.kernel.org/r/75315.1695139973@xxxxxxxxxxxxxxxxxxxxxx/ # v1
> > ---
> >  net/ipv4/ip_output.c  |    1 +
> >  net/ipv6/ip6_output.c |    1 +
> >  2 files changed, 2 insertions(+)
> >
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 4ab877cf6d35..9646f2d9afcf 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -1354,6 +1354,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
> >               if (err)
> >                       return err;
> >       } else {
> > +             length -= transhdrlen;
> >               transhdrlen = 0;
> >       }
> >
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index 54fc4c711f2c..6a4ce7f622e9 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -1888,6 +1888,7 @@ int ip6_append_data(struct sock *sk,
> >               length += exthdrlen;
> >               transhdrlen += exthdrlen;
> >       } else {
> > +             length -= transhdrlen;
> >               transhdrlen = 0;
> >       }
> >
>
> Definitely a much simpler patch, thanks.
>
> So the current model is that callers with non-zero transhdrlen always
> pass to __ip_append_data payload length + transhdrlen.
>
> I do see that udp does this: ulen += sizeof(struct udphdr); This calls
> ip_make_skb if not corked, but directly ip_append_data if corked.
>
> Then __ip_append_data will use transhdrlen in its packet calculations,
> and reset that to zero after allocating the first new skb.
>
> So if corked *and* fragmentation, which would cause a new skb to be
> allocated, the next skb would incorrectly reserve udp header space,
> because the second __ip_append_data call will again pass transhdrlen.
> If so, then this patch fixes that. But that has never been reported,
> so I'm most likely misreading some part..

This works today because udp only includes transhdrlen if not corked.
In udpv6_sendmsg:

        if (up->pending) {
                       ...
                       goto do_append_data;
        }
        ulen += sizeof(struct udphdr);

So once data is pending, ip6_append_data is called with ulen == len.
Subtracting transhdrlen (which is still sizeof(udphdr)) from that length
would not be correct.
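
To make the arithmetic concrete, here is a minimal userspace sketch of the
length accounting described above. It is not kernel code: udp_append_len()
and v2_patch_append_len() are hypothetical helpers that only model which
length udpv6_sendmsg hands to ip6_append_data, and what the proposed
"length -= transhdrlen" would then do to it on the append path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define UDP_HDR_LEN 8	/* sizeof(struct udphdr) */

/* Hypothetical model: the length a UDP-like caller passes down.
 * With data already pending, udpv6_sendmsg jumps to do_append_data
 * before the header size is added, so only the payload is passed.
 */
static size_t udp_append_len(size_t len, bool pending)
{
	if (pending)
		return len;
	return len + UDP_HDR_LEN;
}

/* Hypothetical model of the v2 patch on the non-empty-queue path:
 * the transport header length is deducted from whatever was passed.
 */
static size_t v2_patch_append_len(size_t passed_len, size_t transhdrlen)
{
	assert(passed_len >= transhdrlen);	/* guard against underflow */
	return passed_len - transhdrlen;
}
```

For a corked 1024-byte append, udp_append_len(1024, true) is already 1024,
so v2_patch_append_len(1024, UDP_HDR_LEN) would come out 8 bytes short of
the payload.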

l2tp_ip6_sendmsg more or less follows udpv6_sendmsg, but it
unconditionally sets ulen = len + transhdrlen. So maybe the fix is in
L2TP:

+++ b/net/l2tp/l2tp_ip6.c
@@ -507,7 +507,6 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
         */
        if (len > INT_MAX - transhdrlen)
                return -EMSGSIZE;
-       ulen = len + transhdrlen;

        /* Mirror BSD error message compatibility */
        if (msg->msg_flags & MSG_OOB)
@@ -628,6 +627,7 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)

 back_from_confirm:
        lock_sock(sk);
+       ulen = len + (skb_queue_empty(&sk->sk_write_queue) ? transhdrlen : 0);

As said, only raw, udp and l2tp can possibly pass MSG_MORE and so cause
secondary invocations of ip6_append_data for the same send. With raw
passing transhdrlen 0, and udp as discussed above, we only have to
consider l2tp.
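
A rough userspace sketch of the intended L2TP length computation, in case
it helps. The function names and the bool standing in for
skb_queue_empty(&sk->sk_write_queue) are hypothetical; nothing here is the
actual l2tp code. The second helper spells out how the expression groups
without parentheses around the conditional, since + binds tighter than ?:
in C.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the proposed ulen: add the transport header
 * length only when nothing is queued yet, i.e. on the first send.
 */
static size_t l2tp_ulen(size_t len, size_t transhdrlen, bool queue_empty)
{
	/* mirrors the existing "len > INT_MAX - transhdrlen" check */
	assert(len <= (size_t)INT_MAX - transhdrlen);
	return len + (queue_empty ? transhdrlen : 0);
}

/* How "len + queue_empty ? transhdrlen : 0" groups without the inner
 * parentheses: the sum becomes the condition, and the payload length
 * is lost from the result.
 */
static size_t l2tp_ulen_as_parsed(size_t len, size_t transhdrlen,
				  bool queue_empty)
{
	return (len + queue_empty) ? transhdrlen : 0;
}
```

Taking the reproducer's sizes and an illustrative transhdrlen of 4: the
initial send(..., 4100, MSG_MORE) gives l2tp_ulen(4100, 4, true) == 4104,
and the following 1024-byte sendfile gives l2tp_ulen(1024, 4, false) ==
1024. The unparenthesised grouping would instead yield 4 for that second
call.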