On Wed, Sep 20, 2023 at 9:54 AM Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> wrote:
>
> David Howells wrote:
> > Including the transhdrlen in length is a problem when the packet is
> > partially filled (e.g. something like send(MSG_MORE) happened previously)
> > when appending to an IPv4 or IPv6 packet as we don't want to repeat the
> > transport header or account for it twice. This can happen under some
> > circumstances, such as splicing into an L2TP socket.
> >
> > The symptom observed is a warning in __ip6_append_data():
> >
> >     WARNING: CPU: 1 PID: 5042 at net/ipv6/ip6_output.c:1800 __ip6_append_data.isra.0+0x1be8/0x47f0 net/ipv6/ip6_output.c:1800
> >
> > that occurs when MSG_SPLICE_PAGES is used to append more data to an already
> > partially occupied skbuff. The warning occurs when 'copy' is larger than
> > the amount of data in the message iterator. This is because the requested
> > length includes the transport header length when it shouldn't. This can be
> > triggered by, for example:
> >
> >         sfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
> >         bind(sfd, ...); // ::1
> >         connect(sfd, ...); // ::1 port 7
> >         send(sfd, buffer, 4100, MSG_MORE);
> >         sendfile(sfd, dfd, NULL, 1024);
> >
> > Fix this by deducting transhdrlen from length in ip{,6}_append_data() right
> > before we clear transhdrlen if there is already a packet that we're going
> > to try appending to.
> >
> > Reported-by: syzbot+62cbf263225ae13ff153@xxxxxxxxxxxxxxxxxxxxxxxxx
> > Link: https://lore.kernel.org/r/0000000000001c12b30605378ce8@xxxxxxxxxx/
> > Signed-off-by: David Howells <dhowells@xxxxxxxxxx>
> > cc: Eric Dumazet <edumazet@xxxxxxxxxx>
> > cc: Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx>
> > cc: "David S. Miller" <davem@xxxxxxxxxxxxx>
> > cc: David Ahern <dsahern@xxxxxxxxxx>
> > cc: Paolo Abeni <pabeni@xxxxxxxxxx>
> > cc: Jakub Kicinski <kuba@xxxxxxxxxx>
> > cc: netdev@xxxxxxxxxxxxxxx
> > cc: bpf@xxxxxxxxxxxxxxx
> > cc: syzkaller-bugs@xxxxxxxxxxxxxxxx
> > Link: https://lore.kernel.org/r/75315.1695139973@xxxxxxxxxxxxxxxxxxxxxx/ # v1
> > ---
> >  net/ipv4/ip_output.c  |    1 +
> >  net/ipv6/ip6_output.c |    1 +
> >  2 files changed, 2 insertions(+)
> >
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 4ab877cf6d35..9646f2d9afcf 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -1354,6 +1354,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
> >               if (err)
> >                       return err;
> >       } else {
> > +             length -= transhdrlen;
> >               transhdrlen = 0;
> >       }
> >
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index 54fc4c711f2c..6a4ce7f622e9 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -1888,6 +1888,7 @@ int ip6_append_data(struct sock *sk,
> >               length += exthdrlen;
> >               transhdrlen += exthdrlen;
> >       } else {
> > +             length -= transhdrlen;
> >               transhdrlen = 0;
> >       }
> >
>
> Definitely a much simpler patch, thanks.
>
> So the current model is that callers with non-zero transhdrlen always
> pass to __ip_append_data payload length + transhdrlen.
>
> I do see that udp does this: ulen += sizeof(struct udphdr); This calls
> ip_make_skb if not corked, but directly ip_append_data if corked.
>
> Then __ip_append_data will use transhdrlen in its packet calculations,
> and reset that to zero after allocating the first new skb.
>
> So if corked *and* fragmentation, which would cause a new skb to be
> allocated, the next skb would incorrectly reserve udp header space,
> because the second __ip_append_data call will again pass transhdrlen.
> If so, then this patch fixes that. But that has never been reported,
> so I'm most likely misreading some part..

This works today because udp only includes transhdrlen if not corked.
In udpv6_sendmsg:

        if (up->pending) {
                ...
                goto do_append_data;
        }
        ulen += sizeof(struct udphdr);

So ip6_append_data is called with ulen == len once data is pending, and
subtracting transhdrlen (which is still sizeof(udphdr)) at that point
would not be correct.

l2tp_ip6_sendmsg more or less follows udpv6_sendmsg, but it
unconditionally sets ulen = len + transhdrlen.

So maybe the fix is in L2TP:

+++ b/net/l2tp/l2tp_ip6.c
@@ -507,7 +507,6 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
         */
        if (len > INT_MAX - transhdrlen)
                return -EMSGSIZE;
-       ulen = len + transhdrlen;

        /* Mirror BSD error message compatibility */
        if (msg->msg_flags & MSG_OOB)
@@ -628,6 +627,7 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 back_from_confirm:
        lock_sock(sk);
+       ulen = len + (skb_queue_empty(&sk->sk_write_queue) ? transhdrlen : 0);

As said, only raw, udp and l2tp can possibly pass MSG_MORE and so cause
secondary invocations of ip6_append_data for the same send. With raw
passing transhdrlen 0, and udp as discussed above, we only have to
consider l2tp.
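
To make the length accounting concrete, here is a rough userspace sketch
(not kernel code) of the ways a caller can compute the length it hands to
ip6_append_data. TRANSHDRLEN, the function names and the boolean
arguments are stand-ins made up for illustration; the real code uses
up->pending and skb_queue_empty(&sk->sk_write_queue). The last function
shows how the compiler parses the ulen expression if the conditional is
not wrapped in its own parentheses, since '+' binds tighter than '?:'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport header length, standing in for
 * sizeof(struct udphdr) or the L2TP transhdrlen. */
enum { TRANSHDRLEN = 8 };

/* UDP-style caller: the header is counted only on the first call,
 * i.e. when nothing is pending on the socket yet. */
size_t udp_style_ulen(size_t len, bool pending)
{
	return pending ? len : len + TRANSHDRLEN;
}

/* l2tp_ip6_sendmsg before the proposed change: the header is counted
 * on every call, including appends to an already queued packet. */
size_t l2tp_unconditional_ulen(size_t len)
{
	return len + TRANSHDRLEN;
}

/* Proposed L2TP behaviour: count the header only when the write queue
 * is empty, which mirrors what the UDP-style caller does. */
size_t l2tp_proposed_ulen(size_t len, bool queue_empty)
{
	return len + (queue_empty ? TRANSHDRLEN : 0);
}

/* How "len + queue_empty ? TRANSHDRLEN : 0" is parsed without the
 * inner parentheses: the sum becomes the condition, so the payload
 * length is lost and only the header length (or 0) is returned. */
size_t l2tp_misparsed_ulen(size_t len, bool queue_empty)
{
	return (len + queue_empty) ? TRANSHDRLEN : 0;
}

/* Walk through the reproducer's two calls: send(4100, MSG_MORE)
 * followed by an append of 1024 bytes. */
void model_self_check(void)
{
	/* First call, queue empty: every variant that counts the
	 * header agrees on payload + header. */
	assert(l2tp_unconditional_ulen(4100) == 4100 + TRANSHDRLEN);
	assert(l2tp_proposed_ulen(4100, true) == 4100 + TRANSHDRLEN);

	/* Second call, queue not empty: the unconditional variant
	 * over-counts by TRANSHDRLEN, the proposed one does not. */
	assert(l2tp_unconditional_ulen(1024) == 1024 + TRANSHDRLEN);
	assert(l2tp_proposed_ulen(1024, false) == 1024);
	assert(udp_style_ulen(1024, true) == 1024);
}
```

In this model the second call is where the unconditional variant asks
for TRANSHDRLEN more bytes than the message iterator holds, which is the
condition the warning in __ip6_append_data() reports.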