On Tue, 4 Aug 2020 at 11:24, Stefano Brivio <sbrivio@xxxxxxxxxx> wrote: > > It's currently possible to bridge Ethernet tunnels carrying IP > packets directly to external interfaces without assigning them > addresses and routes on the bridged network itself: this is the case > for UDP tunnels bridged with a standard bridge or by Open vSwitch. > > PMTU discovery is currently broken with those configurations, because > the encapsulation effectively decreases the MTU of the link, and > while we are able to account for this using PMTU discovery on the > lower layer, we don't have a way to relay ICMP or ICMPv6 messages > needed by the sender, because we don't have valid routes to it. > > On the other hand, as a tunnel endpoint, we can't fragment packets > as a general approach: this is for instance clearly forbidden for > VXLAN by RFC 7348, section 4.3: > > VTEPs MUST NOT fragment VXLAN packets. Intermediate routers may > fragment encapsulated VXLAN packets due to the larger frame size. > The destination VTEP MAY silently discard such VXLAN fragments. > > The same paragraph recommends that the MTU over the physical network > accomodates for encapsulations, but this isn't a practical option for > complex topologies, especially for typical Open vSwitch use cases. > > Further, it states that: > > Other techniques like Path MTU discovery (see [RFC1191] and > [RFC1981]) MAY be used to address this requirement as well. > > Now, PMTU discovery already works for routed interfaces, we get > route exceptions created by the encapsulation device as they receive > ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and > we already rebuild those messages with the appropriate MTU and route > them back to the sender. > > Add the missing bits for bridged cases: > > - checks in skb_tunnel_check_pmtu() to understand if it's appropriate > to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and > RFC 4443 section 2.4 for ICMPv6. This function is already called by > UDP tunnels > > - a new function generating those ICMP or ICMPv6 replies. We can't > reuse icmp_send() and icmp6_send() as we don't see the sender as a > valid destination. This doesn't need to be generic, as we don't > cover any other type of ICMP errors given that we only provide an > encapsulation function to the sender > > While at it, make the MTU check in skb_tunnel_check_pmtu() accurate: > we might receive GSO buffers here, and the passed headroom already > includes the inner MAC length, so we don't have to account for it > a second time (that would imply three MAC headers on the wire, but > there are just two). > > This issue became visible while bridging IPv6 packets with 4500 bytes > of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50 > bytes of encapsulation headroom, we would advertise MTU as 3950, and > we would reject fragmented IPv6 datagrams of 3958 bytes size on the > wire. We're exclusively dealing with network MTU here, though, so we > could get Ethernet frames up to 3964 octets in that case. > > v2: > - moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern) > - split IPv4/IPv6 functions (David Ahern) > > Signed-off-by: Stefano Brivio <sbrivio@xxxxxxxxxx> > --- > drivers/net/bareudp.c | 5 +- > drivers/net/geneve.c | 5 +- > drivers/net/vxlan.c | 4 +- > include/net/dst.h | 10 -- > include/net/ip_tunnels.h | 2 + > net/ipv4/ip_tunnel_core.c | 244 ++++++++++++++++++++++++++++++++++++++ > 6 files changed, 254 insertions(+), 16 deletions(-) > > diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c > index 3b6664c7e73c..841910f1db65 100644 > --- a/drivers/net/bareudp.c > +++ b/drivers/net/bareudp.c > @@ -308,7 +308,7 @@ static int bareudp_xmit_skb(struct sk_buff *skb, struct net_device *dev, > return PTR_ERR(rt); > > skb_tunnel_check_pmtu(skb, &rt->dst, > - BAREUDP_IPV4_HLEN + info->options_len); > + BAREUDP_IPV4_HLEN + info->options_len, false); > > sport = udp_flow_src_port(bareudp->net, skb, > bareudp->sport_min, USHRT_MAX, > @@ -369,7 +369,8 @@ static int bareudp6_xmit_skb(struct sk_buff *skb, struct net_device *dev, > if (IS_ERR(dst)) > return PTR_ERR(dst); > > - skb_tunnel_check_pmtu(skb, dst, BAREUDP_IPV6_HLEN + info->options_len); > + skb_tunnel_check_pmtu(skb, dst, BAREUDP_IPV6_HLEN + info->options_len, > + false); > > sport = udp_flow_src_port(bareudp->net, skb, > bareudp->sport_min, USHRT_MAX, > diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c > index 017c13acc911..de86b6d82132 100644 > --- a/drivers/net/geneve.c > +++ b/drivers/net/geneve.c > @@ -894,7 +894,7 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, > return PTR_ERR(rt); > > skb_tunnel_check_pmtu(skb, &rt->dst, > - GENEVE_IPV4_HLEN + info->options_len); > + GENEVE_IPV4_HLEN + info->options_len, false); > > sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true); > if (geneve->cfg.collect_md) { > @@ -955,7 +955,8 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, > if (IS_ERR(dst)) > return PTR_ERR(dst); > > - skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len); > + skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len, > + false); > > sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true); > if (geneve->cfg.collect_md) { > diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c > index 77658425db8a..1432544da507 100644 > --- a/drivers/net/vxlan.c > +++ b/drivers/net/vxlan.c > @@ -2720,7 +2720,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, > } > > ndst = &rt->dst; > - skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM); > + skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM, false); > > tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb); > ttl = ttl ? : ip4_dst_hoplimit(&rt->dst); > @@ -2760,7 +2760,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, > goto out_unlock; > } > > - skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM); > + skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM, false); > > tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb); > ttl = ttl ? : ip6_dst_hoplimit(ndst); > diff --git a/include/net/dst.h b/include/net/dst.h > index 852d8fb36ab7..6ae2e625050d 100644 > --- a/include/net/dst.h > +++ b/include/net/dst.h > @@ -535,14 +535,4 @@ static inline void skb_dst_update_pmtu_no_confirm(struct sk_buff *skb, u32 mtu) > dst->ops->update_pmtu(dst, NULL, skb, mtu, false); > } > > -static inline void skb_tunnel_check_pmtu(struct sk_buff *skb, > - struct dst_entry *encap_dst, > - int headroom) > -{ > - u32 encap_mtu = dst_mtu(encap_dst); > - > - if (skb->len > encap_mtu - headroom) > - skb_dst_update_pmtu_no_confirm(skb, encap_mtu - headroom); > -} > - > #endif /* _NET_DST_H */ > diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h > index 36025dea7612..02ccd32542d0 100644 > --- a/include/net/ip_tunnels.h > +++ b/include/net/ip_tunnels.h > @@ -420,6 +420,8 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb, > u8 tos, u8 ttl, __be16 df, bool xnet); > struct metadata_dst *iptunnel_metadata_reply(struct metadata_dst *md, > gfp_t flags); > +int skb_tunnel_check_pmtu(struct sk_buff *skb, struct dst_entry *encap_dst, > + int headroom, bool reply); > > int iptunnel_handle_offloads(struct sk_buff *skb, int gso_type_mask); > > diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c > index f8b419e2475c..9ddee2a0c66d 100644 > --- a/net/ipv4/ip_tunnel_core.c > +++ b/net/ipv4/ip_tunnel_core.c > @@ -184,6 +184,250 @@ int iptunnel_handle_offloads(struct sk_buff *skb, > } > EXPORT_SYMBOL_GPL(iptunnel_handle_offloads); > > +/** > + * iptunnel_pmtud_build_icmp() - Build ICMP error message for PMTUD > + * @skb: Original packet with L2 header > + * @mtu: MTU value for ICMP error > + * > + * Return: length on success, negative error code if message couldn't be built. > + */ > +static int iptunnel_pmtud_build_icmp(struct sk_buff *skb, int mtu) > +{ > + const struct iphdr *iph = ip_hdr(skb); > + struct icmphdr *icmph; > + struct iphdr *niph; > + struct ethhdr eh; > + int len, err; > + > + if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct iphdr))) > + return -EINVAL; > + > + skb_copy_bits(skb, skb_mac_offset(skb), &eh, ETH_HLEN); > + pskb_pull(skb, ETH_HLEN); > + skb_reset_network_header(skb); > + > + err = pskb_trim(skb, 576 - sizeof(*niph) - sizeof(*icmph)); > + if (err) > + return err; > + > + len = skb->len + sizeof(*icmph); > + err = skb_cow(skb, sizeof(*niph) + sizeof(*icmph) + ETH_HLEN); > + if (err) > + return err; > + > + icmph = skb_push(skb, sizeof(*icmph)); > + *icmph = (struct icmphdr) { > + .type = ICMP_DEST_UNREACH, > + .code = ICMP_FRAG_NEEDED, > + .checksum = 0, > + .un.frag.__unused = 0, > + .un.frag.mtu = ntohs(mtu), > + }; > + icmph->checksum = ip_compute_csum(icmph, len); > + skb_reset_transport_header(skb); > + > + niph = skb_push(skb, sizeof(*niph)); > + *niph = (struct iphdr) { > + .ihl = sizeof(*niph) / 4u, > + .version = 4, > + .tos = 0, > + .tot_len = htons(len + sizeof(*niph)), > + .id = 0, > + .frag_off = htons(IP_DF), > + .ttl = iph->ttl, > + .protocol = IPPROTO_ICMP, > + .saddr = iph->daddr, > + .daddr = iph->saddr, > + }; > + ip_send_check(niph); > + skb_reset_network_header(skb); > + > + skb->ip_summed = CHECKSUM_NONE; > + > + eth_header(skb, skb->dev, htons(eh.h_proto), eh.h_source, eh.h_dest, 0); > + skb_reset_mac_header(skb); > + > + return skb->len; > +} > + > +/** > + * iptunnel_pmtud_check_icmp() - Trigger ICMP reply if needed and allowed > + * @skb: Buffer being sent by encapsulation, L2 headers expected > + * @mtu: Network MTU for path > + * > + * Return: 0 for no ICMP reply, length if built, negative value on error. > + */ > +static int iptunnel_pmtud_check_icmp(struct sk_buff *skb, int mtu) > +{ > + const struct icmphdr *icmph = icmp_hdr(skb); > + const struct iphdr *iph = ip_hdr(skb); > + > + if (mtu <= 576 || iph->frag_off != htons(IP_DF)) > + return 0; > + > + if (ipv4_is_lbcast(iph->daddr) || ipv4_is_multicast(iph->daddr) || > + ipv4_is_zeronet(iph->saddr) || ipv4_is_loopback(iph->saddr) || > + ipv4_is_lbcast(iph->saddr) || ipv4_is_multicast(iph->saddr)) > + return 0; > + > + if (iph->protocol == IPPROTO_ICMP && icmp_is_err(icmph->type)) > + return 0; > + > + return iptunnel_pmtud_build_icmp(skb, mtu); > +} > + > +#if IS_ENABLED(CONFIG_IPV6) > +/** > + * iptunnel_pmtud_build_icmpv6() - Build ICMPv6 error message for PMTUD > + * @skb: Original packet with L2 header > + * @mtu: MTU value for ICMPv6 error > + * > + * Return: length on success, negative error code if message couldn't be built. > + */ > +static int iptunnel_pmtud_build_icmpv6(struct sk_buff *skb, int mtu) > +{ > + const struct ipv6hdr *ip6h = ipv6_hdr(skb); > + struct icmp6hdr *icmp6h; > + struct ipv6hdr *nip6h; > + struct ethhdr eh; > + int len, err; > + __wsum csum; > + > + if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct ipv6hdr))) > + return -EINVAL; > + > + skb_copy_bits(skb, skb_mac_offset(skb), &eh, ETH_HLEN); > + pskb_pull(skb, ETH_HLEN); > + skb_reset_network_header(skb); > + > + err = pskb_trim(skb, IPV6_MIN_MTU - sizeof(*nip6h) - sizeof(*icmp6h)); > + if (err) > + return err; > + > + len = skb->len + sizeof(*icmp6h); > + err = skb_cow(skb, sizeof(*nip6h) + sizeof(*icmp6h) + ETH_HLEN); > + if (err) > + return err; > + > + icmp6h = skb_push(skb, sizeof(*icmp6h)); > + *icmp6h = (struct icmp6hdr) { > + .icmp6_type = ICMPV6_PKT_TOOBIG, > + .icmp6_code = 0, > + .icmp6_cksum = 0, > + .icmp6_mtu = htonl(mtu), > + }; > + skb_reset_transport_header(skb); > + > + nip6h = skb_push(skb, sizeof(*nip6h)); > + *nip6h = (struct ipv6hdr) { > + .priority = 0, > + .version = 6, > + .flow_lbl = { 0 }, > + .payload_len = htons(len), > + .nexthdr = IPPROTO_ICMPV6, > + .hop_limit = ip6h->hop_limit, > + .saddr = ip6h->daddr, > + .daddr = ip6h->saddr, > + }; > + skb_reset_network_header(skb); > + > + csum = csum_partial(icmp6h, len, 0); > + icmp6h->icmp6_cksum = csum_ipv6_magic(&nip6h->saddr, &nip6h->daddr, len, > + IPPROTO_ICMPV6, csum); Linux next build breaks for riscv architecture defconfig build. git_repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git target_arch: riscv toolchain: gcc-9 git_short_log: d15fe4ec0435 (\Add linux-next specific files for 20200805\) git_describe: next-20200805 Reported-by: Naresh Kamboju <naresh.kamboju@xxxxxxxxxx> make -sk KBUILD_BUILD_USER=TuxBuild -C/linux ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- HOSTCC=gcc CC="sccache riscv64-linux-gnu-gcc" O=build defconfig # # # make -sk KBUILD_BUILD_USER=TuxBuild -C/linux -j16 ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- HOSTCC=gcc CC="sccache riscv64-linux-gnu-gcc" O=build # ../net/ipv4/ip_tunnel_core.c: In function ‘iptunnel_pmtud_build_icmpv6’: ../net/ipv4/ip_tunnel_core.c:335:24: error: implicit declaration of function ‘csum_ipv6_magic’; did you mean ‘csum_tcpudp_magic’? [-Werror=implicit-function-declaration] 335 | icmp6h->icmp6_cksum = csum_ipv6_magic(&nip6h->saddr, &nip6h->daddr, len, | ^~~~~~~~~~~~~~~ | csum_tcpudp_magic cc1: some warnings being treated as errors make[3]: *** [../scripts/Makefile.build:283: net/ipv4/ip_tunnel_core.o] Error 1 Full build log link, https://gitlab.com/Linaro/lkft/kernel-runs/-/jobs/671651045 -- Linaro LKFT https://lkft.linaro.org