On 1/7/20 9:04 AM, Eric Dumazet wrote:
>
>
> On 1/7/20 5:32 AM, RENARD Pierre-Francois wrote:
>>
>> Hello all
>>
>> I am facing an issue related to Raspberry PI 3B+ and onboard ethernet card.
>>
>> When doing a huge transfer (more than 1GB) in a row, the transfer hangs and fails after a few minutes.
>>
>>
>> I have two ways to reproduce this issue
>>
>>
>> using NFS (v3 or v4)
>>
>>     dd if=/dev/zero of=/NFSPATH/file bs=4M count=1000 status=progress
>>
>>
>> we can see that at some point dd hangs and becomes uninterruptible (no way to ctrl-c it or kill it)
>>
>> after a few minutes, dd dies and a bunch of "NFS server not responding" / "NFS server is OK" messages are seen in the journal
>>
>>
>> Using SCP
>>
>>     dd if=/dev/zero of=/tmp/file bs=4M count=1000
>>
>>     scp /tmp/file user@server:/directory
>>
>>
>> scp hangs after 1GB and after a few minutes scp fails with the message "client_loop: send disconnect: Broken pipe lost connection"
>>
>>
>>
>>
>> It appears this is a known bug related to TCP Segmentation Offload & Selective Acknowledge.
>>
>> disabling TSO (ethtool -K eth0 tso off & ethtool -K eth0 gso off) solves the issue.
>>
>> A patch has been created by the raspberry team to disable the feature by default, and it is applied by default within raspbian.
>>
>> comment from the patch :
>>
>> /* TSO seems to be having some issue with Selective Acknowledge (SACK) that
>>  * results in lost data never being retransmitted.
>>  * Disable it by default now, but adds a module parameter to enable it for
>>  * debug purposes (the full cause is not currently understood).
>>  */
>>
>>
>> For reference you can find
>>
>> a link to the issue I created yesterday : https://github.com/raspberrypi/linux/issues/3395
>>
>> links to raspberry dev team : https://github.com/raspberrypi/linux/issues/2482 & https://github.com/raspberrypi/linux/issues/2449
>>
>>
>>
>> If you need me to test things, or give you more information, I'll be pleased to help.
>>
>
>
> I doubt TSO and SACK have a serious generic bug like that.
>
> Most likely the TSO implementation on the driver/NIC has a bug.
>
> Anyway you do not provide a kernel version, I am not sure what you expect from us.
>

Oh well, drivers/net/usb/lan78xx.c is horribly buggy.

It wants linear skbs, which is likely to fail with too big packets.

And if skb linearization fails, skb is not freed, so a big memory leak happens.

Please try this patch:

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index f940dc6485e56a7e8f905082ce920f5dd83232b0..5e2d3c8c34dc8d8ac6f2ab3fd8a59dba5b348882 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2724,11 +2724,6 @@ static int lan78xx_stop(struct net_device *net)
 	return 0;
 }
 
-static int lan78xx_linearize(struct sk_buff *skb)
-{
-	return skb_linearize(skb);
-}
-
 static struct sk_buff *lan78xx_tx_prep(struct lan78xx_net *dev,
 				       struct sk_buff *skb, gfp_t flags)
 {
@@ -2740,8 +2735,10 @@ static struct sk_buff *lan78xx_tx_prep(struct lan78xx_net *dev,
 		return NULL;
 	}
 
-	if (lan78xx_linearize(skb) < 0)
+	if (skb_linearize(skb)) {
+		dev_kfree_skb_any(skb);
 		return NULL;
+	}
 
 	tx_cmd_a = (u32)(skb->len & TX_CMD_A_LEN_MASK_) | TX_CMD_A_FCS_;
 
@@ -3790,6 +3787,9 @@ static int lan78xx_probe(struct usb_interface *intf,
 	if (ret < 0)
 		goto out4;
 
+	/* since we want linear skb, avoid high-order allocations */
+	netif_set_gso_max_size(netdev, SKB_WITH_OVERHEAD(16000));
+
 	ret = register_netdev(netdev);
 	if (ret != 0) {
 		netif_err(dev, probe, netdev, "couldn't register the device\n");
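
For anyone wondering why 16000 helps: a linear skb needs one physically contiguous buffer, and the page allocator hands those out in power-of-two multiples of the page size ("orders"). The userspace sketch below is only arithmetic, not kernel code. It assumes 4 KiB pages, and the names SKETCH_PAGE_SIZE and sketch_page_order() are made up for illustration (the kernel has its own get_order()). It suggests that a buffer capped at 16000 bytes fits in an order-2 block (16 KiB), while linearizing a GSO packet near the usual 64 KiB limit would need order 4, or order 5 once any overhead is added. Such high orders are much more likely to fail on a fragmented system.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages; other page sizes would shift the results. */
#define SKETCH_PAGE_SIZE ((size_t)4096)

/*
 * Hypothetical helper: smallest order n such that
 * (SKETCH_PAGE_SIZE << n) >= bytes. It mirrors the idea behind the
 * kernel's get_order(), not its implementation.
 */
static unsigned int sketch_page_order(size_t bytes)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	assert(bytes > 0);

	/* Double the block until it is large enough to hold the buffer. */
	while (span < bytes) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Calling sketch_page_order() with 16000, 65536 and 65537 shows the jump from order 2 to orders 4 and 5. SKB_WITH_OVERHEAD(16000) subtracts the aligned size of struct skb_shared_info, so the packet data plus that trailer stay within 16000 bytes.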