On 5/30/24 12:21 PM, Paul Barker wrote:
[...]
>>> This patch makes multiple changes that can't be separated:
>>>
>>>  1) Allocate plain RX buffers via a page pool instead of allocating
>>>     SKBs, then use build_skb() when a packet is received.
>>>  2) For GbEth IP, reduce the RX buffer size to 2kB.
>>>  3) For GbEth IP, merge packets which span more than one RX descriptor
>>>     as SKB fragments instead of copying data.
>>>
>>> Implementing (1) without (2) would require the use of an order-1 page
>>> pool (instead of an order-0 page pool split into page fragments) for
>>> GbEth.
>>>
>>> Implementing (2) without (3) would leave us no space to re-assemble
>>> packets which span more than one RX descriptor.
>>>
>>> Implementing (3) without (1) would not be possible as the network stack
>>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>>> after an SKB is consumed.
>>>
>>> RX checksum offload support is adjusted to handle both linear and
>>> nonlinear (fragmented) packets.
>>>
>>> This patch gives the following improvements during testing with iperf3.
>>>
>>>   * RZ/G2L:
>>>     * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>>>     * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>>
>>>   * RZ/G2UL:
>>>     * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>>>     * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>>
>>>   * RZ/G3S:
>>>     * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>>>     * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>>
>>>   * RZ/Five:
>>>     * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>>>     * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>>
>>> There is no significant impact on bandwidth or CPU load in testing on
>>> RZ/G2H or R-Car M3N.
>>>
>>> Signed-off-by: Paul Barker <paul.barker.ct@xxxxxxxxxxxxxx>
[...]
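
   For anyone following the thread, here is a minimal userspace sketch of
what items (2) and (3) of the description amount to: a packet spanning
several RX descriptors is assembled by recording a reference to each 2kB
buffer, with no payload copy. It is not the driver code. Every sketch_* and
SKETCH_* name is hypothetical, the descriptor types are local stand-ins,
and the 4kB page size (giving two 2kB buffers per order-0 page) is an
assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE   4096u  /* assumption: 4kB order-0 pages */
#define SKETCH_RX_BUF_SIZE 2048u  /* 2kB RX buffer, as in item (2) */
#define SKETCH_MAX_FRAGS   8u     /* arbitrary limit for this sketch */

/* Local stand-ins for "single", "start", "middle" and "end" descriptors. */
enum sketch_desc_type {
	SKETCH_FSINGLE,
	SKETCH_FSTART,
	SKETCH_FMID,
	SKETCH_FEND,
};

/* A fragment is only a reference to a buffer, never a copy of it. */
struct sketch_frag {
	const unsigned char *data;
	size_t len;
};

struct sketch_packet {
	struct sketch_frag frags[SKETCH_MAX_FRAGS];
	size_t nr_frags;
	size_t len;
};

/*
 * Consume one RX descriptor.
 * Returns 1 when the packet is complete, 0 when more descriptors are
 * expected, and -1 on a malformed sequence or too many fragments.
 */
static int sketch_rx_one(struct sketch_packet *p, enum sketch_desc_type dt,
			 const unsigned char *buf, size_t len)
{
	assert(p != NULL && buf != NULL);

	if (dt == SKETCH_FSINGLE || dt == SKETCH_FSTART) {
		/* A new packet begins: forget any previous fragments. */
		p->nr_frags = 0;
		p->len = 0;
	} else if (p->nr_frags == 0) {
		/* Middle/end descriptor without a preceding start. */
		return -1;
	}

	if (p->nr_frags >= SKETCH_MAX_FRAGS)
		return -1;

	/* Attach the buffer by reference; the payload is not copied. */
	p->frags[p->nr_frags].data = buf;
	p->frags[p->nr_frags].len = len;
	p->nr_frags++;
	p->len += len;

	return (dt == SKETCH_FSINGLE || dt == SKETCH_FEND) ? 1 : 0;
}
```

   Feeding it a "start" descriptor covering the first half of a page and an
"end" descriptor covering part of the second half yields one packet with two
fragments that still point into the original page, which is why the buffers
have to be page-backed for the stack to be able to release them later.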
>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>>> index dd92f074881a..bb7f7d44be6e 100644
>>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
[...]
>>> +	return 0;
>>> +}
>>> +
>>>  static u32
>>>  ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
>>>  {
>>>  	struct ravb_private *priv = netdev_priv(ndev);
>>> -	const struct ravb_hw_info *info = priv->info;
>>>  	struct ravb_rx_desc *rx_desc;
>>> -	dma_addr_t dma_addr;
>>>  	u32 i, entry;
>>>
>>>  	for (i = 0; i < count; i++) {
>>>  		entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
>>>  		rx_desc = ravb_rx_get_desc(priv, q, entry);
>>> -		rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
>>>
>>> -		if (!priv->rx_skb[q][entry]) {
>>> -			priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
>>> -			if (!priv->rx_skb[q][entry])
>>> +		if (!priv->rx_buffers[q][entry].page) {
>>> +			if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry,
>>
>> Well, IIRC Greg KH is against using unlikely() unless you have actually
>> instrumented the code and this gives an improvement... have you? :-)
>
> My understanding was that we should use unlikely() for error checking in
> hot code paths where we want the "good" path to be optimised. I can drop
> this if I'm wrong though.

   OK, keep it... :-)

[...]
>>> @@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>>>  			stats->rx_bytes += skb->len;
>>>  			napi_gro_receive(&priv->napi[q], skb);
>>>  			rx_packets++;
>>> +
>>> +			/* Clear rx_1st_skb so that it will only be
>>> +			 * non-NULL when valid.
>>> +			 */
>>> +			if (die_dt == DT_FEND)
>>> +				priv->rx_1st_skb = NULL;
>>
>> Hm, can't we do this under *case* DT_FEND above?
>
> It makes more logical sense to me to do this as the last step, but I
> guess it's a little more optimal to do it earlier. I'll move it.

   Looking at it once more, we can't... unless I'm missing s/th. :-)

> Thanks,

MBR, Sergey
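
   As a footnote to the unlikely() exchange above: the kernel macro expands
to __builtin_expect(!!(x), 0), i.e. a hint to the compiler that the condition
is rarely true so that the success path can be laid out as the fall-through.
Below is a minimal userspace sketch of a refill loop using the same hint on
the allocation-failure check. The sketch_* names, the counter-based "pool"
and the ring layout are all hypothetical and only loosely mirror the quoted
hunk. It needs GCC or Clang for the builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Same expansion as the kernel's unlikely(); GCC/Clang builtin. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

#define SKETCH_RING_SIZE 4u

struct sketch_ring {
	int filled[SKETCH_RING_SIZE];  /* non-zero once an entry has a buffer */
	unsigned int dirty;            /* next entry to refill */
	unsigned int pool_left;        /* hypothetical allocator budget */
};

/* Hypothetical allocator: returns 0 on success, -1 when the pool is empty. */
static int sketch_alloc_rx_buffer(struct sketch_ring *r, unsigned int entry)
{
	if (r->pool_left == 0)
		return -1;

	r->pool_left--;
	r->filled[entry] = 1;
	return 0;
}

/* Refill up to count entries; returns how many entries were processed. */
static unsigned int sketch_ring_refill(struct sketch_ring *r, unsigned int count)
{
	unsigned int i, entry;

	assert(r != NULL);

	for (i = 0; i < count; i++) {
		entry = (r->dirty + i) % SKETCH_RING_SIZE;

		if (!r->filled[entry]) {
			/* Error check on the hot path: failure is the rare case. */
			if (sketch_unlikely(sketch_alloc_rx_buffer(r, entry)))
				break;
		}
	}

	r->dirty += i;
	return i;
}
```

   The hint does not change what the code does, only how the compiler may
arrange the branches, so whether it helps in practice is something only
measurement can show.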