On 29/05/2024 21:52, Sergey Shtylyov wrote:
> On 5/28/24 6:03 PM, Paul Barker wrote:
>
>> This patch makes multiple changes that can't be separated:
>>
>>   1) Allocate plain RX buffers via a page pool instead of allocating
>>      SKBs, then use build_skb() when a packet is received.
>>   2) For GbEth IP, reduce the RX buffer size to 2kB.
>>   3) For GbEth IP, merge packets which span more than one RX descriptor
>>      as SKB fragments instead of copying data.
>>
>> Implementing (1) without (2) would require the use of an order-1 page
>> pool (instead of an order-0 page pool split into page fragments) for
>> GbEth.
>>
>> Implementing (2) without (3) would leave us no space to re-assemble
>> packets which span more than one RX descriptor.
>>
>> Implementing (3) without (1) would not be possible as the network stack
>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>> after an SKB is consumed.
>>
>> RX checksum offload support is adjusted to handle both linear and
>> nonlinear (fragmented) packets.
>>
>> This patch gives the following improvements during testing with iperf3.
>>
>>   * RZ/G2L:
>>     * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>>     * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>
>>   * RZ/G2UL:
>>     * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>>     * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>
>>   * RZ/G3S:
>>     * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>>     * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>
>>   * RZ/Five:
>>     * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>>     * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>
>> There is no significant impact on bandwidth or CPU load in testing on
>> RZ/G2H or R-Car M3N.
>>
>> Signed-off-by: Paul Barker <paul.barker.ct@xxxxxxxxxxxxxx>
>> ---
>> Changes v3->v4:
>>   * Used a separate page pool for each RX queue.
>>   * Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can
>>     simplify the calling function.
>>   * Explained the calculation of rx_desc->ds_cc.
>>   * Added handling of nonlinear SKBs in ravb_rx_csum_gbeth().
>>
>>  drivers/net/ethernet/renesas/ravb.h      |  10 +-
>>  drivers/net/ethernet/renesas/ravb_main.c | 230 ++++++++++++++---------
>>  2 files changed, 146 insertions(+), 94 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
>> index 6a7aa7dd17e6..f2091a17fcf7 100644
>> --- a/drivers/net/ethernet/renesas/ravb.h
>> +++ b/drivers/net/ethernet/renesas/ravb.h
> [...]
>> @@ -1094,7 +1099,8 @@ struct ravb_private {
>>  	struct ravb_tx_desc *tx_ring[NUM_TX_QUEUE];
>>  	void *tx_align[NUM_TX_QUEUE];
>>  	struct sk_buff *rx_1st_skb;
>> -	struct sk_buff **rx_skb[NUM_RX_QUEUE];
>> +	struct page_pool *rx_pool[NUM_RX_QUEUE];
>
>    Don't we need #include <net/page_pool/types.h>

Yes. I got away with it as ravb_main.c includes
<net/page_pool/helpers.h> before including "ravb.h", but the header
shouldn't assume that.

>
> [...]
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>> index dd92f074881a..bb7f7d44be6e 100644
>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> [...]
>> @@ -317,35 +289,56 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>>  	priv->tx_skb[q] = NULL;
>>  }
>>
>> +static int
>> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
>> +		     struct ravb_rx_desc *rx_desc)
>> +{
>> +	struct ravb_private *priv = netdev_priv(ndev);
>> +	const struct ravb_hw_info *info = priv->info;
>> +	struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> +	dma_addr_t dma_addr;
>> +	unsigned int size;
>> +
>> +	size = info->rx_buffer_size;
>> +	rx_buff->page = page_pool_alloc(priv->rx_pool[q], &rx_buff->offset, &size,
>> +					gfp_mask);
>> +	if (unlikely(!rx_buff->page)) {
>> +		/* We just set the data size to 0 for a failed mapping
>> +		 * which should prevent DMA from happening...
>> +		 */
>> +		rx_desc->ds_cc = cpu_to_le16(0);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
>> +	dma_sync_single_for_device(ndev->dev.parent, dma_addr,
>> +				   info->rx_buffer_size, DMA_FROM_DEVICE);
>
>    Do we really need this call?

Looking at .config I see CONFIG_DMA_NEED_SYNC=y so yes I think this is
needed.

>
>> +	rx_desc->dptr = cpu_to_le32(dma_addr);
>> +
>> +	/* The end of the RX buffer is used to store skb shared data, so we need
>> +	 * to ensure that the hardware leaves enough space for this.
>> +	 */
>> +	rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
>> +				     - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
>
>    Please leave the - operator on the previous line...

Ack.

>
>> +				     - ETH_FCS_LEN + sizeof(__sum16));
>
>    Here as well...

Ack.

>
>> +	return 0;
>> +}
>> +
>>  static u32
>>  ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
>>  {
>>  	struct ravb_private *priv = netdev_priv(ndev);
>> -	const struct ravb_hw_info *info = priv->info;
>>  	struct ravb_rx_desc *rx_desc;
>> -	dma_addr_t dma_addr;
>>  	u32 i, entry;
>>
>>  	for (i = 0; i < count; i++) {
>>  		entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
>>  		rx_desc = ravb_rx_get_desc(priv, q, entry);
>> -		rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
>>
>> -		if (!priv->rx_skb[q][entry]) {
>> -			priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
>> -			if (!priv->rx_skb[q][entry])
>> +		if (!priv->rx_buffers[q][entry].page) {
>> +			if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry,
>
>    Well, IIRC Greg KH is against using unlikely() unless you have actually
> instrumented the code and this gives an improvement... have you? :-)

My understanding was that we should use unlikely() for error checking in
hot code paths where we want the "good" path to be optimised. I can drop
this if I'm wrong though.

>
> [...]
>> @@ -727,12 +739,22 @@ static void ravb_rx_csum_gbeth(struct sk_buff *skb)
>>  	if (unlikely(skb->len < sizeof(__sum16) * 2))
>>  		return;
>>
>> -	hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
>> +	if (skb_is_nonlinear(skb)) {
>> +		last_frag = &shinfo->frags[shinfo->nr_frags - 1];
>> +		hw_csum = skb_frag_address(last_frag) + skb_frag_size(last_frag) - sizeof(__sum16);
>> +	} else {
>> +		hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
>> +	}
>
>    We can do the subtraction only once here...

Ack. I'll pull that out of the if.

>
> [...]
>> @@ -816,14 +824,26 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>>  				if (desc_status & MSC_CEEF)
>>  					stats->rx_missed_errors++;
>>  		} else {
>> +			struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> +			void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
>
>    Need an empty line here...

Ack.

>
>>  			die_dt = desc->die_dt & 0xF0;
>> -			skb = ravb_get_skb_gbeth(ndev, entry, desc);
>> +			dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
>> +						desc_len, DMA_FROM_DEVICE);
>> +
>>  			switch (die_dt) {
>>  			case DT_FSINGLE:
>>  			case DT_FSTART:
>>  				/* Start of packet:
>> -				 * Set initial data length.
>> +				 * Prepare an SKB and add initial data.
>
>    I'd prefer calling it skb in the comments...

Ack.

>
> [...]
>> @@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>>  				stats->rx_bytes += skb->len;
>>  				napi_gro_receive(&priv->napi[q], skb);
>>  				rx_packets++;
>> +
>> +				/* Clear rx_1st_skb so that it will only be
>> +				 * non-NULL when valid.
>> +				 */
>> +				if (die_dt == DT_FEND)
>> +					priv->rx_1st_skb = NULL;
>
>    Hm, can't we do this under *case* DT_FEND above?

It makes more logical sense to me to do this as the last step, but I
guess it's a little more optimal to do it earlier. I'll move it.

Thanks,

-- 
Paul Barker
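
For readers following the rx_desc->ds_cc discussion in this message, the
arithmetic can be sketched in user space roughly as below. This is only an
illustration and not code from the patch. ETH_FCS_LEN (4) and
sizeof(__sum16) (2) are the usual kernel values, but CACHE_LINE_BYTES and
SHINFO_BYTES are assumed stand-ins for SMP_CACHE_BYTES and
sizeof(struct skb_shared_info), both of which depend on the architecture
and kernel configuration. The helper names data_align() and rx_desc_size()
are hypothetical.

```c
#include <assert.h>

/* Assumed values for illustration only; the real ones depend on the
 * kernel configuration and architecture.
 */
#define CACHE_LINE_BYTES 64u   /* stand-in for SMP_CACHE_BYTES */
#define SHINFO_BYTES     320u  /* stand-in for sizeof(struct skb_shared_info) */
#define FCS_BYTES        4u    /* ETH_FCS_LEN */
#define HW_CSUM_BYTES    2u    /* sizeof(__sum16) */

/* Round len up to a multiple of the cache line size, which is what
 * SKB_DATA_ALIGN() does in the kernel.
 */
static unsigned int data_align(unsigned int len)
{
	return (len + CACHE_LINE_BYTES - 1) & ~(CACHE_LINE_BYTES - 1);
}

/* Same shape as the expression quoted from the patch:
 * rx_buffer_size - SKB_DATA_ALIGN(shinfo) - ETH_FCS_LEN + sizeof(__sum16)
 */
static unsigned int rx_desc_size(unsigned int rx_buffer_size)
{
	unsigned int tail = data_align(SHINFO_BYTES);

	/* The buffer must be large enough to hold the reserved tail. */
	assert(rx_buffer_size > tail + FCS_BYTES);
	return rx_buffer_size - tail - FCS_BYTES + HW_CSUM_BYTES;
}
```

With the assumed sizes above and the 2kB GbEth buffer mentioned in the
commit message, rx_desc_size(2048) evaluates to 2048 - 320 - 4 + 2 = 1726.
The real figure will differ wherever the shared info structure or cache
line size differs.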
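
The review comment on ravb_rx_csum_gbeth() ("do the subtraction only once")
can be illustrated with a small user-space sketch. It is not the code that
was eventually submitted. struct rx_view and hw_csum_ptr() are hypothetical
stand-ins for the skb, its last fragment and the pointer calculation, and
uint16_t stands in for __sum16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a received packet. */
struct rx_view {
	const uint8_t *linear_end; /* end of linear data, like skb_tail_pointer() */
	const uint8_t *frag_start; /* start of the last fragment, NULL if linear */
	size_t frag_len;           /* length of the last fragment */
};

/* Locate the hardware checksum stored in the last two bytes of the packet. */
static const uint8_t *hw_csum_ptr(const struct rx_view *v)
{
	const uint8_t *end;

	/* Each branch only finds the end of the packet data... */
	if (v->frag_start)
		end = v->frag_start + v->frag_len;
	else
		end = v->linear_end;

	/* ...so the subtraction is written once, after the if. */
	return end - sizeof(uint16_t);
}
```

Both branches now compute only where the data ends, which keeps the
checksum offset in a single place if its size ever changes.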