> On Fri, Jan 24, 2025 at 06:07:01PM +0530, Basharath Hussain Khaja wrote:
>> From: Roger Quadros <rogerq@xxxxxx>
>>
>> Changes corresponding to link configuration such as speed and duplex.
>> IRQ and handler initializations are performed for packet reception.
>> Firmware receives the packet from the wire and stores it into the OCMC
>> queue. Next, it notifies the CPU via interrupt. Upon receiving the
>> interrupt, the CPU will service the IRQ and the packet will be processed
>> by pushing the newly allocated SKB to the upper layers.
>>
>> When the user application wants to transmit a packet, it will invoke
>> sys_send(), which will in turn invoke the PRUETH driver, which will then
>> write the packet into the OCMC queues. PRU firmware will pick up the
>> packet and transmit it onto the wire.
>>
>> Signed-off-by: Roger Quadros <rogerq@xxxxxx>
>> Signed-off-by: Andrew F. Davis <afd@xxxxxx>
>> Signed-off-by: Parvathi Pudi <parvathi@xxxxxxxxxxx>
>> Signed-off-by: Basharath Hussain Khaja <basharath@xxxxxxxxxxx>
>> ---
>>  drivers/net/ethernet/ti/icssm/icssm_prueth.c | 599 ++++++++++++++++++-
>>  drivers/net/ethernet/ti/icssm/icssm_prueth.h |  46 ++
>>  2 files changed, 640 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/ti/icssm/icssm_prueth.c
>> b/drivers/net/ethernet/ti/icssm/icssm_prueth.c
>> index 82ed0e3a0d88..0ba1d16a7a15 100644
>> --- a/drivers/net/ethernet/ti/icssm/icssm_prueth.c
>> +++ b/drivers/net/ethernet/ti/icssm/icssm_prueth.c
>
> [...]
>
>> +/**
>> + * icssm_prueth_tx_enqueue - queue a packet to firmware for transmission
>> + *
>> + * @emac: EMAC data structure
>> + * @skb: packet data buffer
>> + * @queue_id: priority queue id
>> + *
>> + * Return: 0 (Success)
>> + */
>> +static int icssm_prueth_tx_enqueue(struct prueth_emac *emac,
>> +                                   struct sk_buff *skb,
>> +                                   enum prueth_queue_id queue_id)
>> +{
>
> [...]
>
>> +
>> +        /* which port to tx: MII0 or MII1 */
>> +        txport = emac->tx_port_queue;
>> +        src_addr = skb->data;
>> +        pktlen = skb->len;
>> +        /* Get the tx queue */
>> +        queue_desc = emac->tx_queue_descs + queue_id;
>> +        txqueue = &queue_infos[txport][queue_id];
>> +
>> +        buffer_desc_count = txqueue->buffer_desc_end -
>> +                            txqueue->buffer_desc_offset;
>> +        buffer_desc_count /= BD_SIZE;
>> +        buffer_desc_count++;
>> +
>> +        bd_rd_ptr = readw(&queue_desc->rd_ptr);
>> +        bd_wr_ptr = readw(&queue_desc->wr_ptr);
>> +
>> +        /* the PRU firmware deals mostly in pointers already
>> +         * offset into ram, we would like to deal in indexes
>> +         * within the queue we are working with for code
>> +         * simplicity, calculate this here
>> +         */
>> +        write_block = (bd_wr_ptr - txqueue->buffer_desc_offset) / BD_SIZE;
>> +        read_block = (bd_rd_ptr - txqueue->buffer_desc_offset) / BD_SIZE;
>
> Seems like there's a lot of similar code repeated in the rx code
> path.
>
> Maybe there's a way to simplify it all with a helper of some sort?
>

We will revisit the code that is duplicated in more than two places,
factor it into a helper function, and use that helper in both the tx and
rx paths.
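For illustration only, one possible shape for such a helper is sketched
below. The name icssm_get_block_info and the small result struct are
placeholders, not the actual change we will post; the fields mirror
struct prueth_queue_info and the BD_SIZE arithmetic already shown in the
patch:

/* Hypothetical helper: convert the firmware's rd/wr pointers for one
 * queue into block indexes plus a free-block count. Illustrative only.
 */
struct icssm_block_info {
        u16 buffer_desc_count;  /* total BDs in this queue */
        u16 write_block;        /* wr_ptr as a block index */
        u16 read_block;         /* rd_ptr as a block index */
        u16 free_blocks;        /* blocks available for writing */
};

static void icssm_get_block_info(const struct prueth_queue_info *queue,
                                 u16 bd_rd_ptr, u16 bd_wr_ptr,
                                 struct icssm_block_info *info)
{
        u16 count;

        count = (queue->buffer_desc_end - queue->buffer_desc_offset) /
                BD_SIZE;
        count++;

        info->buffer_desc_count = count;
        info->write_block = (bd_wr_ptr - queue->buffer_desc_offset) / BD_SIZE;
        info->read_block = (bd_rd_ptr - queue->buffer_desc_offset) / BD_SIZE;

        if (info->write_block > info->read_block)
                info->free_blocks = count - info->write_block +
                                    info->read_block;
        else if (info->write_block < info->read_block)
                info->free_blocks = info->read_block - info->write_block;
        else    /* rd_ptr == wr_ptr: queue is empty */
                info->free_blocks = count;
}

The tx path could then test pkt_block_size against info->free_blocks,
and the rx path could reuse buffer_desc_count/read_block from the same
call instead of repeating the arithmetic.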
>> +        if (write_block > read_block) {
>> +                free_blocks = buffer_desc_count - write_block;
>> +                free_blocks += read_block;
>> +        } else if (write_block < read_block) {
>> +                free_blocks = read_block - write_block;
>> +        } else { /* they are all free */
>> +                free_blocks = buffer_desc_count;
>> +        }
>> +
>> +        pkt_block_size = DIV_ROUND_UP(pktlen, ICSS_BLOCK_SIZE);
>> +        if (pkt_block_size > free_blocks) /* out of queue space */
>> +                return -ENOBUFS;
>> +
>> +        /* calculate end BD address post write */
>> +        update_block = write_block + pkt_block_size;
>> +
>> +        /* Check for wrap around */
>> +        if (update_block >= buffer_desc_count) {
>> +                update_block %= buffer_desc_count;
>> +                buffer_wrapped = true;
>> +        }
>> +
>> +        /* OCMC RAM is not cached and write order is not important */
>> +        ocmc_ram = (__force void *)emac->prueth->mem[PRUETH_MEM_OCMC].va;
>> +        dst_addr = ocmc_ram + txqueue->buffer_offset +
>> +                   (write_block * ICSS_BLOCK_SIZE);
>> +
>> +        /* Copy the data from socket buffer(DRAM) to PRU buffers(OCMC) */
>> +        if (buffer_wrapped) { /* wrapped around buffer */
>> +                int bytes = (buffer_desc_count - write_block) * ICSS_BLOCK_SIZE;
>> +                int remaining;
>> +
>> +                /* bytes is integral multiple of ICSS_BLOCK_SIZE but
>> +                 * entire packet may have fit within the last BD
>> +                 * if pkt_info.length is not integral multiple of
>> +                 * ICSS_BLOCK_SIZE
>> +                 */
>> +                if (pktlen < bytes)
>> +                        bytes = pktlen;
>> +
>> +                /* copy non-wrapped part */
>> +                memcpy(dst_addr, src_addr, bytes);
>> +
>> +                /* copy wrapped part */
>> +                src_addr += bytes;
>> +                remaining = pktlen - bytes;
>> +                dst_addr = ocmc_ram + txqueue->buffer_offset;
>> +                memcpy(dst_addr, src_addr, remaining);
>> +        } else {
>> +                memcpy(dst_addr, src_addr, pktlen);
>> +        }
>> +
>> +        /* update first buffer descriptor */
>> +        wr_buf_desc = (pktlen << PRUETH_BD_LENGTH_SHIFT) &
>> +                       PRUETH_BD_LENGTH_MASK;
>> +        writel(wr_buf_desc, dram + bd_wr_ptr);
>> +
>> +        /* update the write pointer in this queue descriptor, the firmware
>> +         * polls for this change so this will signal the start of transmission
>> +         */
>> +        update_wr_ptr = txqueue->buffer_desc_offset + (update_block * BD_SIZE);
>> +        writew(update_wr_ptr, &queue_desc->wr_ptr);
>> +
>> +        return 0;
>> +}
>
> [...]
>
>> +
>> +/* get packet from queue
>> + * negative for error
>> + */
>
> The comment above seems superfluous and does not seem to follow the
> format of the tx comment which appears to use kdoc style
>

We will address this in the next version; a possible kernel-doc header is
sketched below, after the quoted function signature.

>> +int icssm_emac_rx_packet(struct prueth_emac *emac, u16 *bd_rd_ptr,
>> +                         struct prueth_packet_info *pkt_info,
>> +                         const struct prueth_queue_info *rxqueue)
>> +{
>
> [...]
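On the kdoc point above: for illustration, a kernel-doc header along the
lines of the tx one might look like the sketch below. The parameter
descriptions are only inferred from the code and would be refined in the
next version:

/**
 * icssm_emac_rx_packet - retrieve one packet from a host receive queue
 * @emac: EMAC data structure
 * @bd_rd_ptr: current read pointer into the queue, updated on return
 * @pkt_info: parsed buffer descriptor information for the packet
 * @rxqueue: host receive queue to read from
 *
 * Return: 0 on success, negative error code otherwise
 */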
>
>> +
>> +        /* the PRU firmware deals mostly in pointers already
>> +         * offset into ram, we would like to deal in indexes
>> +         * within the queue we are working with for code
>> +         * simplicity, calculate this here
>> +         */
>> +        buffer_desc_count = rxqueue->buffer_desc_end -
>> +                            rxqueue->buffer_desc_offset;
>> +        buffer_desc_count /= BD_SIZE;
>> +        buffer_desc_count++;
>> +        read_block = (*bd_rd_ptr - rxqueue->buffer_desc_offset) / BD_SIZE;
>> +        pkt_block_size = DIV_ROUND_UP(pkt_info->length, ICSS_BLOCK_SIZE);
>> +
>> +        /* calculate end BD address post read */
>> +        update_block = read_block + pkt_block_size;
>> +
>> +        /* Check for wrap around */
>> +        if (update_block >= buffer_desc_count) {
>> +                update_block %= buffer_desc_count;
>> +                if (update_block)
>> +                        buffer_wrapped = true;
>> +        }
>> +
>> +        /* calculate new pointer in ram */
>> +        *bd_rd_ptr = rxqueue->buffer_desc_offset + (update_block * BD_SIZE);
>
> Seems like there's a lot of repeated math here and in the above
> function. Maybe this can be simplified with a helper function to
> avoid repeating the math in multiple places?
>

As above, we will factor the repeated pointer/block math into a helper
function and use it in both paths. A rough sketch of how the repeated
wrap-around copy could also be factored out follows below, after the
quoted rx code.

>> +
>> +        /* Pkt len w/ HSR tag removed, If applicable */
>> +        actual_pkt_len = pkt_info->length - start_offset;
>> +
>> +        /* Allocate a socket buffer for this packet */
>> +        skb = netdev_alloc_skb_ip_align(ndev, actual_pkt_len);
>> +        if (!skb) {
>> +                if (netif_msg_rx_err(emac) && net_ratelimit())
>> +                        netdev_err(ndev, "failed rx buffer alloc\n");
>> +                return -ENOMEM;
>> +        }
>> +
>> +        dst_addr = skb->data;
>> +
>> +        /* OCMC RAM is not cached and read order is not important */
>> +        ocmc_ram = (__force void *)emac->prueth->mem[PRUETH_MEM_OCMC].va;
>> +
>> +        /* Get the start address of the first buffer from
>> +         * the read buffer description
>> +         */
>> +        src_addr = ocmc_ram + rxqueue->buffer_offset +
>> +                   (read_block * ICSS_BLOCK_SIZE);
>> +        src_addr += start_offset;
>> +
>> +        /* Copy the data from PRU buffers(OCMC) to socket buffer(DRAM) */
>> +        if (buffer_wrapped) { /* wrapped around buffer */
>> +                int bytes = (buffer_desc_count - read_block) * ICSS_BLOCK_SIZE;
>> +                int remaining;
>> +                /* bytes is integral multiple of ICSS_BLOCK_SIZE but
>> +                 * entire packet may have fit within the last BD
>> +                 * if pkt_info.length is not integral multiple of
>> +                 * ICSS_BLOCK_SIZE
>> +                 */
>> +                if (pkt_info->length < bytes)
>> +                        bytes = pkt_info->length;
>> +
>> +                /* If applicable, account for the HSR tag removed */
>> +                bytes -= start_offset;
>> +
>> +                /* copy non-wrapped part */
>> +                memcpy(dst_addr, src_addr, bytes);
>> +
>> +                /* copy wrapped part */
>> +                dst_addr += bytes;
>> +                remaining = actual_pkt_len - bytes;
>> +
>> +                src_addr = ocmc_ram + rxqueue->buffer_offset;
>> +                memcpy(dst_addr, src_addr, remaining);
>> +                src_addr += remaining;
>> +        } else {
>> +                memcpy(dst_addr, src_addr, actual_pkt_len);
>> +                src_addr += actual_pkt_len;
>> +        }
>> +
>> +        if (!pkt_info->sv_frame) {
>> +                skb_put(skb, actual_pkt_len);
>> +
>> +                /* send packet up the stack */
>> +                skb->protocol = eth_type_trans(skb, ndev);
>> +                local_bh_disable();
>> +                netif_receive_skb(skb);
>> +                local_bh_enable();
>> +        } else {
>> +                dev_kfree_skb_any(skb);
>> +        }
>> +
>> +        /* update stats */
>> +        ndev->stats.rx_bytes += actual_pkt_len;
>> +        ndev->stats.rx_packets++;
>
> See comment below about atomicity.
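Continuing the helper discussion above: the wrap-around copy that both
the tx and rx paths repeat could also be factored out. A rough sketch is
below; icssm_copy_wrapped is a hypothetical name, buf_start/buf_end are
assumed to bound the queue's data region in OCMC, and the rx-only HSR
start_offset adjustment would stay in the caller:

/* Hypothetical helper: copy 'len' bytes between a linear buffer and a
 * circular OCMC data region, wrapping back to buf_start when the copy
 * runs past buf_end. 'to_ocmc' selects the copy direction.
 */
static void icssm_copy_wrapped(void *linear, void *buf_start, void *buf_end,
                               void *pos, unsigned int len, bool to_ocmc)
{
        unsigned int first = len;

        if (pos + len > buf_end)
                first = buf_end - pos;

        if (to_ocmc) {
                memcpy(pos, linear, first);
                memcpy(buf_start, linear + first, len - first);
        } else {
                memcpy(linear, pos, first);
                memcpy(linear + first, buf_start, len - first);
        }
}

The tx path would call it with to_ocmc = true and pos at buffer_offset +
write_block * ICSS_BLOCK_SIZE; the rx path with to_ocmc = false and pos
at the read block (plus start_offset).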
>
>> +        return 0;
>> +}
>> +
>> +/**
>> + * icssm_emac_rx_thread - EMAC Rx interrupt thread handler
>> + * @irq: interrupt number
>> + * @dev_id: pointer to net_device
>> + *
>> + * EMAC Rx Interrupt thread handler - function to process the rx frames in a
>> + * irq thread function. There is only limited buffer at the ingress to
>> + * queue the frames. As the frames are to be emptied as quickly as
>> + * possible to avoid overflow, irq thread is necessary. Current implementation
>> + * based on NAPI poll results in packet loss due to overflow at
>> + * the ingress queues. Industrial use case requires loss free packet
>> + * processing. Tests shows that with threaded irq based processing,
>> + * no overflow happens when receiving at ~92Mbps for MTU sized frames and thus
>> + * meet the requirement for industrial use case.
>
> That's interesting. Any idea why this is the case? Is there not
> enough CPU for softirq/NAPI processing to run or something? I
> suppose I'd imagine that NAPI would run and if data is arriving fast
> enough, the NAPI would be added to the repoll list and processed
> later.
>
> So I guess either there isn't enough CPU or the ingress queues don't
> have many descriptors or something like that ?
>

This is due to a combination of the limited number of buffer descriptors
(a memory constraint) and not having enough CPU.

>> + *
>> + * Return: interrupt handled condition
>> + */
>> +static irqreturn_t icssm_emac_rx_thread(int irq, void *dev_id)
>> +{
>> +        struct net_device *ndev = (struct net_device *)dev_id;
>> +        struct prueth_emac *emac = netdev_priv(ndev);
>> +        struct prueth_queue_desc __iomem *queue_desc;
>> +        const struct prueth_queue_info *rxqueue;
>> +        struct prueth *prueth = emac->prueth;
>> +        struct net_device_stats *ndevstats;
>> +        struct prueth_packet_info pkt_info;
>> +        int start_queue, end_queue;
>> +        void __iomem *shared_ram;
>> +        u16 bd_rd_ptr, bd_wr_ptr;
>> +        u16 update_rd_ptr;
>> +        u8 overflow_cnt;
>> +        u32 rd_buf_desc;
>> +        int used = 0;
>> +        int i, ret;
>> +
>> +        ndevstats = &emac->ndev->stats;
>
> FWIW the docs in include/linux/netdevice.h say:
>
>  /**
>  ...
>   * @stats: Statistics struct, which was left as a legacy, use
>   *         rtnl_link_stats64 instead
>  ...
>   */
>  struct net_device {
>  ...
>          struct net_device_stats stats; /* not used by modern drivers */
>  ...
>  };
>
> perhaps consider using rtnl_link_stats64 as suggested instead?
>
>> +        shared_ram = emac->prueth->mem[PRUETH_MEM_SHARED_RAM].va;
>> +
>> +        start_queue = emac->rx_queue_start;
>> +        end_queue = emac->rx_queue_end;
>> +retry:
>> +        /* search host queues for packets */
>> +        for (i = start_queue; i <= end_queue; i++) {
>> +                queue_desc = emac->rx_queue_descs + i;
>> +                rxqueue = &queue_infos[PRUETH_PORT_HOST][i];
>> +
>> +                overflow_cnt = readb(&queue_desc->overflow_cnt);
>> +                if (overflow_cnt > 0) {
>> +                        emac->ndev->stats.rx_over_errors += overflow_cnt;
>> +                        /* reset to zero */
>> +                        writeb(0, &queue_desc->overflow_cnt);
>> +                }
>> +
>> +                bd_rd_ptr = readw(&queue_desc->rd_ptr);
>> +                bd_wr_ptr = readw(&queue_desc->wr_ptr);
>> +
>> +                /* while packets are available in this queue */
>> +                while (bd_rd_ptr != bd_wr_ptr) {
>> +                        /* get packet info from the read buffer descriptor */
>> +                        rd_buf_desc = readl(shared_ram + bd_rd_ptr);
>> +                        icssm_parse_packet_info(prueth, rd_buf_desc, &pkt_info);
>> +
>> +                        if (pkt_info.length <= 0) {
>> +                                /* a packet length of zero will cause us to
>> +                                 * never move the read pointer ahead, locking
>> +                                 * the driver, so we manually have to move it
>> +                                 * to the write pointer, discarding all
>> +                                 * remaining packets in this queue. This should
>> +                                 * never happen.
>> +                                 */
>> +                                update_rd_ptr = bd_wr_ptr;
>> +                                ndevstats->rx_length_errors++;
>
> See question below.
>
>> +                        } else if (pkt_info.length > EMAC_MAX_FRM_SUPPORT) {
>> +                                /* if the packet is too large we skip it but we
>> +                                 * still need to move the read pointer ahead
>> +                                 * and assume something is wrong with the read
>> +                                 * pointer as the firmware should be filtering
>> +                                 * these packets
>> +                                 */
>> +                                update_rd_ptr = bd_wr_ptr;
>> +                                ndevstats->rx_length_errors++;
>
> in netdevice.h it says:
>
>  * struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
>  *      Called when a user wants to get the network device usage
>  *      statistics. Drivers must do one of the following:
>  *      1. Define @ndo_get_stats64 to fill in a zero-initialised
>  *         rtnl_link_stats64 structure passed by the caller.
>  *      2. Define @ndo_get_stats to update a net_device_stats structure
>  *         (which should normally be dev->stats) and return a pointer to
>  *         it. The structure may be changed asynchronously only if each
>  *         field is written atomically.
>  *      3. Update dev->stats asynchronously and atomically, and define
>  *         neither operation.
>
> I didn't look in the other patches to see if ndo_get_stats is
> defined or not, but are these increments atomic?

This module maintains 32-bit statistics inside the PRU firmware. We will
check the feasibility of using rtnl_link_stats64 and make the appropriate
changes per your suggestion in the next version.

Thanks & Best Regards,
Basharath
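P.S. On the statistics point, for illustration only, option 1 from the
documentation you quoted could look roughly like the sketch below. It
assumes struct prueth_emac gains a hypothetical driver-private 'stats'
member; none of this is in the posted patches:

#include <linux/u64_stats_sync.h>

/* Hypothetical driver-private counters, updated only from the rx
 * threaded-irq context under a u64_stats update section.
 */
struct icssm_emac_stats {
        u64 rx_packets;
        u64 rx_bytes;
        u64 rx_length_errors;
        struct u64_stats_sync syncp;
};

/* Update side (in the rx path):
 *
 *      u64_stats_update_begin(&emac->stats.syncp);
 *      emac->stats.rx_packets++;
 *      emac->stats.rx_bytes += actual_pkt_len;
 *      u64_stats_update_end(&emac->stats.syncp);
 */

static void icssm_emac_ndo_get_stats64(struct net_device *ndev,
                                       struct rtnl_link_stats64 *stats)
{
        struct prueth_emac *emac = netdev_priv(ndev);
        unsigned int start;

        do {
                start = u64_stats_fetch_begin(&emac->stats.syncp);
                stats->rx_packets = emac->stats.rx_packets;
                stats->rx_bytes = emac->stats.rx_bytes;
                stats->rx_length_errors = emac->stats.rx_length_errors;
        } while (u64_stats_fetch_retry(&emac->stats.syncp, start));
}

The syncp would need u64_stats_init() at probe time, the callback would
be hooked up through .ndo_get_stats64, and the 32-bit counters read back
from the PRU firmware could be accumulated into the same structure before
being reported.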