Re: Improving TX for m_can peripherals

Marc Kleine-Budde <mkl@xxxxxxxxxxxxxx> · Wed, 5 May 2021 13:56:24 +0200

On 05.05.2021 11:30:49, Torin Cooper-Bennun wrote:
> I've been testing the TCAN4550 recently with proper kit (no more jumper
> wires, hooray!) and I'm happy to say the RX path is fixed in v5.12 with
> the latest patches, and even with heavy load I see no missed frames or
> errors.

\o/

> However, TX still needs work. It's easy to break the driver due to the
> following logic in m_can_start_xmit():
> 
> | 	if (cdev->tx_skb) {
> | 		netdev_err(dev, "hard_xmit called while tx busy\n");
> | 		return NETDEV_TX_BUSY;
> | 	}
> 
> Regardless of your netif TX queue length or the number of TX buffers
> allocated in the M_CAN core, if you try to transmit too quickly, you
> will hit this. For the application I'm working on, I run into this very
> quickly with real-world scenarios.
> 
> Also, the queue is always stopped before the tx work is queued in
> m_can_start_xmit(), which seems wrong and clearly doesn't solve the
> problem:

The simple approach is to stop the tx queue in ndo_start_xmit() and to
restart it if the skb has been send. If the above error message is
triggered I suppose there's a race condition in the driver, cdev->tx_skb
is cleared _after_ the queue has been restarted.

I think it's somewhere here:

| static netdev_tx_t m_can_tx_handler(struct m_can_classdev *cdev)
| {
[...]
| 		/* Enable TX FIFO element to start transfer  */
| 		m_can_write(cdev, M_CAN_TXBAR, (1 << putidx));

TX complete IRQ can come after this...

| 		/* stop network queue if fifo full */
| 		if (m_can_tx_fifo_full(cdev) ||

This is a SPI transfer and may take a lot of time....

| 		    m_can_next_echo_skb_occupied(dev, putidx))
| 			netif_stop_queue(dev);
[...]
| }

| static void m_can_tx_work_queue(struct work_struct *ws)
| {
| 	struct m_can_classdev *cdev = container_of(ws, struct m_can_classdev,
| 						   tx_work);
| 
| 	m_can_tx_handler(cdev);

... here the tx_skb is set to NULL

| 	cdev->tx_skb = NULL;
| }

I've send a patch (compile time tested, only!) to fix the probem.

> | 	/* Need to stop the queue to avoid numerous requests
> | 	 * from being sent.  Suggested improvement is to create
> | 	 * a queueing mechanism that will queue the skbs and
> | 	 * process them in order.
> | 	 */
> | 	cdev->tx_skb = skb;
> | 	netif_stop_queue(cdev->net);
> | 	queue_work(cdev->tx_wq, &cdev->tx_work);
> 
> 
> So - I'd like to fix this. The comment in the snippet above suggests a
> queueing mechanism. It would be good to hear your take on this, Marc -
> AFAIU you have written a similar mechanism for mcp251xfd. :)

For easier queuing, get rid of the worker. Directly send the can_frame
from the xmit handler. The problem is, you cannot sleep inside it. This
means, you cannot use regmap(), the only thing that works is
spi_async(). And as it's async, you can only write data via SPI, reading
doesn't make sense.

Have a look at the mcp251xfd_start_xmit() function. All data structures
for the SPI messages are prepared in beforehand in
mcp251xfd_tx_ring_init_tx_obj(). The xmit function looks up the correct
data structure, converts the skb to the chip's format
mcp251xfd_tx_obj_from_skb() and then sends the data over spi. Special
care is taken of the handling of the head and tail pointers and the
stopping of the queue to avoid race conditions - see
mcp251xfd_get_tx_free(), mcp251xfd_tx_busy(), netif_stop_queue(),
netif_wake_queue().

Hope that helps,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |
Attachment:
signature.asc

Description: PGP signature