Hi Marc, mkl@xxxxxxxxxxxxxx wrote on Wed, 27 Sep 2023 11:33:32 +0200: > On 27.09.2023 11:30:16, Marc Kleine-Budde wrote: > > On 22.09.2023 17:47:27, Miquel Raynal wrote: > > > Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with > > > a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000 > > > CAN controller reception: the Rx buffer is only 5 messages long, so when > > > the bus loaded (eg. a message every 50us), overrun may easily > > > happen. Upon an overrun situation, due to a possible internal crosstalk > > > situation, the controller enters a frozen state which only can be > > > unlocked with a soft reset (experimentally). The solution was to offload > > > a call to sja1000_start() in a threaded handler. This needs to happen in > > > process context as this operation requires to sleep. sja1000_start() > > > basically enters "reset mode", performs a proper software reset and > > > returns back into "normal mode". > > > > > > Since this fix was introduced, we no longer observe any stalls in > > > reception. However it was sporadically observed that the transmit path > > > would now freeze. Further investigation blamed the fix mentioned above, > > > and especially the reset operation. Reproducing the reset in a loop > > > helped identifying what could possibly go wrong. The sja1000 is a single > > > Tx queue device, which leverages the netdev helpers to process one Tx > > > message at a time. The logic is: the queue is stopped, the message sent > > > to the transceiver, once properly transmitted the controller sets a > > > status bit which triggers an interrupt, in the interrupt handler the > > > transmission status is checked and the queue woken up. Unfortunately, if > > > an overrun happens, we might perform the soft reset precisely between > > > the transmission of the buffer to the transceiver and the advent of the > > > transmission status bit. We would then stop the transmission operation > > > without re-enabling the queue, leading to all further transmissions to > > > be ignored. > > > > > > The reset interrupt can only happen while the device is "open", and > > > after a reset we anyway want to resume normal operations, no matter if a > > > packet to transmit got dropped in the process, so we shall wake up the > > > queue. Restarting the device and waking-up the queue is exactly what > > > sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about > > > the queue state, we must acquire a lock both in the reset handler and in > > > the transmit path to ensure serialization of both operations. As the > > > reset handler might still be called after the transmission of a frame to > > > the transceiver but before it actually gets transmitted, we must ensure > > > we don't leak the skb, so we free it (the behavior is consistent, no > > > matter if there was an skb on the stack or not). > > > > Can you make use of netif_tx_disable() and netif_wake_queue() in > > sja1000_reset_interrupt() instead of the lock? > > ...or netif_tx_lock()/netif_tx_unlock(). As that's also a spinlock behind I guess it would fit. A quick look does not seem to show any specific constraint in using it, so let's go for it. Thanks, Miquèl