On 21.09.2022 09:25:41, dariobin@xxxxxxxxx wrote:
> > On 9/16/22 06:14, Jacob Kroon wrote:
> > ...
> > > What I do know is that if I revert commit:
> > >
> > > "can: c_can: cache frames to operate as a true FIFO"
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
> > >
> > > then everything looks good. I don't get any BUG messages, and the host
> > > has been running overnight without problems, so it seems to have fixed
> > > the network interface lockup as well.
> >
> Here's what I think:
> If one or more messages are cached, the controller has to transmit more frames
> in the unit of time when they can be transmitted (IF_COMM_TXRQST), different from
> when the transmission occurs directly on request from the user space. In the case
> of cached data transmission I therefore think that the controller is more heavily
> loaded. Can this shift the balance ?
>
> >
> > I ran the kernel *with* the commit above, and also with the following patch:
> >
> > > diff --git a/drivers/net/can/c_can/c_can_main.c b/drivers/net/can/c_can/c_can_main.c
> > > index 52671d1ea17d..4375dc70e21f 100644
> > > --- a/drivers/net/can/c_can/c_can_main.c
> > > +++ b/drivers/net/can/c_can/c_can_main.c
> > > @@ -1,3 +1,4 @@
> > > +#define DEBUG
> > >  /*
> > >   * CAN bus driver for Bosch C_CAN controller
> > >   *
> > > @@ -469,8 +470,15 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
> > >          if (c_can_get_tx_free(tx_ring) == 0)
> > >                  netif_stop_queue(dev);
> > >
> > > -        if (idx < c_can_get_tx_tail(tx_ring))
> > > +        netdev_dbg(dev, "JAKR:%d:%d:%d:%d\n", idx,
> > > +                c_can_get_tx_head(tx_ring),
> > > +                c_can_get_tx_tail(tx_ring),
> > > +                c_can_get_tx_free(tx_ring));
> > > +
> > > +        if (idx < c_can_get_tx_tail(tx_ring)) {
> > >                  cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> > > +                netdev_dbg(dev, "JAKR:Caching messages\n");
> > > +        }
> > >
> > >          /* Store the message in the interface so we can call
> > >           * can_put_echo_skb(). We must do this before we enable
> >
> > and I've uploaded the entire log I could capture from /dev/kmsg, right
> > up to the hang, here:
> >
> > https://pastebin.com/6hvAcPc9
> >
> > What looks odd to me right from the start is that sometimes when idx
> > rolls over to 0, and *only* when it rolls over to 0, the CAN frame gets
> > cached because "idx < c_can_get_tx_tail(tx_ring)".
> >
> If the message were not stored but transmitted, the order of transmission
> would not be respected.
>
> >
> > Is it possible there is some difference between c_can and d_can in how
> > the HW buffers are working, which breaks the driver on my particular HW
> > setup ?
> >
>
> I tested the patch on a beaglebone board without encountering any problems.
> There is also a version of the driver I submitted to Xenomai running on a custom
> board without problems. But surely the setup and context is different from yours.
>
> What compatible are you using in your device tree?
> I used "ti,am3352-d_can".

I think Jacob's board has a c_can core, while the beagle bone uses a
d_can. Maybe there's a subtle difference between these cores?

Dario, do you have access to a real c_can core to test?

As reverting 387da6bc7a82 ("can: c_can: cache frames to operate as a
true FIFO") helps to fix Jacob's problem, a temporary solution might be
to only cache frames on d_can cores.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |
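
For anyone following the thread, here is a minimal user-space sketch of the caching condition discussed above and of the "only cache on d_can" idea. It is not the driver code. It assumes a power-of-two ring with free-running head/tail counters that are masked to get an object index, which is what the `idx < c_can_get_tx_tail(tx_ring)` check in the quoted diff suggests. All names (`TX_OBJ_NUM`, `enum can_core`, `queue_frame`, `complete_frame`) are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

#define TX_OBJ_NUM 8u /* assumption: number of TX objects, power of two */

/* Hypothetical core type, standing in for however the driver tells cores apart. */
enum can_core { CORE_C_CAN, CORE_D_CAN };

/* Assumption: head and tail are free-running counters, masked on use. */
struct tx_ring {
    unsigned int head;
    unsigned int tail;
};

static inline unsigned int ring_head(const struct tx_ring *r)
{
    return r->head & (TX_OBJ_NUM - 1u);
}

static inline unsigned int ring_tail(const struct tx_ring *r)
{
    return r->tail & (TX_OBJ_NUM - 1u);
}

static inline unsigned int ring_free(const struct tx_ring *r)
{
    return TX_OBJ_NUM - (r->head - r->tail);
}

/*
 * Queue one frame and report whether it would be cached, i.e. stored
 * without requesting transmission. The index test mirrors the check in
 * the quoted diff; the core test models the proposed temporary solution
 * of caching only on d_can.
 */
static inline bool queue_frame(struct tx_ring *r, enum can_core core)
{
    unsigned int idx = ring_head(r);

    r->head++;
    return core == CORE_D_CAN && idx < ring_tail(r);
}

/* One TX-complete event: the oldest pending object is released. */
static inline void complete_frame(struct tx_ring *r)
{
    r->tail++;
}
```

Under these assumptions, `idx < tail` can only be true once head has wrapped while tail has not: fill all 8 objects, complete 3 of them, and the next frame lands on idx 0 with tail at 3, so it is cached. The sketch does not address how a c_can core would keep frames in order without caching; that would need separate handling in the driver.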