On 21.09.22 09:47, Marc Kleine-Budde wrote:
On 21.09.2022 09:25:41, dariobin@xxxxxxxxx wrote:
On 9/16/22 06:14, Jacob Kroon wrote:
[...] What I do know is that if I revert commit:
"can: c_can: cache frames to operate as a true FIFO"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=387da6bc7a826cc6d532b1c0002b7c7513238d5f
then everything looks good. I don't get any BUG messages, and the host
has been running overnight without problems, so it seems to have fixed
the network interface lockup as well.
Here's what I think:
If one or more messages are cached, the controller has to transmit more frames
per unit of time once their transmission is actually requested (IF_COMM_TXRQST),
unlike when each transmission happens directly on request from user space. In the
case of cached transmission I therefore think that the controller is more heavily
loaded. Can this shift the balance?
I ran the kernel *with* the commit above, and also with the following patch:
diff --git a/drivers/net/can/c_can/c_can_main.c b/drivers/net/can/c_can/c_can_main.c
index 52671d1ea17d..4375dc70e21f 100644
--- a/drivers/net/can/c_can/c_can_main.c
+++ b/drivers/net/can/c_can/c_can_main.c
@@ -1,3 +1,4 @@
+#define DEBUG
 /*
  * CAN bus driver for Bosch C_CAN controller
  *
@@ -469,8 +470,15 @@ static netdev_tx_t c_can_start_xmit(struct sk_buff *skb,
 	if (c_can_get_tx_free(tx_ring) == 0)
 		netif_stop_queue(dev);
 
-	if (idx < c_can_get_tx_tail(tx_ring))
+	netdev_dbg(dev, "JAKR:%d:%d:%d:%d\n", idx,
+		   c_can_get_tx_head(tx_ring),
+		   c_can_get_tx_tail(tx_ring),
+		   c_can_get_tx_free(tx_ring));
+
+	if (idx < c_can_get_tx_tail(tx_ring)) {
 		cmd &= ~IF_COMM_TXRQST; /* Cache the message */
+		netdev_dbg(dev, "JAKR:Caching messages\n");
+	}
 
 	/* Store the message in the interface so we can call
 	 * can_put_echo_skb(). We must do this before we enable
and I've uploaded the entire log I could capture from /dev/kmsg, right
up to the hang, here:
https://pastebin.com/6hvAcPc9
What looks odd to me right from the start is that sometimes when idx
rolls over to 0, and *only* when it rolls over to 0, the CAN frame gets
cached because "idx < c_can_get_tx_tail(tx_ring)".
If the message were transmitted immediately instead of being cached, the order
of transmission would not be preserved.
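To make the wrap-around case concrete, here is a small standalone sketch (not
driver code; the TX object count and the simulated completion pattern are made
up) showing why "idx < tail" only becomes true when the head index wraps past
the last TX message object while older objects are still pending:

/* Standalone illustration, not driver code: the TX object index is
 * head modulo the number of TX objects.  While head and tail are in
 * the same lap, idx >= tail and the frame can be submitted with
 * IF_COMM_TXRQST right away.  Once head wraps while older objects are
 * still pending, idx < tail, and a lower-numbered object with TXRQST
 * set would go out before the older, higher-numbered ones.
 */
#include <stdio.h>

#define TX_OBJ_NUM 16u	/* assumed number of TX message objects */

int main(void)
{
	unsigned int head = 0, tail = 0;

	for (unsigned int i = 0; i < 20; i++) {
		unsigned int idx = head % TX_OBJ_NUM;      /* object this frame lands in */
		unsigned int tail_idx = tail % TX_OBJ_NUM; /* oldest still-pending object */

		if (idx < tail_idx)
			printf("frame %2u -> obj %2u: cache it (would overtake pending obj %u)\n",
			       i, idx, tail_idx);
		else
			printf("frame %2u -> obj %2u: set TXRQST immediately\n",
			       i, idx);

		head++;
		if (i >= 6)	/* pretend completions lag a few frames behind */
			tail++;
	}
	return 0;
}

In this simulated run the first wrapped frame lands in object 0 while objects
10..15 are still pending, which is exactly the point where setting TXRQST
immediately would reorder the FIFO.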
Is it possible there is some difference between c_can and d_can in how
the HW buffers work, which breaks the driver on my particular HW
setup?
I tested the patch on a BeagleBone board without encountering any problems.
There is also a version of the driver I submitted to Xenomai running on a custom
board without problems. But surely the setup and context are different from yours.
Which compatible string are you using in your device tree?
I used "ti,am3352-d_can".
I think Jacob's board has a c_can core, while the beagle bone uses a
d_can. Maybe there's a subtle difference between these cores?
Dario, do you have access to a real c_can core to test?
As reverting 387da6bc7a82 ("can: c_can: cache frames to operate as a
true FIFO") helps to fix Jacob's problem, a temporary solution might be
to only cache frames on d_can cores.
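Untested, but as a rough sketch that temporary solution might amount to something
like the following in c_can_start_xmit() (assuming priv->type / BOSCH_D_CAN can be
used for the distinction, as the driver already does elsewhere; whether simply
skipping the cache path keeps the TX order intact on c_can at wrap-around would
still have to be verified):

	/* rough sketch, untested: only cache frames on D_CAN cores */
	if (priv->type == BOSCH_D_CAN &&
	    idx < c_can_get_tx_tail(tx_ring))
		cmd &= ~IF_COMM_TXRQST; /* Cache the message */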
Btw. I uploaded the 'latest' C_CAN manuals to
https://github.com/linux-can/can-doc
... as they could only be found on archive.org :-/
Unfortunately I was no longer able to find any (latest?) D_CAN manual,
which was originally hosted at
http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/can/d_can_users_manual_111.pdf
Archive.org did not crawl this PDF ;-(
If someone still has this D_CAN PDF please send a URL or the PDF itself
to me, so that I can put it there too.
Thanks,
Oliver