Re: [Issue] Bosch D_CAN echo skb occupied error

Vincent Mailhol <mailhol.vincent@xxxxxxxxxx> · Fri, 24 Jan 2025 23:16:11 +0900

Hi Stefan,

Thank you for your great write-up on your finding.

On Fri. 24 Janv. 2025 à 18:29, Schmidt, Stefan
<schmidtssstefan@xxxxxxxxxxxxxxxxxxxxxxxx> a écrit :
> Dear linux-can Members,
>
> I hope this message finds you well. My name is Stefan Schmidt, and I am reaching out to seek your guidance regarding the "echo socket buffer occupied error".

Welcome on the mailing list!

> Context:
> Recently, we updated our kernel from 4.14 to 6.1.43 and our Debian from 10 to 12.
> The system runs on an Altera Cyclone V, which has two Bosch D_CAN CAN Controllers on board. We use both. Additionally, it is a dual core system with two Arm cores.
>
> Issue:
> After the update, I saw that both CAN devices sporadically print the message "BUG! echo_skb %d is occupied!" once into the syslog, with %d always being index 0.
> From there on the affected device was unable to send any messages. A restart of the device (ifdown/up) recovered the problem.
>
> Steps Taken:
> After heavily polluting the driver with trace points, I found out that in this case, the echo socket buffer is indeed still occupied by an old message on index 0.
> Looking at the preceding usage of index 0, I saw that both functions c_can_start_xmit and c_can_do_tx  (drivers/net/can/c_can/c_can_main.c ) were executed on different cores at the same time.
> One was called by the tx-interrupt of the CAN controller and was scheduled on CPU0.
> The other one was called by the application / network layer and was scheduled on CPU1.
> As both functions operate on the same resources (variables in RAM and CAN controller registers), I think that there should be some kind of synchronization mechanism, but I did not find any in the code.
>
> Theory:
> I think that the caching mechanism, that was introduced somewhen after the 4.14 version of the kernel, can get compromised by this parallel execution.
> In the beginning of c_can_start_xmit, the function checks if there is a wrap around in the buffer. If this is the case, the message just gets prepared, everything gets written to the chip, but the final transmission request is not set.
> ```c
> // (1) drivers/net/can/c_can/c_can_main.c - c_can_start_xmit
> cmd &= ~IF_COMM_TXRQST; /* Cache the message */
> ```
> In the end of c_can_start_xmit this command gets written to the CAN controller.
> ```c
> // (2) drivers/net/can/c_can/c_can_main.c - c_can_start_xmit
> c_can_object_put(dev, IF_TX, obj, cmd);
> ```
> The final trigger to send the message shall then be set in the tx-interrupt, as soon as message from object index 15 (last index) was processed, because otherwise, the new message on index 0 would bypass the one on index 15 and change the order in which they are put on the bus.
> ```c
> // (3) drivers/net/can/c_can/c_can_main.c - c_can_do_tx
> c_can_object_put(dev, IF_NAPI, obj, IF_COMM_TXRQST);
> ```
> My theory is, that as it is possible that both functions run parallel on different CPUs, I think that it is possible that this can happen that it happens in this order (1) -> (3) -> (2). Which leads to the situation that the transmission request bit is immediately reset after being set.
> To support my theory, I put a spinlock around both critical code sections. As of now, this seems to properly solve the issue, but I am aware that this is probably the clumsiest way to do it.
>
> Request:
> I am reaching out to you because I have not found anyone else with a similar problem on the internet.
> It would be great if you could share your thought about this problem and maybe give some hints for a cleaner solution.
>
> Additional Details:
> Please find my clumsy fix with the spinlock below. I don't think that it is ready to be merged to any official repository, but maybe it is a short-term help for someone who is in a similar situation.

So, theoretically, it should be possible to have a lockless queue
under the conditions that:

  - there is only one reader at a time
  - there is only one writer at a time
  - the queue size is a power of 2

This design will then abuse the C unsigned integer wraparound
property. The c_can/d_can meets all above requirements, so normally,
no locks are needed here.

So I think that the core issue is somewhere else. Looking at the
recent commits, [1] and [2] caught my attention:

[1] commit 387da6bc7a82 ("can: c_can: cache frames to operate as a true FIFO")
https://git.kernel.org/torvalds/c/387da6bc7a82

[2] commit 81d192c2ce74 ("can: c_can: don't cache TX messages for C_CAN cores")
https://git.kernel.org/torvalds/c/81d192c2ce74

Those two are present in 6.1.x on which you are having your problem.

What triggers me the most is the fact that the D_CAN, on which you are
having your problems, have a special if condition:

  https://elixir.bootlin.com/linux/v6.12/source/drivers/net/can/c_can/c_can.h#L244

In addition, [2] says that, I quote:

  the optimization introduced in [1] doesn't properly work on
  C_CAN, but on D_CAN IP cores. The exact reasons are still
  unknown.

The other part of the queue implementation looks good to me. Only this
c_can_get_tx_free() caught my attention.

Can you try to revert above commits [1] and [2] and tell us if you
still have the issue?

> Thank you in advance for your time and assistance. I look forward to hearing your suggestions or advice.

Thank you for the detailed report. Let me ask you: do you want to
write the final patch? If yes, we can guide you into the process.

Yours sincerely,
Vincent Mailhol