AW: [Issue] Bosch D_CAN echo skb occupied error

"Schmidt, Stefan" <schmidtssstefan@xxxxxxxxxxxxxxxxxxxxxxxx> · Mon, 27 Jan 2025 15:44:26 +0000

Dear Vincent,

> -----Ursprüngliche Nachricht-----
> Von: Vincent Mailhol <mailhol.vincent@xxxxxxxxxx>
> Gesendet: Freitag, 24. Januar 2025 15:16
> An: Schmidt, Stefan <schmidtssstefan@xxxxxxxxxxxxxxxxxxxxxxxx>
> Cc: linux-can@xxxxxxxxxxxxxxx
> Betreff: Re: [Issue] Bosch D_CAN echo skb occupied error
> 
> Hi Stefan,
> 
> Thank you for your great write-up on your finding.
> 
> On Fri. 24 Janv. 2025 à 18:29, Schmidt, Stefan <schmidtssstefan@siemens-
> healthineers.com> a écrit :
> > Dear linux-can Members,
> >
> > I hope this message finds you well. My name is Stefan Schmidt, and I am
> reaching out to seek your guidance regarding the "echo socket buffer occupied
> error".
> 
> Welcome on the mailing list!
Thank you :-)

> 
> > Context:
> > Recently, we updated our kernel from 4.14 to 6.1.43 and our Debian from
> 10 to 12.
> > The system runs on an Altera Cyclone V, which has two Bosch D_CAN CAN
> Controllers on board. We use both. Additionally, it is a dual core system with
> two Arm cores.
> >
> > Issue:
> > After the update, I saw that both CAN devices sporadically print the message
> "BUG! echo_skb %d is occupied!" once into the syslog, with %d always being
> index 0.
> > From there on the affected device was unable to send any messages. A
> restart of the device (ifdown/up) recovered the problem.
> >
> > Steps Taken:
> > After heavily polluting the driver with trace points, I found out that in this
> case, the echo socket buffer is indeed still occupied by an old message on index
> 0.
> > Looking at the preceding usage of index 0, I saw that both functions
> c_can_start_xmit and c_can_do_tx  (drivers/net/can/c_can/c_can_main.c )
> were executed on different cores at the same time.
> > One was called by the tx-interrupt of the CAN controller and was scheduled
> on CPU0.
> > The other one was called by the application / network layer and was
> scheduled on CPU1.
> > As both functions operate on the same resources (variables in RAM and CAN
> controller registers), I think that there should be some kind of synchronization
> mechanism, but I did not find any in the code.
> >
> > Theory:
> > I think that the caching mechanism, that was introduced somewhen after
> the 4.14 version of the kernel, can get compromised by this parallel execution.
> > In the beginning of c_can_start_xmit, the function checks if there is a wrap
> around in the buffer. If this is the case, the message just gets prepared,
> everything gets written to the chip, but the final transmission request is not
> set.
> > ```c
> > // (1) drivers/net/can/c_can/c_can_main.c - c_can_start_xmit cmd &=
> > ~IF_COMM_TXRQST; /* Cache the message */ ``` In the end of
> > c_can_start_xmit this command gets written to the CAN controller.
> > ```c
> > // (2) drivers/net/can/c_can/c_can_main.c - c_can_start_xmit
> > c_can_object_put(dev, IF_TX, obj, cmd); ``` The final trigger to send
> > the message shall then be set in the tx-interrupt, as soon as message from
> object index 15 (last index) was processed, because otherwise, the new
> message on index 0 would bypass the one on index 15 and change the order in
> which they are put on the bus.
> > ```c
> > // (3) drivers/net/can/c_can/c_can_main.c - c_can_do_tx
> > c_can_object_put(dev, IF_NAPI, obj, IF_COMM_TXRQST); ``` My theory is,
> > that as it is possible that both functions run parallel on different CPUs, I think
> that it is possible that this can happen that it happens in this order (1) -> (3) ->
> (2). Which leads to the situation that the transmission request bit is
> immediately reset after being set.
> > To support my theory, I put a spinlock around both critical code sections. As
> of now, this seems to properly solve the issue, but I am aware that this is
> probably the clumsiest way to do it.
> >
> > Request:
> > I am reaching out to you because I have not found anyone else with a similar
> problem on the internet.
> > It would be great if you could share your thought about this problem and
> maybe give some hints for a cleaner solution.
> >
> > Additional Details:
> > Please find my clumsy fix with the spinlock below. I don't think that it is
> ready to be merged to any official repository, but maybe it is a short-term help
> for someone who is in a similar situation.
> 
> So, theoretically, it should be possible to have a lockless queue under the
> conditions that:
> 
>   - there is only one reader at a time
>   - there is only one writer at a time
>   - the queue size is a power of 2
> 

Implementing another queue length than power of 2 in hardware is probably a pain, so I think this point is always true.
Maybe it can come to the problem even on single core systems. Is it possible that the TX soft-irq can interrupt the start_xmit, or is this somehow scheduled atomic?
If it is possible, then the lock is also needed on single core systems.

> This design will then abuse the C unsigned integer wraparound property. The
> c_can/d_can meets all above requirements, so normally, no locks are needed
> here.
> 
> So I think that the core issue is somewhere else. Looking at the recent
> commits, [1] and [2] caught my attention:
> 
> [1] commit 387da6bc7a82 ("can: c_can: cache frames to operate as a true
> FIFO")
> https://git.kernel.org/torvalds/c/387da6bc7a82
> 
> [2] commit 81d192c2ce74 ("can: c_can: don't cache TX messages for C_CAN
> cores")
> https://git.kernel.org/torvalds/c/81d192c2ce74
> 
> Those two are present in 6.1.x on which you are having your problem.
> 
> What triggers me the most is the fact that the D_CAN, on which you are having
> your problems, have a special if condition:
> 
> 
> https://elixir.bootlin.com/linux/v6.12/source/drivers/net/can/c_can/c_can.h
> #L244
> 
> In addition, [2] says that, I quote:
> 
>   the optimization introduced in [1] doesn't properly work on
>   C_CAN, but on D_CAN IP cores. The exact reasons are still
>   unknown.
> 
> The other part of the queue implementation looks good to me. Only this
> c_can_get_tx_free() caught my attention.
> 
> Can you try to revert above commits [1] and [2] and tell us if you still have the
> issue?

Unfortunately, I don't have enough time for those tests, but I am quite sure that they are the reason, because the code looks almost like in 4.14 after reverting those changes.
Those changes [1] and [2] introduced write access to the TX buffer in the tx-irq (c_can_do_tx). This was not the case before those changes.
Maybe it was not considered that the function c_can_do_tx can / is called from another context.
Maybe, someone who uses the C_CAN had the same problems as me and therefore disabled the caching mechanism.