Dear Vincent, > -----Ursprüngliche Nachricht----- > Von: Vincent Mailhol <mailhol.vincent@xxxxxxxxxx> > Gesendet: Freitag, 24. Januar 2025 15:16 > An: Schmidt, Stefan <schmidtssstefan@xxxxxxxxxxxxxxxxxxxxxxxx> > Cc: linux-can@xxxxxxxxxxxxxxx > Betreff: Re: [Issue] Bosch D_CAN echo skb occupied error > > Hi Stefan, > > Thank you for your great write-up on your finding. > > On Fri. 24 Janv. 2025 à 18:29, Schmidt, Stefan <schmidtssstefan@siemens- > healthineers.com> a écrit : > > Dear linux-can Members, > > > > I hope this message finds you well. My name is Stefan Schmidt, and I am > reaching out to seek your guidance regarding the "echo socket buffer occupied > error". > > Welcome on the mailing list! Thank you :-) > > > Context: > > Recently, we updated our kernel from 4.14 to 6.1.43 and our Debian from > 10 to 12. > > The system runs on an Altera Cyclone V, which has two Bosch D_CAN CAN > Controllers on board. We use both. Additionally, it is a dual core system with > two Arm cores. > > > > Issue: > > After the update, I saw that both CAN devices sporadically print the message > "BUG! echo_skb %d is occupied!" once into the syslog, with %d always being > index 0. > > From there on the affected device was unable to send any messages. A > restart of the device (ifdown/up) recovered the problem. > > > > Steps Taken: > > After heavily polluting the driver with trace points, I found out that in this > case, the echo socket buffer is indeed still occupied by an old message on index > 0. > > Looking at the preceding usage of index 0, I saw that both functions > c_can_start_xmit and c_can_do_tx (drivers/net/can/c_can/c_can_main.c ) > were executed on different cores at the same time. > > One was called by the tx-interrupt of the CAN controller and was scheduled > on CPU0. > > The other one was called by the application / network layer and was > scheduled on CPU1. > > As both functions operate on the same resources (variables in RAM and CAN > controller registers), I think that there should be some kind of synchronization > mechanism, but I did not find any in the code. > > > > Theory: > > I think that the caching mechanism, that was introduced somewhen after > the 4.14 version of the kernel, can get compromised by this parallel execution. > > In the beginning of c_can_start_xmit, the function checks if there is a wrap > around in the buffer. If this is the case, the message just gets prepared, > everything gets written to the chip, but the final transmission request is not > set. > > ```c > > // (1) drivers/net/can/c_can/c_can_main.c - c_can_start_xmit cmd &= > > ~IF_COMM_TXRQST; /* Cache the message */ ``` In the end of > > c_can_start_xmit this command gets written to the CAN controller. > > ```c > > // (2) drivers/net/can/c_can/c_can_main.c - c_can_start_xmit > > c_can_object_put(dev, IF_TX, obj, cmd); ``` The final trigger to send > > the message shall then be set in the tx-interrupt, as soon as message from > object index 15 (last index) was processed, because otherwise, the new > message on index 0 would bypass the one on index 15 and change the order in > which they are put on the bus. > > ```c > > // (3) drivers/net/can/c_can/c_can_main.c - c_can_do_tx > > c_can_object_put(dev, IF_NAPI, obj, IF_COMM_TXRQST); ``` My theory is, > > that as it is possible that both functions run parallel on different CPUs, I think > that it is possible that this can happen that it happens in this order (1) -> (3) -> > (2). Which leads to the situation that the transmission request bit is > immediately reset after being set. > > To support my theory, I put a spinlock around both critical code sections. As > of now, this seems to properly solve the issue, but I am aware that this is > probably the clumsiest way to do it. > > > > Request: > > I am reaching out to you because I have not found anyone else with a similar > problem on the internet. > > It would be great if you could share your thought about this problem and > maybe give some hints for a cleaner solution. > > > > Additional Details: > > Please find my clumsy fix with the spinlock below. I don't think that it is > ready to be merged to any official repository, but maybe it is a short-term help > for someone who is in a similar situation. > > So, theoretically, it should be possible to have a lockless queue under the > conditions that: > > - there is only one reader at a time > - there is only one writer at a time > - the queue size is a power of 2 > Implementing another queue length than power of 2 in hardware is probably a pain, so I think this point is always true. Maybe it can come to the problem even on single core systems. Is it possible that the TX soft-irq can interrupt the start_xmit, or is this somehow scheduled atomic? If it is possible, then the lock is also needed on single core systems. > This design will then abuse the C unsigned integer wraparound property. The > c_can/d_can meets all above requirements, so normally, no locks are needed > here. > > So I think that the core issue is somewhere else. Looking at the recent > commits, [1] and [2] caught my attention: > > [1] commit 387da6bc7a82 ("can: c_can: cache frames to operate as a true > FIFO") > https://git.kernel.org/torvalds/c/387da6bc7a82 > > [2] commit 81d192c2ce74 ("can: c_can: don't cache TX messages for C_CAN > cores") > https://git.kernel.org/torvalds/c/81d192c2ce74 > > Those two are present in 6.1.x on which you are having your problem. > > What triggers me the most is the fact that the D_CAN, on which you are having > your problems, have a special if condition: > > > https://elixir.bootlin.com/linux/v6.12/source/drivers/net/can/c_can/c_can.h > #L244 > > In addition, [2] says that, I quote: > > the optimization introduced in [1] doesn't properly work on > C_CAN, but on D_CAN IP cores. The exact reasons are still > unknown. > > The other part of the queue implementation looks good to me. Only this > c_can_get_tx_free() caught my attention. > > Can you try to revert above commits [1] and [2] and tell us if you still have the > issue? Unfortunately, I don't have enough time for those tests, but I am quite sure that they are the reason, because the code looks almost like in 4.14 after reverting those changes. Those changes [1] and [2] introduced write access to the TX buffer in the tx-irq (c_can_do_tx). This was not the case before those changes. Maybe it was not considered that the function c_can_do_tx can / is called from another context. Maybe, someone who uses the C_CAN had the same problems as me and therefore disabled the caching mechanism.