Hi Stefan, Thank you for your great write-up on your finding. On Fri. 24 Janv. 2025 à 18:29, Schmidt, Stefan <schmidtssstefan@xxxxxxxxxxxxxxxxxxxxxxxx> a écrit : > Dear linux-can Members, > > I hope this message finds you well. My name is Stefan Schmidt, and I am reaching out to seek your guidance regarding the "echo socket buffer occupied error". Welcome on the mailing list! > Context: > Recently, we updated our kernel from 4.14 to 6.1.43 and our Debian from 10 to 12. > The system runs on an Altera Cyclone V, which has two Bosch D_CAN CAN Controllers on board. We use both. Additionally, it is a dual core system with two Arm cores. > > Issue: > After the update, I saw that both CAN devices sporadically print the message "BUG! echo_skb %d is occupied!" once into the syslog, with %d always being index 0. > From there on the affected device was unable to send any messages. A restart of the device (ifdown/up) recovered the problem. > > Steps Taken: > After heavily polluting the driver with trace points, I found out that in this case, the echo socket buffer is indeed still occupied by an old message on index 0. > Looking at the preceding usage of index 0, I saw that both functions c_can_start_xmit and c_can_do_tx (drivers/net/can/c_can/c_can_main.c ) were executed on different cores at the same time. > One was called by the tx-interrupt of the CAN controller and was scheduled on CPU0. > The other one was called by the application / network layer and was scheduled on CPU1. > As both functions operate on the same resources (variables in RAM and CAN controller registers), I think that there should be some kind of synchronization mechanism, but I did not find any in the code. > > Theory: > I think that the caching mechanism, that was introduced somewhen after the 4.14 version of the kernel, can get compromised by this parallel execution. > In the beginning of c_can_start_xmit, the function checks if there is a wrap around in the buffer. If this is the case, the message just gets prepared, everything gets written to the chip, but the final transmission request is not set. > ```c > // (1) drivers/net/can/c_can/c_can_main.c - c_can_start_xmit > cmd &= ~IF_COMM_TXRQST; /* Cache the message */ > ``` > In the end of c_can_start_xmit this command gets written to the CAN controller. > ```c > // (2) drivers/net/can/c_can/c_can_main.c - c_can_start_xmit > c_can_object_put(dev, IF_TX, obj, cmd); > ``` > The final trigger to send the message shall then be set in the tx-interrupt, as soon as message from object index 15 (last index) was processed, because otherwise, the new message on index 0 would bypass the one on index 15 and change the order in which they are put on the bus. > ```c > // (3) drivers/net/can/c_can/c_can_main.c - c_can_do_tx > c_can_object_put(dev, IF_NAPI, obj, IF_COMM_TXRQST); > ``` > My theory is, that as it is possible that both functions run parallel on different CPUs, I think that it is possible that this can happen that it happens in this order (1) -> (3) -> (2). Which leads to the situation that the transmission request bit is immediately reset after being set. > To support my theory, I put a spinlock around both critical code sections. As of now, this seems to properly solve the issue, but I am aware that this is probably the clumsiest way to do it. > > Request: > I am reaching out to you because I have not found anyone else with a similar problem on the internet. > It would be great if you could share your thought about this problem and maybe give some hints for a cleaner solution. > > Additional Details: > Please find my clumsy fix with the spinlock below. I don't think that it is ready to be merged to any official repository, but maybe it is a short-term help for someone who is in a similar situation. So, theoretically, it should be possible to have a lockless queue under the conditions that: - there is only one reader at a time - there is only one writer at a time - the queue size is a power of 2 This design will then abuse the C unsigned integer wraparound property. The c_can/d_can meets all above requirements, so normally, no locks are needed here. So I think that the core issue is somewhere else. Looking at the recent commits, [1] and [2] caught my attention: [1] commit 387da6bc7a82 ("can: c_can: cache frames to operate as a true FIFO") https://git.kernel.org/torvalds/c/387da6bc7a82 [2] commit 81d192c2ce74 ("can: c_can: don't cache TX messages for C_CAN cores") https://git.kernel.org/torvalds/c/81d192c2ce74 Those two are present in 6.1.x on which you are having your problem. What triggers me the most is the fact that the D_CAN, on which you are having your problems, have a special if condition: https://elixir.bootlin.com/linux/v6.12/source/drivers/net/can/c_can/c_can.h#L244 In addition, [2] says that, I quote: the optimization introduced in [1] doesn't properly work on C_CAN, but on D_CAN IP cores. The exact reasons are still unknown. The other part of the queue implementation looks good to me. Only this c_can_get_tx_free() caught my attention. Can you try to revert above commits [1] and [2] and tell us if you still have the issue? > Thank you in advance for your time and assistance. I look forward to hearing your suggestions or advice. Thank you for the detailed report. Let me ask you: do you want to write the final patch? If yes, we can guide you into the process. Yours sincerely, Vincent Mailhol