Hi Wolfgang On 21/06/18 08:24, Wolfgang Grandegger wrote: > Hello Joe, > > I have some more questions... > > Am 20.06.2018 um 19:00 schrieb Joe Burmeister: >> Hi, >> >> I've bumped into what I think is a chip bug that the C_CAN/D_CAN driver >> isn't handling. >> >> It can get into a state where the chip status register reports it's bus >> off, but the can driver doesn't know, so the bus never gets restarted. >> >> Looks like the chip isn't firing the interrupt or is firing with the >> interrupt register as zero. Either is wrong and means "c_can_poll" is >> never called, and thus the driver never picks up the bus off. >> >> We are turning on/off the can device we are talking to, and we have to >> do this a lot to cause this. But we can get into this state and then the > With on/off you mean "ifconfig up/down"? No, literally power on and power off to the device we are talking to over can. It's power is controlled by a GPIO line on the BBB and part of the normal operation is to turn it on and off. But in the test, we do that a lot to reproduce this bug we only saw once in a blue moon. > Is it always the first bus-off making trouble after you switched on the > device? No, even in the test, most of the time, the test iteration completes without issue. > Does the "bus-off" condition occur frequently? Even with the test, which an iteration lasts about 30 seconds, it can take over 5 minutes. > May bus-off also occur during the start of the CAN device (ifconfig up)? As far as the c_can/d_can driver is concerned, nothing has happened because it missed the that the chip went to bus off. >> manual fix is to do "ifdown can0 && ifup can0" to sync up the driver and >> the chip. If you don't everything looks fine but nothing you send goes >> out to the bus and you never receive anything. >> >> When this issue bites, the last messages you see in candump are: >> >> can0 20000004 [8] 00 04 00 00 00 00 00 79 ERRORFRAME >> can0 20000004 [8] 00 10 00 00 00 00 00 79 ERRORFRAME >> >> You see this in candump on other iterations of the test, but often see >> the following : >> >> can0 20000040 [8] 00 00 00 00 00 00 00 00 ERRORFRAME >> can0 20000100 [8] 00 00 00 00 00 00 00 00 ERRORFRAME >> >> You obviously see a "c_can_platform 481cc000.can can0: bus-off" and >> "c_can_platform 481cc000.can can0: restarted" in dmesg with the above >> can messages. As I understand it, it's the BBB end that is sending these >> two. When you don't see these two following, there isn't a (lasting >> anyway) detected bus off, so the traffic between the device and the BBB >> starts as normal when power comes on. >> >> What I've done is catch the bus off in "c_can_start_xmit" on a >> "can_send" and if it is an unknown bus off, schedule "c_can_poll" which >> will do what is required. So it self fixes. >> >> I figured even if it's something odd about the device we are talking to >> causing this, it shouldn't be able to get into this state. >> >> This was on 4.4 but I see that 4.18 is basically the same code. >> >> Anyway, this is what we are doing and now I've done due diligence >> passing the information on. :-) >> >> Patch attached. > OK, an extra napi_schedule() finds the bus-off then. Needs more thoughts... After writing this up, and having a bit of a break and a sleep, I woke realizing what might be happening. "c_can_isr" disables the interrupts which are then reenabled at the end of "c_can_poll". I'm guessing this is to avoid "c_can_poll" causing fresh interrupts. But I think the all important interrupt is happening before the interrupts are re-enabled. What it should do is grab the status register before and after the re-enable, and if they are different, or different from the status register value at the start of "c_can_poll", schedule another "c_can_poll". I will try this when I can get a slot on the test setup. >> Regards, >> >> Joe >> >> P.S. Don't know if >> "http://www.keil.com/dd/docs/datashts/silabs/boschcan_ug.pdf" is an >> acceptable link for the datasheet, but the URL for the datasheet in the >> code is 404'ed. > I also realized some time ago that the link is broken :(. Your link to > Keil looks good, though. > > Wolfgang. > Shame really there isn't a git repo of datasheet pdf that we can reference instead of links. Not great to store binaries in git, but each datasheet will only have a few revisions.... Joe -- To unsubscribe from this list: send the line "unsubscribe linux-can" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html