Hello Joe,

On 21.06.2018 at 09:55, Joe Burmeister wrote:
> Hi Wolfgang
>
>
> On 21/06/18 08:24, Wolfgang Grandegger wrote:
>> Hello Joe,
>>
>> I have some more questions...
>>
>> On 20.06.2018 at 19:00, Joe Burmeister wrote:
>>> Hi,
>>>
>>> I've bumped into what I think is a chip bug that the C_CAN/D_CAN driver
>>> isn't handling.
>>>
>>> It can get into a state where the chip status register reports it's bus
>>> off, but the can driver doesn't know, so the bus never gets restarted.
>>>
>>> Looks like the chip isn't firing the interrupt or is firing with the
>>> interrupt register as zero. Either is wrong and means "c_can_poll" is
>>> never called, and thus the driver never picks up the bus off.
>>>
>>> We are turning on/off the can device we are talking to, and we have to
>>> do this a lot to cause this. But we can get into this state and then the
>> With on/off you mean "ifconfig up/down"?
>
> No, literally power on and power off to the device we are talking to
> over can.
> Its power is controlled by a GPIO line on the BBB and part of the
> normal operation is to turn it on and off.
> But in the test, we do that a lot to reproduce this bug we only saw once
> in a blue moon.
>
>> Is it always the first bus-off making trouble after you switched on the
>> device?
>
> No, even in the test, most of the time, the test iteration completes
> without issue.
>
>> Does the "bus-off" condition occur frequently?
>
> Even with the test, in which an iteration lasts about 30 seconds, it can
> take over 5 minutes.

I mean: do bus-off conditions occur frequently on the bus? At what rate?

>
>> May bus-off also occur during the start of the CAN device (ifconfig up)?
>
> As far as the c_can/d_can driver is concerned, nothing has happened
> because it missed that the chip went to bus off.
>>> manual fix is to do "ifdown can0 && ifup can0" to sync up the driver and
>>> the chip.
>>> If you don't everything looks fine but nothing you send goes
>>> out to the bus and you never receive anything.
>>>
>>> When this issue bites, the last messages you see in candump are:
>>>
>>> can0 20000004 [8] 00 04 00 00 00 00 00 79 ERRORFRAME
>>> can0 20000004 [8] 00 10 00 00 00 00 00 79 ERRORFRAME
>>>
>>> You see this in candump on other iterations of the test, but often see
>>> the following :
>>>
>>> can0 20000040 [8] 00 00 00 00 00 00 00 00 ERRORFRAME
>>> can0 20000100 [8] 00 00 00 00 00 00 00 00 ERRORFRAME
>>>
>>> You obviously see a "c_can_platform 481cc000.can can0: bus-off" and
>>> "c_can_platform 481cc000.can can0: restarted" in dmesg with the above
>>> can messages. As I understand it, it's the BBB end that is sending these
>>> two. When you don't see these two following, there isn't a (lasting
>>> anyway) detected bus off, so the traffic between the device and the BBB
>>> starts as normal when power comes on.
>>>
>>> What I've done is catch the bus off in "c_can_start_xmit" on a
>>> "can_send" and if it is an unknown bus off, schedule "c_can_poll" which
>>> will do what is required. So it self fixes.
>>>
>>> I figured even if it's something odd about the device we are talking to
>>> causing this, it shouldn't be able to get into this state.
>>>
>>> This was on 4.4 but I see that 4.18 is basically the same code.
>>>
>>> Anyway, this is what we are doing and now I've done due diligence
>>> passing the information on. :-)
>>>
>>> Patch attached.
>> OK, an extra napi_schedule() finds the bus-off then. Needs more thoughts...
>
> After writing this up, and having a bit of a break and a sleep, I woke
> realizing what might be happening.
>
> "c_can_isr" disables the interrupts which are then reenabled at the end
> of "c_can_poll". I'm guessing this is to avoid "c_can_poll" causing
> fresh interrupts.
> But I think the all important interrupt is happening before the
> interrupts are re-enabled.
>
> What it should do is grab the status register before and after the
> re-enable, and if they are different, or different from the status
> register value at the start of "c_can_poll", schedule another "c_can_poll".

It's just an interrupt mask. After re-enabling it, an interrupt should be
triggered even if pending. Maybe there is something strange going on with
the "init" bit.

Wolfgang.

>
> I will try this when I can get a slot on the test setup.
>
>>> Regards,
>>>
>>> Joe
>>>
>>> P.S. Don't know if
>>> "http://www.keil.com/dd/docs/datashts/silabs/boschcan_ug.pdf" is an
>>> acceptable link for the datasheet, but the URL for the datasheet in the
>>> code is 404'ed.
>> I also realized some time ago that the link is broken :(. Your link to
>> Keil looks good, though.
>>
>> Wolfgang.
>>
>
> Shame really there isn't a git repo of datasheet PDFs that we can
> reference instead of links.

Not sure if there are licensing issues.

Wolfgang.
--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
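
For reference, the four candump error frames quoted in this thread can be decoded against the error-class bits from include/uapi/linux/can/error.h. The following is only a minimal sketch: the constant values are copied from that header so it builds without kernel headers, while the helpers append() and describe_error_frame() and the short label strings are hypothetical names made up for illustration and are not part of the c_can driver or of can-utils.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Values copied from include/uapi/linux/can/error.h */
#define CAN_ERR_FLAG            0x20000000U /* frame is an error frame */
#define CAN_ERR_CRTL            0x00000004U /* controller problem, see data[1] */
#define CAN_ERR_BUSOFF          0x00000040U /* bus off */
#define CAN_ERR_RESTARTED       0x00000100U /* controller restarted */

#define CAN_ERR_CRTL_RX_WARNING 0x04 /* reached warning level for RX errors */
#define CAN_ERR_CRTL_TX_WARNING 0x08 /* reached warning level for TX errors */
#define CAN_ERR_CRTL_RX_PASSIVE 0x10 /* reached error passive for RX */
#define CAN_ERR_CRTL_TX_PASSIVE 0x20 /* reached error passive for TX */

/* Hypothetical helper: append a label to buf, comma separated. */
static void append(char *buf, size_t len, const char *s)
{
	size_t used = strlen(buf);

	if (used < len)
		snprintf(buf + used, len - used, "%s%s", used ? ", " : "", s);
}

/* Hypothetical helper: write a short description of one error frame. */
static void describe_error_frame(uint32_t can_id, const uint8_t data[8],
				 char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	if (!(can_id & CAN_ERR_FLAG)) {
		append(buf, len, "not an error frame");
		return;
	}
	if (can_id & CAN_ERR_CRTL) {
		if (data[1] & CAN_ERR_CRTL_RX_WARNING)
			append(buf, len, "rx-warning");
		if (data[1] & CAN_ERR_CRTL_TX_WARNING)
			append(buf, len, "tx-warning");
		if (data[1] & CAN_ERR_CRTL_RX_PASSIVE)
			append(buf, len, "rx-passive");
		if (data[1] & CAN_ERR_CRTL_TX_PASSIVE)
			append(buf, len, "tx-passive");
	}
	if (can_id & CAN_ERR_BUSOFF)
		append(buf, len, "bus-off");
	if (can_id & CAN_ERR_RESTARTED)
		append(buf, len, "restarted");
	if (buf[0] == '\0')
		append(buf, len, "other error class");
}
```

Read this way, the two frames seen last before the lock-up (ID 20000004 with data[1] of 04 and then 10) are controller state changes to RX warning and RX error passive, and the pair with IDs 20000040 and 20000100 are the bus-off and restarted notifications. That matches the description above: in the failing case no bus-off frame is ever delivered to user space.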