Re: C_CAN/D_CAN bug and fix

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello Joe,

Am 21.06.2018 um 09:55 schrieb Joe Burmeister:
> Hi Wolfgang
> 
> 
> On 21/06/18 08:24, Wolfgang Grandegger wrote:
>> Hello Joe,
>>
>> I have some more questions...
>>
>> Am 20.06.2018 um 19:00 schrieb Joe Burmeister:
>>> Hi,
>>>
>>> I've bumped into what I think is a chip bug that the C_CAN/D_CAN driver
>>> isn't handling.
>>>
>>> It can get into a state where the chip status register reports it's bus
>>> off, but the can driver doesn't know, so the bus never gets restarted.
>>>
>>> Looks like the chip isn't firing the interrupt or is firing with the
>>> interrupt register as zero. Either is wrong and means "c_can_poll" is
>>> never called, and thus the driver never picks up the bus off.
>>>
>>> We are turning on/off the can device we are talking to, and we have to
>>> do this a lot to cause this. But we can get into this state and then the
>> With on/off you mean "ifconfig up/down"?
> 
> No, literally power on and power off to the device we are talking to
> over can.
> It's power is controlled by a GPIO line on the BBB and part of the
> normal operation is to turn it on and off.
> But in the test, we do that a lot to reproduce this bug we only saw once
> in a blue moon.
> 
>> Is it always the first bus-off making trouble after you switched on the
>> device?
> 
> No, even in the test, most of the time, the test iteration completes
> without issue.
> 
>> Does the "bus-off" condition occur frequently?
> 
> Even with the test, which an iteration lasts about 30 seconds, it can
> take over 5 minutes.

I mean: do bus-off conditions occur frequently on the bus? At what rate?

> 
>> May bus-off also occur during the start of the CAN device (ifconfig up)?
> 
> As far as the c_can/d_can driver is concerned, nothing has happened
> because it missed the that the chip went to bus off.
>>> manual fix is to do "ifdown can0 && ifup can0" to sync up the driver and
>>> the chip. If you don't everything looks fine but nothing you send goes
>>> out to the bus and you never receive anything.
>>>
>>> When this issue bites, the last messages you see in candump are:
>>>
>>> can0  20000004   [8]  00 04 00 00 00 00 00 79   ERRORFRAME
>>> can0  20000004   [8]  00 10 00 00 00 00 00 79   ERRORFRAME
>>>
>>> You see this in candump on other iterations of the test, but often see
>>> the following :
>>>
>>> can0  20000040   [8]  00 00 00 00 00 00 00 00   ERRORFRAME
>>> can0  20000100   [8]  00 00 00 00 00 00 00 00   ERRORFRAME
>>>
>>> You obviously see a "c_can_platform 481cc000.can can0: bus-off" and
>>> "c_can_platform 481cc000.can can0: restarted" in dmesg with the above
>>> can messages. As I understand it, it's the BBB end that is sending these
>>> two. When you don't see these two following, there isn't a (lasting
>>> anyway) detected bus off, so the traffic between the device and the BBB
>>> starts as normal when power comes on.
>>>
>>> What I've done is catch the bus off in "c_can_start_xmit" on a
>>> "can_send" and if it is an unknown bus off, schedule "c_can_poll" which
>>> will do what is required. So it self fixes.
>>>
>>> I figured even if it's something odd about the device we are talking to
>>> causing this, it shouldn't be able to get into this state.
>>>
>>> This was on 4.4 but I see that 4.18 is basically the same code.
>>>
>>> Anyway, this is what we are doing and now I've done due diligence
>>> passing the information on. :-)
>>>
>>> Patch attached.
>> OK, an extra napi_schedule() finds the bus-off then. Needs more thoughts...
> 
> After writing this up, and having a bit of a break and a sleep, I woke
> realizing what might be happening.
> 
> "c_can_isr" disables the interrupts which are then reenabled at the end
> of "c_can_poll". I'm guessing this is to avoid "c_can_poll" causing
> fresh interrupts.
> But I think the all important interrupt is happening before the
> interrupts are re-enabled.
> 
> What it should do is grab the status register before and after the
> re-enable, and if they are different, or different from the status
> register value at the start of "c_can_poll", schedule another "c_can_poll".

It's just an interrupt mask. After re-enabling it, an interrupt should
be triggered even if pending. Maybe there is something strange going on
with the "init" bit.

Wolfgang.

> 
> I will try this when I can get a slot on the test setup.
> 
>>> Regards,
>>>
>>> Joe
>>>
>>> P.S. Don't know if
>>> "http://www.keil.com/dd/docs/datashts/silabs/boschcan_ug.pdf"; is an
>>> acceptable link for the datasheet, but the URL for the datasheet in the
>>> code is 404'ed.
>> I also realized some time ago that the link is broken :(. Your link to
>> Keil looks good, though.
>>
>> Wolfgang.
>>
> 
> Shame really there isn't a git repo of datasheet pdf that we can
> reference instead of links.

Not sure if there a licensing issues.

Wolfgang.
--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Automotive Discussions]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]     [CAN Bus]

  Powered by Linux