Re: C_CAN/D_CAN bug and fix

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Wolfgang


On 21/06/18 08:24, Wolfgang Grandegger wrote:
> Hello Joe,
>
> I have some more questions...
>
> Am 20.06.2018 um 19:00 schrieb Joe Burmeister:
>> Hi,
>>
>> I've bumped into what I think is a chip bug that the C_CAN/D_CAN driver
>> isn't handling.
>>
>> It can get into a state where the chip status register reports it's bus
>> off, but the can driver doesn't know, so the bus never gets restarted.
>>
>> Looks like the chip isn't firing the interrupt or is firing with the
>> interrupt register as zero. Either is wrong and means "c_can_poll" is
>> never called, and thus the driver never picks up the bus off.
>>
>> We are turning on/off the can device we are talking to, and we have to
>> do this a lot to cause this. But we can get into this state and then the
> With on/off you mean "ifconfig up/down"?

No, literally power on and power off to the device we are talking to
over can.
It's power is controlled by a GPIO line on the BBB and part of the
normal operation is to turn it on and off.
But in the test, we do that a lot to reproduce this bug we only saw once
in a blue moon.

> Is it always the first bus-off making trouble after you switched on the
> device?

No, even in the test, most of the time, the test iteration completes
without issue.

> Does the "bus-off" condition occur frequently?

Even with the test, which an iteration lasts about 30 seconds, it can
take over 5 minutes.

> May bus-off also occur during the start of the CAN device (ifconfig up)?

As far as the c_can/d_can driver is concerned, nothing has happened
because it missed the that the chip went to bus off.
>> manual fix is to do "ifdown can0 && ifup can0" to sync up the driver and
>> the chip. If you don't everything looks fine but nothing you send goes
>> out to the bus and you never receive anything.
>>
>> When this issue bites, the last messages you see in candump are:
>>
>> can0  20000004   [8]  00 04 00 00 00 00 00 79   ERRORFRAME
>> can0  20000004   [8]  00 10 00 00 00 00 00 79   ERRORFRAME
>>
>> You see this in candump on other iterations of the test, but often see
>> the following :
>>
>> can0  20000040   [8]  00 00 00 00 00 00 00 00   ERRORFRAME
>> can0  20000100   [8]  00 00 00 00 00 00 00 00   ERRORFRAME
>>
>> You obviously see a "c_can_platform 481cc000.can can0: bus-off" and
>> "c_can_platform 481cc000.can can0: restarted" in dmesg with the above
>> can messages. As I understand it, it's the BBB end that is sending these
>> two. When you don't see these two following, there isn't a (lasting
>> anyway) detected bus off, so the traffic between the device and the BBB
>> starts as normal when power comes on.
>>
>> What I've done is catch the bus off in "c_can_start_xmit" on a
>> "can_send" and if it is an unknown bus off, schedule "c_can_poll" which
>> will do what is required. So it self fixes.
>>
>> I figured even if it's something odd about the device we are talking to
>> causing this, it shouldn't be able to get into this state.
>>
>> This was on 4.4 but I see that 4.18 is basically the same code.
>>
>> Anyway, this is what we are doing and now I've done due diligence
>> passing the information on. :-)
>>
>> Patch attached.
> OK, an extra napi_schedule() finds the bus-off then. Needs more thoughts...

After writing this up, and having a bit of a break and a sleep, I woke
realizing what might be happening.

"c_can_isr" disables the interrupts which are then reenabled at the end
of "c_can_poll". I'm guessing this is to avoid "c_can_poll" causing
fresh interrupts.
But I think the all important interrupt is happening before the
interrupts are re-enabled.

What it should do is grab the status register before and after the
re-enable, and if they are different, or different from the status
register value at the start of "c_can_poll", schedule another "c_can_poll".

I will try this when I can get a slot on the test setup.

>> Regards,
>>
>> Joe
>>
>> P.S. Don't know if
>> "http://www.keil.com/dd/docs/datashts/silabs/boschcan_ug.pdf"; is an
>> acceptable link for the datasheet, but the URL for the datasheet in the
>> code is 404'ed.
> I also realized some time ago that the link is broken :(. Your link to
> Keil looks good, though.
>
> Wolfgang.
>

Shame really there isn't a git repo of datasheet pdf that we can
reference instead of links.
Not great to store binaries in git, but each datasheet will only have a
few revisions....


Joe
--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Automotive Discussions]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]     [CAN Bus]

  Powered by Linux