Re: C_CAN/D_CAN bug and fix

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Wolfgang,

On 21/06/18 09:25, Wolfgang Grandegger wrote:
> Hello Joe,
>
> Am 21.06.2018 um 09:55 schrieb Joe Burmeister:
>> Hi Wolfgang
>>
>>
>> On 21/06/18 08:24, Wolfgang Grandegger wrote:
>>> Hello Joe,
>>>
>>> I have some more questions...
>>>
>>> Am 20.06.2018 um 19:00 schrieb Joe Burmeister:
>>>> Hi,
>>>>
>>>> I've bumped into what I think is a chip bug that the C_CAN/D_CAN driver
>>>> isn't handling.
>>>>
>>>> It can get into a state where the chip status register reports it's bus
>>>> off, but the can driver doesn't know, so the bus never gets restarted.
>>>>
>>>> Looks like the chip isn't firing the interrupt or is firing with the
>>>> interrupt register as zero. Either is wrong and means "c_can_poll" is
>>>> never called, and thus the driver never picks up the bus off.
>>>>
>>>> We are turning on/off the can device we are talking to, and we have to
>>>> do this a lot to cause this. But we can get into this state and then the
>>> With on/off you mean "ifconfig up/down"?
>> No, literally power on and power off to the device we are talking to
>> over can.
>> It's power is controlled by a GPIO line on the BBB and part of the
>> normal operation is to turn it on and off.
>> But in the test, we do that a lot to reproduce this bug we only saw once
>> in a blue moon.
>>
>>> Is it always the first bus-off making trouble after you switched on the
>>> device?
>> No, even in the test, most of the time, the test iteration completes
>> without issue.
>>
>>> Does the "bus-off" condition occur frequently?
>> Even with the test, which an iteration lasts about 30 seconds, it can
>> take over 5 minutes.
> I mean: do bus-off conditions occur frequently on the bus? At what rate?

It's when the power is going on or off to the device. We have some
contactors to some big power that probably introduces a fair amount of
noise on connect/disconnect causing can errors. The device we are
talking to gets power from the same circuit. Though it's fine once up,
it is born and dies in a hell fire of noise. But CAN should be ok with that.

>>> May bus-off also occur during the start of the CAN device (ifconfig up)?
>> As far as the c_can/d_can driver is concerned, nothing has happened
>> because it missed the that the chip went to bus off.
>>>> manual fix is to do "ifdown can0 && ifup can0" to sync up the driver and
>>>> the chip. If you don't everything looks fine but nothing you send goes
>>>> out to the bus and you never receive anything.
>>>>
>>>> When this issue bites, the last messages you see in candump are:
>>>>
>>>> can0  20000004   [8]  00 04 00 00 00 00 00 79   ERRORFRAME
>>>> can0  20000004   [8]  00 10 00 00 00 00 00 79   ERRORFRAME
>>>>
>>>> You see this in candump on other iterations of the test, but often see
>>>> the following :
>>>>
>>>> can0  20000040   [8]  00 00 00 00 00 00 00 00   ERRORFRAME
>>>> can0  20000100   [8]  00 00 00 00 00 00 00 00   ERRORFRAME
>>>>
>>>> You obviously see a "c_can_platform 481cc000.can can0: bus-off" and
>>>> "c_can_platform 481cc000.can can0: restarted" in dmesg with the above
>>>> can messages. As I understand it, it's the BBB end that is sending these
>>>> two. When you don't see these two following, there isn't a (lasting
>>>> anyway) detected bus off, so the traffic between the device and the BBB
>>>> starts as normal when power comes on.
>>>>
>>>> What I've done is catch the bus off in "c_can_start_xmit" on a
>>>> "can_send" and if it is an unknown bus off, schedule "c_can_poll" which
>>>> will do what is required. So it self fixes.
>>>>
>>>> I figured even if it's something odd about the device we are talking to
>>>> causing this, it shouldn't be able to get into this state.
>>>>
>>>> This was on 4.4 but I see that 4.18 is basically the same code.
>>>>
>>>> Anyway, this is what we are doing and now I've done due diligence
>>>> passing the information on. :-)
>>>>
>>>> Patch attached.
>>> OK, an extra napi_schedule() finds the bus-off then. Needs more thoughts...
>> After writing this up, and having a bit of a break and a sleep, I woke
>> realizing what might be happening.
>>
>> "c_can_isr" disables the interrupts which are then reenabled at the end
>> of "c_can_poll". I'm guessing this is to avoid "c_can_poll" causing
>> fresh interrupts.
>> But I think the all important interrupt is happening before the
>> interrupts are re-enabled.
>>
>> What it should do is grab the status register before and after the
>> re-enable, and if they are different, or different from the status
>> register value at the start of "c_can_poll", schedule another "c_can_poll".
> It's just an interrupt mask. After re-enabling it, an interrupt should
> be triggered even if pending. Maybe there is something strange going on
> with the "init" bit.
>
> Wolfgang.

I wonder if that pending interrupt system isn't working as expected and
so this bus off is being missed.

I'll have more of a look at the datasheet and the code in this area.

Not had any time on the test setup yet to try, my plan is:

#define STATUS_ERROR_MASK    (STATUS_BOFF & STATUS_EWARN & STATUS_EPASS)

        /* enable all IRQs if we are not in bus off state */
        if (priv->can.state != CAN_STATE_BUS_OFF) {
            c_can_irq_control(priv, true);
            last = curr;
            curr = priv->read_reg(priv, C_CAN_STS_REG);

            if ((curr & STATUS_ERROR_MASK) != (last & STATUS_ERROR_MASK)) {
                netdev_warn(dev, "Possible change during interrupts
disabled, scheduling status check.\n");
                napi_schedule(&priv->napi);
            }
        }

On the end of "c_can_poll".

We only have the one hardware lash up and we are all wanting to test
lots of things.

Joe



>> I will try this when I can get a slot on the test setup.
>>
>>>> Regards,
>>>>
>>>> Joe
>>>>
>>>> P.S. Don't know if
>>>> "http://www.keil.com/dd/docs/datashts/silabs/boschcan_ug.pdf"; is an
>>>> acceptable link for the datasheet, but the URL for the datasheet in the
>>>> code is 404'ed.
>>> I also realized some time ago that the link is broken :(. Your link to
>>> Keil looks good, though.
>>>
>>> Wolfgang.
>>>
>> Shame really there isn't a git repo of datasheet pdf that we can
>> reference instead of links.
> Not sure if there a licensing issues.
>
> Wolfgang.
Yer, it's a separate thing really. It's just every time I do any Linux
driver stuff there is dead links to datasheets I have to hunt for.
Ignore me, just dreaming.


Joe
--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Automotive Discussions]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]     [CAN Bus]

  Powered by Linux