Re: [RFC PATCH] can: xilinx_can: fix chip_start failure with invalid bus

Anssi Hannula <anssi.hannula@xxxxxxxxxx> · Mon, 18 Mar 2019 15:38:44 +0200

On 18.3.2019 13:25, Appana Durga Kedareswara Rao wrote:
> Hi Anssi Hannula,
>
>> -----Original Message-----
>> From: Anssi Hannula <anssi.hannula@xxxxxxxxxx>
>> Sent: Monday, March 18, 2019 4:52 PM
>>
>>
>> On 12.3.2019 13:27, Appana Durga Kedareswara Rao wrote:
>>>> On 07. 03. 19 13:52, Anssi Hannula wrote:
>>>>> Currently the xilinx_can xcan_chip_start() function, called from
>>>>> .ndo_open() and via CAN_MODE_START (bus-off restart), waits for the
>>>>> SR register to show the wanted operating state, with a 1 sec timeout.
>>>>>
>>>>> However, that register bit will only be set once the HW has observed
>>>>> 11 consecutive recessive bits (BusIdle) on the bus.
>>>>>
>>>>> If the bus will not see the 11 bits (e.g. it is stuck dominant), the
>>>>> function will timeout and return an error.
>>>>> If this was done as part of a scheduled restart from bus-off, the
>>>>> interface will stay in bus-off state forever even if the bus
>>>>> recovers later.
>>>>> If this was done during interface .ndo_open(), the interface will
>>>>> stay down.
>>>>>
>>>>> According to M_CAN and FLEXCAN documentation they also wait for 11
>>>>> consecutive recessive bits, but their drivers do not seem to wait
>>>>> for that.
>>>>>
>>>>> To make the behavior consistent, modify xilinx_can to also not wait
>>>>> for the synchronization to complete.
>>>>>
>>>>> The only way for users to know for sure that the bus has been joined
>>>>> successfully is to see successfully received or transmitted frames.
>>>>> That does not seem optimal, but it is consistent with other drivers
>>>>> and we should have a properly working restart-ms with xilinx_can.
>>>>>
>>>>> Tested on ZynqMP with Xilinx CAN-FD 1.0.
>>>>>
>>>>> Signed-off-by: Anssi Hannula <anssi.hannula@xxxxxxxxxx>
>>>>> ---
>>>>>
>>>>> Not really sure if this way is the preferred way forward.
>>>>>
>>>>> One alternative way would be to keep the wait but log a warning and
>>>>> allow the function to succeed?
>>>>>
>>>>> But I believe something needs to be done, as it seems a reasonable
>>>>> expectation that if an interface is brought up with restart-ms, the
>>>>> interface should recover to a working state even after periods of
>>>>> bus being in whatever invalid state.
>>>>>
>>>>>
>>>>> The HW waiting is described for Xilinx HW in [1] "Register
>>>>> Configuration Sequence", for M_CAN in [2] 3.1.1 Software
>>>>> Initialization, for flexcan in [3] 34.4.10.1 Freeze mode.
>>>>>
>>>>> [1]
>>>>>
>> https://www.xilinx.com/support/documentation/ip_documentation/canfd/v
>>>> 2
>>>>> _0/pg223-canfd.pdf [2]
>>>>> http://www.bosch-
>>>> semiconductors.com/media/ip_modules/pdf_2/m_can/mcan_
>>>>> users_manual_v3212.pdf [3]
>>>>> https://www.nxp.com/docs/en/reference-manual/MC56F8458XRM.pdf
>>>>>
>>>>>
>>>>>  drivers/net/can/xilinx_can.c | 13 +------------
>>>>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/can/xilinx_can.c
>>>>> b/drivers/net/can/xilinx_can.c index 97d0933d9bd9..13abc9e1a7ed
>>>>> 100644
>>>>> --- a/drivers/net/can/xilinx_can.c
>>>>> +++ b/drivers/net/can/xilinx_can.c
>>>>> @@ -400,9 +400,8 @@ static int xcan_set_bittiming(struct net_device
>>>>> *ndev)  static int xcan_chip_start(struct net_device *ndev)  {
>>>>>  	struct xcan_priv *priv = netdev_priv(ndev);
>>>>> -	u32 reg_msr, reg_sr_mask;
>>>>> +	u32 reg_msr;
>>>>>  	int err;
>>>>> -	unsigned long timeout;
>>>>>  	u32 ier;
>>>>>
>>>>>  	/* Check if it is in reset mode */ @@ -428,10 +427,8 @@ static int
>>>>> xcan_chip_start(struct net_device *ndev)
>>>>>  	/* Check whether it is loopback mode or normal mode  */
>>>>>  	if (priv->can.ctrlmode & CAN_CTRLMODE_LOOPBACK) {
>>>>>  		reg_msr = XCAN_MSR_LBACK_MASK;
>>>>> -		reg_sr_mask = XCAN_SR_LBACK_MASK;
>>>>>  	} else {
>>>>>  		reg_msr = 0x0;
>>>>> -		reg_sr_mask = XCAN_SR_NORMAL_MASK;
>>>>>  	}
>>>>>
>>>>>  	/* enable the first extended filter, if any, as cores with
>>>>> extended @@ -443,14 +440,6 @@ static int xcan_chip_start(struct
>> net_device *ndev)
>>>>>  	priv->write_reg(priv, XCAN_MSR_OFFSET, reg_msr);
>>>>>  	priv->write_reg(priv, XCAN_SRR_OFFSET, XCAN_SRR_CEN_MASK);
>>>>>
>>>>> -	timeout = jiffies + XCAN_TIMEOUT;
>>>>> -	while (!(priv->read_reg(priv, XCAN_SR_OFFSET) & reg_sr_mask)) {
>>>>> -		if (time_after(jiffies, timeout)) {
>>>>> -			netdev_warn(ndev,
>>>>> -				"timed out for correct mode\n");
>>>>> -			return -ETIMEDOUT;
>>>>> -		}
>>>>> -	}
>>> Instead of deleting the timeout loop we can check for the bus off state
>> here(SR Register -- Error Status fields).
>>> If the controller is in bus off state skip this timeout loop...
>> Unfortunately the SR and ESR registers read zero in this situation until the 11
>> recessive bits have been seen, so that would not work.
> Could you please provide the steps to reproduce this issue?? 
>

Sure. Below are instructions for reproducing this by just using CAN FD
cores to disrupt the bus.
I believe the instructions will likely work with other Xilinx CAN cores,
too (and the secondary CAN interface(s) probably don't need to be Xilinx).

Scenario A - Failure to bring interface up with invalid bus state:

1. Connect 2 CAN FD cores to a properly terminated bus with no other
participants.

2. Start one interface at 1000k and start sending frames (well, at least
try to send).

# ip link set dev can2 up type can bitrate 1000000 restart-ms 2000
# cangen can2 -I 42A -p 100

3. Try to enable another interface at 125k - the other interface we
started earlier will prevent this interface from seeing the 11 recessive
bits:

# ip link set dev can1 up type can bitrate 125000 restart-ms 2000
RTNETLINK answers: Connection timed out

[   49.276765] xilinx_can a0002000.canfd can1: timed out for correct mode
[   49.283295] xilinx_can a0002000.canfd can1: xcan_chip_start failed!
[   49.289629] A link change request failed with some changes committed
already. Interface can1 may have been left with an inconsistent
configuration, please check.

Scenario B - A temporary bus fault causes the interface to never come
back up again even with restart-ms:

We will trigger a bus-off and have the bus be in the scenario A state
when the restart is being done.

1. Connect 3 CAN FD cores to a properly terminated bus with no other
participants.

2. Start interfaces:

# ip link set dev can1 up type can bitrate 125000 restart-ms 16000
# ip link set dev can2 up type can bitrate 125000 restart-ms 16000
# ip link set dev can3 up type can bitrate 125000 restart-ms 16000

3. Start observing can1:

# candump -D -cc can1

4. Verify that bus is OK and a frame can be sent. It should be seen in
the above candump output.

# cansend can1 "123#11223344"

5. Make can1 enter bus-off. Here we do it by changing the bitrate of the
2 other interfaces and adding traffic

# ip link set dev can2 down
# ip link set dev can2 up type can bitrate 1000000 restart-ms 2000
# ip link set dev can3 down
# ip link set dev can3 up type can bitrate 1000000 restart-ms 2000
# cangen can2 -I 42A -p 100 -g 1
(keep cangen running)

and then trying to send a frame from can1:

# cansend can1 "123#11223345"

The frame will (expectedly) not show up in candump and

# ip -d link show can1

should show the state be in BUS-OFF soon after the command.

[   64.777854] xilinx_can a0002000.canfd can1: bus-off

6. Within 16 sec (restart-ms) of the above cansend command, stop can3 to
get the bus into a state that prevents can1 restart (i.e. it will not
see 11 recessive bits):

# ip link set dev can3 down

When the restart-ms triggers, you will get:

[   81.864783] xilinx_can a0002000.canfd can1: timed out for correct mode
[   81.871309] xilinx_can a0002000.canfd can1: xcan_chip_start failed!
[   81.877574] xilinx_can a0002000.canfd can1: Error -110 during restart

7. Bring the bus back to a valid state:

# ip link set dev can2 down
# ip link set dev can2 up type can bitrate 125000 restart-ms 2000
# ip link set dev can3 down
# ip link set dev can3 up type can bitrate 125000 restart-ms 2000

But can1 will remain in BUS-OFF forever. Frames cannot be sent from can1
despite the bus actually being OK now and restart-ms being active.
In reality the can1 HW is fine and in error-active at this point, just
the Linux interface is in BUS-OFF as xcan_chip_start() failed in step 6.

>
>> Checked on CAN FD 1.0 on ZynqMP.
>>
>>
>>> Regards,
>>> Kedar.
>>>
>>>>>  	netdev_dbg(ndev, "status:#x%08x\n",
>>>>>  			priv->read_reg(priv, XCAN_SR_OFFSET));
>>>>>
>>>>>
>>>> Xilinx folks: Please take a look at this RFC.
>>>>
>>>> Thanks,
>>>> Michal
>>
>> --
>> Anssi Hannula / Bitwise Oy
>> +358 503803997

-- 
Anssi Hannula / Bitwise Oy
+358 503803997