Re: [PATCH] can: mcp251xfd: Always stop on BUS_OFF and call netif_stop_queue

Mike Looijmans <mike.looijmans@xxxxxxxx> · Wed, 12 Jul 2023 07:43:05 +0200

Met vriendelijke groet / kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products B.V.
Materiaalweg 4, 5681 RJ Best
The Netherlands

T: +31 (0) 499 33 69 69
E: mike.looijmans@xxxxxxxxxxxxxxxxx
W: www.topic.nl

Please consider the environment before printing this e-mail
On 11-07-2023 19:05, Marc Kleine-Budde wrote:
On 11.07.2023 17:26:47, Mike Looijmans wrote:
When there's an error attempting to store the BER counter, don't abort
but continue shutting down the chip as required.

If you refer to an error by __mcp251xfd_get_berr_counter(), it's not a
store error, but a failure of regmap_read(priv->map_reg,
MCP251XFD_REG_TREC, &trec). By default the SPI transfers are CRC enabled
and no/wrong data from the chip will be detected and return an error
here (after 3 retires). In out of memory conditions or other kernel
errors might be possible here, too.

Have you seen a problem here? But as we shut down the chip here anyways,
we can ignore the error here.

Excactly my point, the important task here is to report the bus off state.

I haven't seen the error trigger, so it wasn't the cause of our problems.

In retrospect, maybe I should have split this in two patches?

After disabling communications, also stop the TX queue with a call to
netif_stop_queue.

can_bus_off() will call netif_carrier_off(), isn't this sufficient? Have
you enabled automatic restart in case of bus off? I think the netdev
watchdog will kick you, if the interface has a stooped queue for too
long (IIRC 5 seconds).

Apparently not. The problem also occurred in the field, where the CAN 
communication got stuck for something like a full day.

Automatic restart was enabled, with a 100ms timeout. Setting a shorter timeout 
like 50 ms seemed to make the problem more likely to occur

What I also noticed, but didn't accurately measure, is that the time it took 
to recover used to be somewhat random, it could take up to about a second for 
packets to start going out again, as observed by watching LEDs. After adding 
netif_stop_queue the recovery was almost instantaneous.

When the interface restarts, mcp251xfd_set_mode will
call netif_wake_queue and resume.

This fixes a hangup in either send() or poll() from userspace. To
reproduce: I ran "cansequence can0 -p" to flood the system with packets.
While running that, I shorted the CAN signals, causing a bus error.
Usually communications would resume after the CAN bus restarted, but
sometimes the system got stuck and refused to send any more packets.
The sending process would be in either poll() or send(), waiting for
the queue to resume. To "unstuck" the process from send() it was
sufficient to send any packet on the can interface from another
process. To get it out of the poll() hang, only bringing the can
interface down (and up) would work.

After adding the netif_stop_queue call, I was unable to reproduce the
problem.

The newly added netif_stop_queue() will cause the netif_wake_queue() in
the mcp251xfd_set_mode() to actually wake the queue. If you observe a
problem, I think it's a general problem, so all drivers would be
effected.

After two days digging, my conclusion was that "something" wasn't getting a 
proper "kick" when the CAN interface came back online.

One could argue the netif_stop_queue should be put into can_bus_off. But the 
netif_wake_queue call is in the driver, so that would break symmetry.

Signed-off-by: Mike Looijmans <mike.looijmans@xxxxxxxx>

Marc