Can Frame submission and dropped frames

Hi!

While working on the rewrite of the mcp25xxfd driver to get upstreamed I have
come across a strange observation with regards to dropped frames:

Essentially I am running a worst-case CAN2.0 bus-saturation test where
I receive 1M CAN2.0 frames (standard ID, DLC: 0) at a 1 Mbit/s CAN bus rate
in 57s (≈17,500 frames/s).

On a Raspberry Pi3 I can handle this load from the SPI side without any issues
or lost packages (even though the driver is still unoptimized and I made the
decision to have those optimizations submitted as separate patches on top of 
basic functionality).

This means with the following code disabled:
	skb = alloc_can_skb(net, &frame);	/* allocate skb + struct can_frame */
	if (!skb)
		return NULL;
	frame->can_id = id;
	frame->can_dlc = dlc;
	memcpy(frame->data, rx->data, len);
	netif_rx_ni(skb);	/* hand the frame to the network stack */

(Counters are updated before this code is executed)

But when I enable submission of the frames to the network stack I see lots
of dropped packets and increased CPU load, and I also see packet loss
on the SPI side due to CPU congestion.

Here are the stats after 1M packets received without submission to the stack:
root@raspcm3:~# ip -d -s link show  can0
11: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
    link/can  promiscuity 0
    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
	  bitrate 1000000 sample-point 0.750
	  tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
	  dbitrate 1000000 dsample-point 0.750
	  dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
	  clock 40000000
	  re-started bus-errors arbit-lost error-warn error-pass bus-off
	  0          0          0          0          0          0
          numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    0          1000000  0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0


And here are the stats after a module reload, now with the packet-submission
code enabled (only a module parameter changed):
root@raspcm3:~# ip -d -s link show  can0
12: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
    link/can  promiscuity 0
    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
	  bitrate 1000000 sample-point 0.750
	  tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
	  dbitrate 1000000 dsample-point 0.750
	  dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
	  clock 40000000
	  re-started bus-errors arbit-lost error-warn error-pass bus-off
	  0          0          0          0          0          0
          numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    0          1000000  0       945334  0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0

A more realistic scenario would be DLC=8, where each frame occupies more bus
time, so the frame rate drops and the stack keeps up. It looks like this
(this run took 122.3s):
root@raspcm3:~# ip -d -s link show  can0
13: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
    link/can  promiscuity 0
    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
	  bitrate 1000000 sample-point 0.750
	  tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
	  dbitrate 1000000 dsample-point 0.750
	  dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
	  clock 40000000
	  re-started bus-errors arbit-lost error-warn error-pass bus-off
	  0          0          0          0          0          0
          numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    8000000    1000000  0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0

So I am wondering: is there already a good idea for how this worst-case
issue can be avoided in the first place?

What I could come up with is the idea of:
* queuing packets in a ring buffer of a certain size
* having a separate submission thread that pushes messages onto the
  network stack (essentially the short code above)

The idea is that this thread would (hopefully) get scheduled on a different
core so that the CPU resources would be utilized better.

The logic to switch from inline to deferred queuing could be made dynamic
based on traffic (i.e.: if more than one FIFO is filled on the
controller, or there is already something in the queue, then defer
submission to that separate thread).

Obviously this leads to delays in submission, but at least for medium-length
bursts of messages no message gets lost or dropped.

Is this something the driver should address (as a separate patch)?

Or should there be something in the can framework/stack that could
handle such situations better?

Or should I just ignore those “dropped” packets, as this is really
a worst-case scenario?

Thanks,
	Martin

P.S.: Note that I am running in mixed CAN FD mode, but CAN 2.0 messages get
submitted as CAN frames, not CAN FD frames, when the CANFD flag is not set
for the message by the controller.


