Re: Can Frame submission and dropped frames

Hi Wolfgang!
> On 28.11.2018, at 21:55, Wolfgang Grandegger <wg@xxxxxxxxxxxxxx> wrote:
> 
> Hello,
> 
> Am 28.11.2018 um 19:39 schrieb kernel@xxxxxxxxxxxxxxxx:
>> Hi!
>> 
>> While working on the rewrite of the mcp25xxfd driver to get it upstreamed, I
>> have come across a strange observation with regard to dropped frames:
>> 
>> Essentially I am running a worst-case CAN2.0 bus saturation test where
>> I receive 1M CAN2.0 frames (standard ID, Len: 0) at 1MHz CAN-bus-rate
>> in 57s (= 17000 frames/s).
> 
> Do you also get that many interrupts, or even more (for SPI as well)?

No - for this specific test the system is heavily loaded doing SPI transfers,
so the MCP2517FD interrupt line (almost) never goes high during the test.

On top of that, the SPI driver on the bcm283x SoC is heavily optimized (mostly
for short transfers, to improve performance of the mcp2515) and it will run
polling IO for transfers that take a short time (<30us), only deferring to
interrupts for longer ones - this is mostly to avoid latencies from irq-thread
wakeup scheduling.

As the SPI speed is quite high (12.5MHz in my case), the 22 bytes needed to
read the contents of a FIFO can be transferred in about 16us - which is
acceptable for polling (with CS select/deselect this is about 22us).
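
For reference, the rough arithmetic behind that estimate (with effectively
9 SPI clock cycles per byte, see the statistics further down):

	22 bytes * 9 cycles/byte / 12.5MHz ~= 16us   (pure clocking)
	+ CS assert/deassert overhead      ~=  6us
	                                   ---------
	                                   ~= 22us   < 30us polling limit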

So my expectation is that for most practical purposes there is no irq on the
SPI side.
On the other hand I am seeing on the order of 39k interrupts plus 62k context
switches per second even when there is no load on the system - but none of
those interrupts are SPI- or mcp2517-related.

Some of the context switching is strange as both spi0 and the irq-mcp25xxfd
kernel thread are consuming CPU - but I guess that is an issue with SPI I need
to look into separately.

> 
>> On a Raspberry Pi3 I can handle this load from the SPI side without any issues
>> or lost packets (even though the driver is still unoptimized and I made the
>> decision to have those optimizations submitted as separate patches on top of 
>> basic functionality).
>> 
>> This means with the following code disabled:
>> 	skb = alloc_can_skb(net, &frame);
>> 	if (!skb)
>> 		return NULL;
>> 	frame->can_id = id;
>> 	frame->can_dlc = dlc;
>> 	memcpy(frame->data, rx->data, len);
>> 	netif_rx_ni(skb);
>> 
>> (Counters are updated before this code is executed)
>> 
>> But when I enable submission of the frames to the network stack, I get lots
>> of dropped packets, the CPU load increases, and I also see packet loss
>> on the SPI side due to CPU congestion.
>> 
>> Here stats after 1M packets received without submission to the stack:
>> root@raspcm3:~# ip -d -s link show  can0
>> 11: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>    link/can  promiscuity 0
>>    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>> 	  bitrate 1000000 sample-point 0.750
>> 	  tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>> 	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>> 	  dbitrate 1000000 dsample-point 0.750
>> 	  dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>> 	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>> 	  clock 40000000
>> 	  re-started bus-errors arbit-lost error-warn error-pass bus-off
>> 	  0          0          0          0          0          0
>>          numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>    RX: bytes  packets  errors  dropped overrun mcast
>>    0          1000000  0       0       0       0
>>    TX: bytes  packets  errors  dropped carrier collsns
>>    0          0        0       0       0       0
>> 
>> 
>> And after a module reload now with packet submission code enabled 
>> (just a module parameter changed):
>> root@raspcm3:~# ip -d -s link show  can0
>> 12: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>    link/can  promiscuity 0
>>    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>> 	  bitrate 1000000 sample-point 0.750
>> 	  tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>> 	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>> 	  dbitrate 1000000 dsample-point 0.750
>> 	  dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>> 	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>> 	  clock 40000000
>> 	  re-started bus-errors arbit-lost error-warn error-pass bus-off
>> 	  0          0          0          0          0          0
>>          numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>    RX: bytes  packets  errors  dropped overrun mcast
>>    0          1000000  0       945334  0       0
>>    TX: bytes  packets  errors  dropped carrier collsns
>>    0          0        0       0       0       0
> 
> What CPU load do you see, e.g. with "htop -d50"?

htop -d 50 shows:
* with submission:
  * 1 core 100%
  * 1 core 25%
* without submission:
  * same measurements

This sounds strange...
> Did you try to increase the priority of the service thread (with chrt)?
No - I am already getting 100% on one core.


> 
> The symptoms are similar to what we get with bus-errors starving the CPU
> on some low-end systems (where we cannot disable the creation of
> bus-errors, e.g. if no cable is connected).

I can not comment on that...

> 
>> A more realistic scenario would be with DLC=8, and it looks like this:
>> (this took 122.3s):
>> root@raspcm3:~# ip -d -s link show  can0
>> 13: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>    link/can  promiscuity 0
>>    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>> 	  bitrate 1000000 sample-point 0.750
>> 	  tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>> 	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>> 	  dbitrate 1000000 dsample-point 0.750
>> 	  dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>> 	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>> 	  clock 40000000
>> 	  re-started bus-errors arbit-lost error-warn error-pass bus-off
>> 	  0          0          0          0          0          0
>>          numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>    RX: bytes  packets  errors  dropped overrun mcast
>>    8000000    1000000  0       0       0       0
>>    TX: bytes  packets  errors  dropped carrier collsns
>>    0          0        0       0       0       0
> 
> This means the problem does not show up with a payload of 8 bytes!?
For all practical purposes I have not seen any drops in this case.
But here I do see interrupts on the int-mcp251xfd side (498k for 1M
CAN frames).

> 
>> So I am wondering: is there already a good idea how this worst-case
>> issue can be avoided in the first place?
>> 
>> What I could come up with is the idea of:
>> * queuing packets in a ring buffer of a certain size
>> * having a separate submission thread that pushes messages onto the
>>  network stack (essentially the short code above)
>> 
>> The idea is that this thread would (hopefully) get scheduled on a different
>> core so that the CPU resources would get better used.
> 
> Well, not sure if that makes it better. You have to send the queued
> packets in order to the network stack. Either you queue all messages or
> just the small ones till a bigger one arrives. Anyway, this optimization
> is very special.

I agree - that was why I was asking…
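
For reference, roughly what I had in mind - a rough and untested sketch
only, all names made up:

#include <linux/err.h>
#include <linux/kfifo.h>
#include <linux/kthread.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/wait.h>

#define RX_RING_SIZE 256	/* must be a power of two */

struct rx_submit {
	DECLARE_KFIFO(ring, struct sk_buff *, RX_RING_SIZE);
	wait_queue_head_t wait;
	struct task_struct *task;
};

/* called from the irq-thread right after a frame has been read via SPI */
static void rx_submit_queue(struct rx_submit *s, struct sk_buff *skb)
{
	/* single producer/single consumer, so the kfifo needs no locking */
	if (!kfifo_put(&s->ring, skb)) {
		/* ring full - drop here rather than stalling the SPI reads */
		kfree_skb(skb);
		return;
	}
	wake_up(&s->wait);
}

/* submission thread - hopefully scheduled on a different core */
static int rx_submit_thread(void *data)
{
	struct rx_submit *s = data;
	struct sk_buff *skb;

	while (!kthread_should_stop()) {
		wait_event_interruptible(s->wait,
					 !kfifo_is_empty(&s->ring) ||
					 kthread_should_stop());
		while (kfifo_get(&s->ring, &skb))
			netif_rx_ni(skb);
	}

	return 0;
}

static int rx_submit_init(struct rx_submit *s)
{
	INIT_KFIFO(s->ring);
	init_waitqueue_head(&s->wait);
	s->task = kthread_run(rx_submit_thread, s, "mcp25xxfd-rx");

	return PTR_ERR_OR_ZERO(s->task);
}

As everything funnels through one fifo and one thread, the frames would
still reach the network stack in order; a drop would just move from the
networking stack into the driver when the ring overflows.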

> 
>> Logic to switch from inline to deferred queuing could be made dynamically
>> based on traffic (i.e: if there is more than one FIFO filled on the
>> controller or there is already something in the queue then defer 
>> submission to that separate thread)
>> 
>> Obviously this leads to delays in submission but at least for medium 
>> length bursts of messages no message is getting lost, dropped or ...
>> 
>> Is this something the driver should address (as a separate patch)?
> 
> Imagine you have 2 or even 4 CAN controllers on the board. Or a system
> with just one CPU.

This is very specific to the high-load test.

I doubt that a second controller could be used under these circumstances.

The reason is that the statistics for the above performance test show:
* the SPI clock is asserted (high) for about 25% of the time - hence the
  clock is running for about 50% of the time; with (effectively) 9 cycles
  per byte, where the last cycle is held low, the actual figure is about
  12% higher...
* SPI CS is asserted (low) for about 75% of the time.
* on top of that there is the SPI framework overhead, which is not easily
  measurable, but the time between CS-deassert and CS-assert for
  2 back-to-back spi_sync calls is 4-6us (with minimal error
  handling) - this is the penalty of the SPI framework.
  Taking this into account, the 75% rises to 99.7% SPI utilization
  (= 75% + (5us * 49400)/1s * 100% = 75% + 24.7%) at 49.4k CS-asserts/s.

This does not leave much space for more SPI traffic to handle high load.

At a 500kHz CAN-bus rate it should be possible - or at higher
SPI clock rates (I am using 12.5MHz to have a chance to capture the
SPI traffic with my logic analyzer at a sufficiently high sample
rate of 50MHz).

So there is “space” there as well when interfacing at 20MHz.

A second controller may be possible for all but this torture test.
CAN FD with an 8MHz data rate and DLC=9 (12 bytes) may be another
situation where this may break - the driver reads 8 data bytes together
with the header and everything above that in a separate spi_sync call.
(The driver may become more dynamic based on previous packets in the
future, but a mis-prediction of DLC=15 while the effective DLC is 8
is bad, as it increases the number of bytes to be transferred over
the SPI bus by a factor of 3.5 (22 vs 78 bytes)!)
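
(The byte accounting behind the 22 vs 78 bytes, assuming the 2-byte SPI
command/address phase plus a 12-byte RX object header including the
timestamp:

	read up to DLC=8:         2 + 12 +  8 = 22 bytes
	speculative DLC=15 read:  2 + 12 + 64 = 78 bytes

i.e. 78 / 22 ~= 3.5.)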

But that will be another patch.

> 
>> Or should there be something in the can framework/stack that could
>> handle such situations better?
> 
> That would be ideal, but I'm not sure if it's feasible. Already the high
> interrupt load is a problem. And SPI likely as well. And I think we
> suffer here from the overhead of the networking stack as well.
>> Or should I just ignore those “dropped” packets, as this is really
>> a worst-case scenario?
> 
> If we can do it better with reasonable effort and in a portable way, we
> should address it, of course. Would be interesting to know how other CAN
> controllers behave on that hardware.

In the future maybe - I just wanted this fact to be on the radar/known...

Martin


