Re: Can Frame submission and dropped frames

Hello,

On 29.11.2018 at 16:57, kernel@xxxxxxxxxxxxxxxx wrote:
> Hi Wolfgang!
>> On 28.11.2018, at 21:55, Wolfgang Grandegger <wg@xxxxxxxxxxxxxx> wrote:
>>
>> Hello,
>>
>> On 28.11.2018 at 19:39, kernel@xxxxxxxxxxxxxxxx wrote:
>>> Hi!
>>>
>>> While working on the rewrite of the mcp25xxfd driver to get it upstreamed I
>>> have come across a strange observation with regard to dropped frames:
>>>
>>> Essentially I am running a worst-case CAN2.0 bus saturation test where
>>> I receive 1M CAN2.0 frames (standard ID, Len: 0) at a 1 Mbit/s CAN bus rate
>>> in 57s (≈ 17500 frames/s).
>>
>> Do you also get that many interrupts, or even more (for SPI as well)?
> 
> No - for this specific test the system is heavily loaded doing SPI transfers,
> so the MCP2517FD interrupt line (almost) never goes high during the test.
> 
> On top of that, the SPI driver for the bcm283x SoC is heavily optimized (mostly
> for short transfers, to improve performance of the mcp2515) and runs polling IO
> for transfers that only take a short time (<30us), deferring to interrupts only
> for longer ones - mostly to avoid latencies from irq-thread wakeup scheduling.
> 
> As the SPI speed is quite high (in my case 12.5MHz) you can transfer the 22
> bytes (to read the content of a fifo) in 16us - which is acceptable for
> polling (with CS select/deselect this is about 22us).
> 
> So my expectation is that for most practical purposes there is no irq on the 
> SPI side.
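
If I understand the polling heuristic correctly, it is roughly the following
(just a sketch of the idea, not your actual bcm283x SPI driver code; the
function name is made up and the 9-cycles-per-byte figure is taken from your
numbers below):

#include <linux/kernel.h>

#define POLL_LIMIT_US	30	/* below this, busy-polling instead of irq */

/* estimate the on-wire time of a transfer and poll if it is short enough */
static bool spi_transfer_use_polling(unsigned int len_bytes, unsigned int spi_hz)
{
	/* ~9 clock cycles per byte on the wire */
	unsigned int xfer_us = (len_bytes * 9 * 1000000U) / spi_hz;

	return xfer_us < POLL_LIMIT_US;
}

With 22 bytes at 12.5 MHz that gives ~16us, well below the 30us limit.
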
> On the other hand I am seeing interrupts in the order of 39k plus 62k CS per
> second even though there is no load on the system - but none of those 
> interrupts are SPI or mcp2517 related.

You mean "cat /proc/interrupts" reports 100000 interrupts per second,
one every 10 micro-second!?

> Some of the context switching is strange as both spi0 and the irq-mcp25xxfd
> kernel thread are consuming CPU - but I guess that is an issue with SPI I need
> to look into separately.
> 
>>
>>> On a Raspberry Pi3 I can handle this load on the SPI side without any issues
>>> or lost packets (even though the driver is still unoptimized and I decided
>>> to submit those optimizations as separate patches on top of the basic
>>> functionality).
>>>
>>> This means with the following code disabled:
>>> 	skb = alloc_can_skb(net, &frame);
>>> 	if (!skb)
>>> 		return NULL;
>>> 	frame->can_id = id;
>>> 	frame->can_dlc = dlc;
>>> 	memcpy(frame->data, rx->data, len);
>>> 	netif_rx_ni(skb);
>>>
>>> (Counters are updated before this code is executed)
>>>
>>> But when I enable submission of the frames to the network stack, I get lots
>>> of dropped packets, the CPU load increases, and I also see packet loss
>>> on the SPI side due to CPU congestion.
>>>
>>> Here are the stats after 1M packets received without submission to the stack:
>>> root@raspcm3:~# ip -d -s link show  can0
>>> 11: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>>    link/can  promiscuity 0
>>>    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>>> 	  bitrate 1000000 sample-point 0.750
>>> 	  tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>>> 	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>>> 	  dbitrate 1000000 dsample-point 0.750
>>> 	  dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>>> 	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>>> 	  clock 40000000
>>> 	  re-started bus-errors arbit-lost error-warn error-pass bus-off
>>> 	  0          0          0          0          0          0
>>>          numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>>    RX: bytes  packets  errors  dropped overrun mcast
>>>    0          1000000  0       0       0       0
>>>    TX: bytes  packets  errors  dropped carrier collsns
>>>    0          0        0       0       0       0
>>>
>>>
>>> And after a module reload now with packet submission code enabled 
>>> (just a module parameter changed):
>>> root@raspcm3:~# ip -d -s link show  can0
>>> 12: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>>    link/can  promiscuity 0
>>>    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>>> 	  bitrate 1000000 sample-point 0.750
>>> 	  tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>>> 	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>>> 	  dbitrate 1000000 dsample-point 0.750
>>> 	  dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>>> 	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>>> 	  clock 40000000
>>> 	  re-started bus-errors arbit-lost error-warn error-pass bus-off
>>> 	  0          0          0          0          0          0
>>>          numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>>    RX: bytes  packets  errors  dropped overrun mcast
>>>    0          1000000  0       945334  0       0
>>>    TX: bytes  packets  errors  dropped carrier collsns
>>>    0          0        0       0       0       0
>>
>> What CPU load do you see, e.g. with "htop -d50"?
> 
> htop -d 50 shows:
> * with submission:
>   * 1 core 100%
>   * 1 core 25%
> * without submission:
>   * same measurements  
> 
> this sounds strange...

Then it's no surprise that most of the packets are dropped.

>> Did you try to increase the priority of the service thread (with chrt)?
> no - I am getting 100% already.
> 
> 
>>
>> The symptoms are similar to what we get with bus-errors starving the CPU
>> on some low-end systems (where we cannot disable the creation of
>> bus-errors, e.g. if no cable is connected).
> 
> I can not comment on that...


It's a different issue then.

>>
>>> A more realistic scenario would be with DLC=8; it looks like this
>>> (this run took 122.3s):
>>> root@raspcm3:~# ip -d -s link show  can0
>>> 13: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>>    link/can  promiscuity 0
>>>    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>>> 	  bitrate 1000000 sample-point 0.750
>>> 	  tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>>> 	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>>> 	  dbitrate 1000000 dsample-point 0.750
>>> 	  dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>>> 	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>>> 	  clock 40000000
>>> 	  re-started bus-errors arbit-lost error-warn error-pass bus-off
>>> 	  0          0          0          0          0          0
>>>          numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>>    RX: bytes  packets  errors  dropped overrun mcast
>>>    8000000    1000000  0       0       0       0
>>>    TX: bytes  packets  errors  dropped carrier collsns
>>>    0          0        0       0       0       0
>>
>> This means the problem does not show up with a payload of 8 bytes!?
> For all practical purposes I have not seen any drops in this case.
> But in this case I see interrupts on the int-mcp251xfd side (498k for 1M
> CAN frames).

498000 interrupts for 1M frames, really? Gosh!

>>
>>> So I am wondering: is there already a good idea for how this worst-case
>>> issue can be avoided in the first place?
>>>
>>> What I could come up with is the idea of:
>>> * queuing packets in a ring buffer of a certain size
>>> * having a separate submission thread that pushes messages onto the
>>>  network stack (essentially the short code above)
>>>
>>> The idea is that this thread would (hopefully) get scheduled on a different
>>> core so that the CPU resources would be better used.
>>
>> Well, I'm not sure that makes it better. You have to send the queued
>> packets to the network stack in order. Either you queue all messages or
>> just the small ones until a bigger one arrives. Anyway, this optimization
>> is very specific.
> 
> I agree - that was why I was asking…
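
Just to make the idea concrete, here is a minimal (untested) sketch of such a
deferred submission path: a single-producer/single-consumer kfifo of skb
pointers, filled from the interrupt handler thread and drained by a dedicated
kthread. All names (rx_fifo, rx_wq, mcp25xxfd_rx_submit_thread, ...) are made
up for illustration, this is not from your driver:

#include <linux/kfifo.h>
#include <linux/kthread.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/wait.h>

static DEFINE_KFIFO(rx_fifo, struct sk_buff *, 256);	/* size: power of 2 */
static DECLARE_WAIT_QUEUE_HEAD(rx_wq);
static struct task_struct *rx_thread;

/* producer: called from the interrupt handler thread instead of netif_rx_ni() */
static void mcp25xxfd_queue_rx_skb(struct net_device *net, struct sk_buff *skb)
{
	if (!kfifo_put(&rx_fifo, skb)) {	/* fifo full -> count and drop */
		net->stats.rx_dropped++;
		kfree_skb(skb);
		return;
	}
	wake_up(&rx_wq);
}

/* consumer: pushes the queued skbs into the network stack, in order */
static int mcp25xxfd_rx_submit_thread(void *arg)
{
	struct sk_buff *skb;

	while (!kthread_should_stop()) {
		wait_event_interruptible(rx_wq,
					 !kfifo_is_empty(&rx_fifo) ||
					 kthread_should_stop());
		while (kfifo_get(&rx_fifo, &skb))
			netif_rx_ni(skb);
	}
	return 0;
}

/* started e.g. in ndo_open():
 *	rx_thread = kthread_run(mcp25xxfd_rx_submit_thread, NULL, "mcp25xxfd/rx");
 * and stopped again with kthread_stop(rx_thread) in ndo_stop()
 */

Switching dynamically between calling netif_rx_ni() directly and going through
the fifo would then need some care to keep delivery in order, as I wrote above.
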
> 
>>
>>> The logic to switch from inline to deferred queuing could be applied
>>> dynamically based on traffic (i.e. if more than one FIFO is filled on the
>>> controller, or there is already something in the queue, then defer
>>> submission to that separate thread).
>>>
>>> Obviously this leads to delays in submission, but at least for
>>> medium-length bursts of messages no message gets lost, dropped or ...
>>>
>>> Is this something the driver should address (as a separate patch)?
>>
>> Imagine you have 2 or even 4 CAN controllers on the board. Or a system
>> with just one CPU.
> 
> This is very specific to the high-load test.
> 
> I doubt that a second controller could be used under these circumstances.

Me too!

> The reason is that, basically, the statistics for the above
> performance test show:
> * the SPI clock is asserted (high) for about 25% of the time - hence the
>   clock is running for about 50% of the time; effectively there are
>   9 clock cycles per byte where the last cycle stays low,
>   so the actual figure is about 12% higher...
> * SPI CS is asserted (low) for about 75% of the time.
> * on top of that there is the SPI framework overhead, which is not easily
>   measurable, but the time between CS-deassert and CS-assert for
>   2 back-to-back spi_sync calls is 4-6us (with minimal error
>   handling) - this is the penalty of the SPI framework.
>   Taking this into account, the 75% rises to 99.7% SPI utilization
>   (= 75% + (5us * 49400)/1s * 100%, at 49.4k CS-asserts/s).
> 
> This does not leave much space for more SPI traffic to handle high load.
> 
> At a 500 kbit/s CAN bus rate it should be possible - or at higher
> SPI clock rates (I am using 12.5 MHz so that I still have a chance to
> capture the SPI traffic with my logic analyzer at a sufficiently
> high sample rate of 50 MHz).
> 
> So there is "space" there as well when interfacing at 20 MHz.
> 
> A second controller may be possible for all but these torture tests.
> CAN FD with an 8 Mbit/s data rate and DLC=9 (12 bytes) may be another such
> situation where this may break - the driver reads 8 data bytes
> together with the header and fetches everything beyond that with a
> separate spi_sync call. (The driver may become more dynamic based on
> previous packets in the future, but mis-predicting DLC=15 while the
> effective DLC is 8 is bad, as it increases the number of bytes to be
> transferred over the SPI bus by a factor of 3.5 (22 vs 78 bytes)!)
> 
> But that will be another patch.

Well, SPI is a pain (for software).
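
FWIW, with the generic SPI API the split read you describe would look roughly
like the sketch below - just to illustrate the trade-off. The command length,
the rx-object header size and the function name are my assumptions for the
example, not necessarily the real MCP2517FD layout or your driver code:

#include <linux/spi/spi.h>

#define CMD_LEN		2	/* assumed instruction/address bytes */
#define OBJ_HDR_LEN	12	/* assumed rx-object header size */

/*
 * The first spi_sync transfer fetches the rx-object header plus the first
 * 8 data bytes (CMD_LEN + OBJ_HDR_LEN + 8 = 22 bytes with the sizes assumed
 * above); a second transfer, with its own read command addressed at the
 * remaining payload, is only issued for longer (CAN FD) frames.
 */
static int mcp25xxfd_read_rx_object(struct spi_device *spi,
				    u8 *cmd1, u8 *rx1,
				    u8 *cmd2, u8 *rx2,
				    unsigned int payload_len)
{
	struct spi_transfer xfer = {
		.tx_buf = cmd1,
		.rx_buf = rx1,
		.len	= CMD_LEN + OBJ_HDR_LEN + 8,
	};
	int ret;

	ret = spi_sync_transfer(spi, &xfer, 1);
	if (ret || payload_len <= 8)
		return ret;

	/* second command/transfer for the data bytes beyond the first 8 */
	xfer.tx_buf = cmd2;
	xfer.rx_buf = rx2;
	xfer.len = CMD_LEN + payload_len - 8;

	return spi_sync_transfer(spi, &xfer, 1);
}

The second call of course costs another CS cycle plus the 4-6us framework
overhead you measured above, which is why mis-predicting the payload size
hurts either way.
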

>>> Or should there be something in the can framework/stack that could
>>> handle such situations better?
>>
>> That would be ideal, but I'm not sure it's feasible. The high interrupt
>> load alone is already a problem, and likely the SPI load as well. And I
>> think we also suffer from the overhead of the networking stack here.
>>> Or should I just ignore those "dropped" packets, as this is really
>>> a worst-case scenario?
>>
>> If we can do it better with reasonable effort and in a portable way, we
>> should address it, of course. Would be interesting to know how other CAN
>> controllers behave on that hardware.
> 
> In the future maybe - I just wanted this fact to be on the radar/known...

Keep it simple! I'm not sure this worst-case scenario is a realistic
one, as all the other nodes on the CAN bus will moan as well. Do you have a
chance to run such a test with a FlexCAN or SJA1000?

Wolfgang.


