> On 30.11.2018, at 15:20, Wolfgang Grandegger <wg@xxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> Am 29.11.2018 um 16:57 schrieb kernel@xxxxxxxxxxxxxxxx:
>> Hi Wolfgang!
>>> On 28.11.2018, at 21:55, Wolfgang Grandegger <wg@xxxxxxxxxxxxxx> wrote:
>>>
>>> Hello,
>>>
>>> Am 28.11.2018 um 19:39 schrieb kernel@xxxxxxxxxxxxxxxx:
>>>> Hi!
>>>>
>>>> While working on the rewrite of the mcp25xxfd driver for upstreaming I
>>>> have come across a strange observation with regard to dropped frames:
>>>>
>>>> Essentially I am running a worst-case CAN2.0 bus-saturation test where
>>>> I receive 1M CAN2.0 frames (standard ID, len 0) at a 1MHz CAN bus rate
>>>> in 57s (= 17000 frames/s).
>>>
>>> Do you also get that many interrupts, or even more (for SPI as well)?
>>
>> No - for this specific test the system is heavily loaded doing SPI
>> transfers, so the MCP2517FD interrupt line (almost) never goes high
>> during the test.
>>
>> On top of that, the SPI driver on the bcm283x SOC is heavily optimized
>> (mostly for short transfers, to improve performance with the mcp2515)
>> and it will run polled IO for transfers that complete quickly (<30us)
>> and only defer to interrupts beyond that - mostly to avoid the latencies
>> of irq-thread wakeup scheduling.
>>
>> As the SPI speed is quite high (in my case 12.5MHz) you can transfer the
>> 22 bytes needed to read the content of a FIFO in 16us - which is
>> acceptable for polling (with CS select/deselect this is about 22us).
>>
>> So my expectation is that for most practical purposes there is no irq on
>> the SPI side.
>> On the other hand I am seeing on the order of 39k interrupts plus 62k
>> context switches per second even though there is no load on the system -
>> but none of those interrupts are SPI or mcp2517 related.
>
> You mean "cat /proc/interrupts" reports 100000 interrupts per second,
> one every 10 microseconds!?

These were mostly rescheduling interrupts to other cores to wake the spi
worker thread, which was started there and then typically stopped again
immediately.

>> Some of the context switching is strange, as both the spi0 and the
>> irq-mcp25xxfd kernel threads are consuming CPU - but I guess that is an
>> issue with SPI I need to look into separately.

This has been resolved by a simple patch that I have just submitted to the
spi list, which avoids this wakeup call in the "inline" spi_async case. Now
those interrupts are gone and spi0 is not consuming any CPU at all.

>>>
>>>> On a Raspberry Pi3 I can handle this load on the SPI side without any
>>>> issues or lost packets (even though the driver is still unoptimized and
>>>> I made the decision to submit those optimizations as separate patches
>>>> on top of the basic functionality).
>>>>
>>>> This means with the following code disabled:
>>>>
>>>> 	skb = alloc_can_skb(net, &frame);
>>>> 	if (!skb)
>>>> 		return NULL;
>>>> 	frame->can_id = id;
>>>> 	frame->can_dlc = dlc;
>>>> 	memcpy(frame->data, rx->data, len);
>>>> 	netif_rx_ni(skb);
>>>>
>>>> (counters are updated before this code is executed)
>>>>
>>>> But when I enable submission of the frames to the network stack I get
>>>> lots of dropped packets, the CPU load increases, and I also see packet
>>>> loss on the SPI side due to CPU congestion.
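
(To put that snippet into context, here is a minimal sketch of the complete
submission path as I run it - mcp25xxfd_submit_frame() and the fields of
struct mcp25xxfd_rx_obj are hypothetical placeholder names, not the actual
driver code:)

#include <linux/can/dev.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* hypothetical representation of one rx object read from the chip */
struct mcp25xxfd_rx_obj {
	canid_t id;
	u8 dlc;
	u8 len;
	u8 data[8];
};

static int mcp25xxfd_submit_frame(struct net_device *net,
				  const struct mcp25xxfd_rx_obj *rx)
{
	struct can_frame *frame;
	struct sk_buff *skb;

	/* counters are updated by the caller before this runs */
	skb = alloc_can_skb(net, &frame);
	if (!skb)
		return -ENOMEM;

	frame->can_id = rx->id;
	frame->can_dlc = rx->dlc;
	memcpy(frame->data, rx->data, rx->len);

	/*
	 * netif_rx_ni() only enqueues to the per-CPU backlog; if that
	 * backlog (limited by netdev_max_backlog) overflows because the
	 * CPU cannot keep up, the frame is accounted as "dropped" in the
	 * interface statistics.
	 */
	netif_rx_ni(skb);

	return 0;
}

That the rx packet counter below still shows the full 1M while 945334 frames
are accounted as dropped fits this picture: the drops happen in the backlog
behind netif_rx_ni(), not in the driver.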
>>>>
>>>> Here are the stats after 1M packets received without submission to the
>>>> stack:
>>>>
>>>> root@raspcm3:~# ip -d -s link show can0
>>>> 11: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>>>     link/can promiscuity 0
>>>>     can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>>>>           bitrate 1000000 sample-point 0.750
>>>>           tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>>>>           mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>>>>           dbitrate 1000000 dsample-point 0.750
>>>>           dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>>>>           mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>>>>           clock 40000000
>>>>           re-started bus-errors arbit-lost error-warn error-pass bus-off
>>>>           0          0          0          0          0          0
>>>>     numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>>>     RX: bytes  packets  errors  dropped  overrun  mcast
>>>>     0          1000000  0       0        0        0
>>>>     TX: bytes  packets  errors  dropped  carrier  collsns
>>>>     0          0        0       0        0        0
>>>>
>>>> And after a module reload, now with the packet submission code enabled
>>>> (just a module parameter changed):
>>>>
>>>> root@raspcm3:~# ip -d -s link show can0
>>>> 12: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>>>     link/can promiscuity 0
>>>>     can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>>>>           bitrate 1000000 sample-point 0.750
>>>>           tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>>>>           mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>>>>           dbitrate 1000000 dsample-point 0.750
>>>>           dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>>>>           mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>>>>           clock 40000000
>>>>           re-started bus-errors arbit-lost error-warn error-pass bus-off
>>>>           0          0          0          0          0          0
>>>>     numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>>>     RX: bytes  packets  errors  dropped  overrun  mcast
>>>>     0          1000000  0       945334   0        0
>>>>     TX: bytes  packets  errors  dropped  carrier  collsns
>>>>     0          0        0       0        0        0
>>>
>>> What CPU load do you see, e.g. with "htop -d50"?
>>
>> htop -d 50 shows:
>> * with submission:
>>   * 1 core 100%
>>   * 1 core 25%
>> * without submission:
>>   * same measurements
>>
>> this sounds strange...
>
> Then it's no surprise that most of the packets are dropped.

Well, that was one of the reasons why I was thinking of deferring message
submission to a separate thread running on a different core, where the
network stack would have more time to process the messages - the network
backlog queues are per CPU/core if I understand correctly.

>>>
>>> The symptoms are similar to what we get with bus-errors starving the CPU
>>> on some low-end systems (where we cannot disable the creation of
>>> bus-errors, e.g. if no cable is connected).
>>
>> I cannot comment on that...
>
> It's a different issue then.

It is, but I have heard a comment that the old version of the driver shows
high CPU load when not connected to a CAN bus (bus-off, or no peer to ACK
the frames).
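
(For reference: a common driver-side mitigation for that case is to forward
bus-error frames to the stack only when the user explicitly enabled
berr-reporting - it reduces the stack load, though not the interrupt load.
A minimal sketch, with hypothetical names (struct mcp25xxfd_priv, priv->net)
for the driver context:)

#include <linux/can/dev.h>
#include <linux/can/error.h>

/* hypothetical private driver structure */
struct mcp25xxfd_priv {
	struct can_priv can;	/* must be the first member */
	struct net_device *net;
};

static void mcp25xxfd_handle_buserr(struct mcp25xxfd_priv *priv)
{
	struct can_frame *cf;
	struct sk_buff *skb;

	priv->can.can_stats.bus_error++;

	/* only bother the network stack if berr-reporting was requested */
	if (!(priv->can.ctrlmode & CAN_CTRLMODE_BERR_REPORTING))
		return;

	skb = alloc_can_err_skb(priv->net, &cf);
	if (!skb)
		return;

	cf->can_id |= CAN_ERR_BUSERROR;
	netif_rx_ni(skb);
}

That does not help with the saturation test here, of course - it is only
relevant for the unconnected-bus case mentioned above.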
>
>>>
>>>> A more realistic scenario would be with DLC=8, and looks like this
>>>> (this took 122.3s):
>>>>
>>>> root@raspcm3:~# ip -d -s link show can0
>>>> 13: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>>>     link/can promiscuity 0
>>>>     can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>>>>           bitrate 1000000 sample-point 0.750
>>>>           tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>>>>           mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>>>>           dbitrate 1000000 dsample-point 0.750
>>>>           dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>>>>           mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>>>>           clock 40000000
>>>>           re-started bus-errors arbit-lost error-warn error-pass bus-off
>>>>           0          0          0          0          0          0
>>>>     numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>>>     RX: bytes    packets  errors  dropped  overrun  mcast
>>>>     8000000      1000000  0       0        0        0
>>>>     TX: bytes    packets  errors  dropped  carrier  collsns
>>>>     0            0        0       0        0        0
>>>
>>> This means the problem does not show up with a payload of 8 bytes!?
>>
>> For all practical purposes I have not seen any drops in this case.
>> But in this case I see interrupts on the int-mcp251xfd side (498k for 1M
>> CAN frames).
>
> 498000 interrupts per second, really? Gosh!
>
>>>
>>>> So I am wondering: is there already a good idea how this worst-case
>>>> issue can be avoided in the first place?
>>>>
>>>> What I could come up with is the idea of:
>>>> * queuing packets in a ring buffer of a certain size
>>>> * having a separate submission thread that pushes messages onto the
>>>>   network stack (essentially the short code above)
>>>>
>>>> The idea is that this thread would (hopefully) get scheduled on a
>>>> different core so that the CPU resources would be better used.
>>>
>>> Well, I'm not sure if that makes it better. You have to send the queued
>>> packets in order to the network stack. Either you queue all messages or
>>> just the small ones till a bigger one arrives. Anyway, this optimization
>>> is very special.
>>
>> I agree - that was why I was asking…
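
(To make the ring-buffer idea more concrete, here is a rough sketch of what
I have in mind - reusing the hypothetical mcp25xxfd_submit_frame() and
struct mcp25xxfd_rx_obj from the sketch further up; the submission thread
would be created with kthread_run() in ndo_open:)

#include <linux/kfifo.h>
#include <linux/kthread.h>
#include <linux/wait.h>

static DEFINE_KFIFO(rx_fifo, struct mcp25xxfd_rx_obj, 256);
static DECLARE_WAIT_QUEUE_HEAD(rx_wait);

/* producer side - called right after an rx object was read via SPI */
static void mcp25xxfd_queue_rx(struct net_device *net,
			       const struct mcp25xxfd_rx_obj *rx)
{
	if (!kfifo_put(&rx_fifo, *rx)) {
		/* ring full - account the drop instead of blocking SPI */
		net->stats.rx_dropped++;
		return;
	}
	wake_up(&rx_wait);
}

/* consumer side - the deferred submission thread */
static int mcp25xxfd_rx_thread(void *arg)
{
	struct net_device *net = arg;
	struct mcp25xxfd_rx_obj rx;

	while (!kthread_should_stop()) {
		wait_event_interruptible(rx_wait,
					 !kfifo_is_empty(&rx_fifo) ||
					 kthread_should_stop());
		while (kfifo_get(&rx_fifo, &rx))
			mcp25xxfd_submit_frame(net, &rx);
	}

	return 0;
}

The single-producer/single-consumer kfifo keeps the hot SPI path lock-free;
whether the additional wakeups eat up the gain is exactly the open question.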
>>>
>>>> Logic to switch from inline to deferred queuing could be made dynamic,
>>>> based on traffic (i.e. if there is more than one FIFO filled on the
>>>> controller, or there is already something in the queue, then defer
>>>> submission to that separate thread).
>>>>
>>>> Obviously this leads to delays in submission, but at least for
>>>> medium-length bursts of messages no message gets lost or dropped.
>>>>
>>>> Is this something the driver should address (as a separate patch)?
>>>
>>> Imagine you have 2 or even 4 CAN controllers on the board. Or a system
>>> with just one CPU.
>>
>> As this is very specific to the high-load test, I doubt that a second
>> controller could be used under these circumstances.
>
> Me too!
>
>> The reason is that the statistics for the above performance test
>> basically show:
>> * the SPI clock is asserted (high) for about 25% of the time - hence the
>>   clock is running for about 50% of the time - that is at (effectively)
>>   9 cycles per byte where the last cycle is low, so the actual figure is
>>   about 12% higher...
>> * SPI-CS is asserted (low) for about 75% of the time.
>> * on top there is the spi framework overhead, which is not easily
>>   measurable, but the time between CS-deassert and CS-assert for 2
>>   back-to-back spi_sync calls is 4-6us (with minimal error handling) -
>>   this is the penalty of the SPI framework.
>>
>> Taking this into account, the 75% rises to 99.7% SPI utilization at
>> 49.4k CS-asserts/s (= 75% + (5us * 49400)/1s * 100% = 75% + 24.7%).
>>
>> This does not leave much space for additional SPI traffic under high
>> load.
>>
>> At a 500kHz CAN bus rate it should be possible - or at higher SPI clock
>> rates (I am using 12.5MHz so that I still have a chance to capture the
>> SPI traffic with my logic analyzer at a sufficiently high sample rate
>> of 50MHz).
>>
>> So there is "space" there as well when interfacing at 20MHz.
>>
>> A second controller may be possible for all but these torture tests.
>> CAN FD with an 8MHz data rate and DLC=9 (12 bytes) may be another
>> situation where this breaks - the driver reads 8 data bytes together
>> with the header and everything above that in a separate spi_sync call
>> (the driver may become more dynamic based on previous packets in the
>> future, but mis-predicting DLC=15 while the effective DLC is 8 is bad,
>> as it increases the number of bytes transferred over the SPI bus by a
>> factor of 3.5 - 78 vs 22 bytes!).
>>
>> But that will be another patch.
>
> Well, SPI is a pain (for software).

If the spi-bus driver is not heavily optimized, then yes! I am actually
quite proud that I can sustain this load test on the SPI side, and I have
some optimizations that will improve it further by reducing the number of
SPI transfers. But - as said - those will be "add-on" patches to make the
optimizations visible.

>
>>>> Or should there be something in the can framework/stack that could
>>>> handle such situations better?
>>>
>>> That would be ideal, but I'm not sure if it's feasible. Already the high
>>> interrupt load is a problem. And SPI likely as well. And I think we
>>> suffer from the overhead of the networking stack here as well.
>>>
>>>> Or should I just ignore those "dropped" packets, as this is really a
>>>> worst-case scenario?
>>>
>>> If we can do it better with reasonable effort and in a portable way, we
>>> should address it, of course. It would be interesting to know how other
>>> CAN controllers behave on that hardware.
>>
>> In the future maybe - I just wanted this fact to be on the radar/known...
>
> Keep it simple! I'm not sure if this worst-case scenario is a realistic
> one, as all other nodes on the CAN bus would moan as well. Do you have a
> chance to run such a test with a Flexcan or SJA1000?

I do not own such a device, but I am using a BeagleBone Black for this
load testing and there it works fine: it can handle those 17k messages/s
tx and rx without problems (but at 17k interrupts/s).

Thanks anyway for the input,

Martin