> On 30.11.2018, at 15:20, Wolfgang Grandegger <wg@xxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> Am 29.11.2018 um 16:57 schrieb kernel@xxxxxxxxxxxxxxxx:
>> Hi Wolfgang!
>>> On 28.11.2018, at 21:55, Wolfgang Grandegger <wg@xxxxxxxxxxxxxx> wrote:
>>>
>>> Hello,
>>>
>>> Am 28.11.2018 um 19:39 schrieb kernel@xxxxxxxxxxxxxxxx:
>>>> Hi!
>>>>
>>>> While working on the rewrite of the mcp25xxfd driver for upstreaming I
>>>> have come across a strange observation with regard to dropped frames:
>>>>
>>>> Essentially I am running a worst-case CAN2.0 bus-saturation test where
>>>> I receive 1M CAN2.0 frames (standard ID, len 0) at a 1MHz CAN bus rate
>>>> in 57s (= 17000 frames/s).
>>>
>>> Do you also get that many interrupts, or even more (for SPI as well)?
>>
>> No - for this specific test the system is heavily loaded doing SPI
>> transfers, so the MCP2517FD interrupt line (almost) never goes high
>> during the test.
>>
>> On top of that, the SPI driver on the bcm283x SOC is heavily optimized
>> (mostly for short transfers, to improve performance with the mcp2515)
>> and it will run polled IO for transfers that complete quickly (<30us)
>> and only defer to interrupts beyond that - mostly to avoid the latencies
>> of irq-thread wakeup scheduling.
>>
>> As the SPI speed is quite high (in my case 12.5MHz) you can transfer the
>> 22 bytes needed to read the content of a FIFO in 16us - which is
>> acceptable for polling (with CS select/deselect this is about 22us).
>>
>> So my expectation is that for most practical purposes there is no irq on
>> the SPI side.
>> On the other hand I am seeing on the order of 39k interrupts plus 62k
>> context switches per second even though there is no load on the system -
>> but none of those interrupts are SPI or mcp2517 related.
>
> You mean "cat /proc/interrupts" reports 100000 interrupts per second,
> one every 10 microseconds!?

These were mostly rescheduling interrupts to other cores to wake the spi
worker thread, which was started there and then typically stopped again
immediately.

>> Some of the context switching is strange, as both the spi0 and the
>> irq-mcp25xxfd kernel threads are consuming CPU - but I guess that is an
>> issue with SPI I need to look into separately.

This has been resolved by a simple patch that I have just submitted to the
spi list, which avoids this wakeup call in the "inline" spi_async case. Now
those interrupts are gone and spi0 is not consuming any CPU at all.

>>>
>>>> On a Raspberry Pi3 I can handle this load on the SPI side without any
>>>> issues or lost packets (even though the driver is still unoptimized and
>>>> I made the decision to submit those optimizations as separate patches
>>>> on top of the basic functionality).
>>>>
>>>> This means with the following code disabled:
>>>>
>>>> 	skb = alloc_can_skb(net, &frame);
>>>> 	if (!skb)
>>>> 		return NULL;
>>>> 	frame->can_id = id;
>>>> 	frame->can_dlc = dlc;
>>>> 	memcpy(frame->data, rx->data, len);
>>>> 	netif_rx_ni(skb);
>>>>
>>>> (counters are updated before this code is executed)
>>>>
>>>> But when I enable submission of the frames to the network stack I get
>>>> lots of dropped packets, the CPU load increases, and I also see packet
>>>> loss on the SPI side due to CPU congestion.
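
(To put that snippet into context, here is a minimal sketch of the complete
submission path as I run it - mcp25xxfd_submit_frame() and the fields of
struct mcp25xxfd_rx_obj are hypothetical placeholder names, not the actual
driver code:)

#include <linux/can/dev.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* hypothetical representation of one rx object read from the chip */
struct mcp25xxfd_rx_obj {
	canid_t id;
	u8 dlc;
	u8 len;
	u8 data[8];
};

static int mcp25xxfd_submit_frame(struct net_device *net,
				  const struct mcp25xxfd_rx_obj *rx)
{
	struct can_frame *frame;
	struct sk_buff *skb;

	/* counters are updated by the caller before this runs */
	skb = alloc_can_skb(net, &frame);
	if (!skb)
		return -ENOMEM;

	frame->can_id = rx->id;
	frame->can_dlc = rx->dlc;
	memcpy(frame->data, rx->data, rx->len);

	/*
	 * netif_rx_ni() only enqueues to the per-CPU backlog; if that
	 * backlog (limited by netdev_max_backlog) overflows because the
	 * CPU cannot keep up, the frame is accounted as "dropped" in the
	 * interface statistics.
	 */
	netif_rx_ni(skb);

	return 0;
}

That the rx packet counter below still shows the full 1M while 945334 frames
are accounted as dropped fits this picture: the drops happen in the backlog
behind netif_rx_ni(), not in the driver.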
>>>>
>>>> Here are the stats after 1M packets received without submission to the
>>>> stack:
>>>>
>>>> root@raspcm3:~# ip -d -s link show can0
>>>> 11: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>>>     link/can promiscuity 0
>>>>     can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>>>>           bitrate 1000000 sample-point 0.750
>>>>           tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>>>>           mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>>>>           dbitrate 1000000 dsample-point 0.750
>>>>           dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>>>>           mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>>>>           clock 40000000
>>>>           re-started bus-errors arbit-lost error-warn error-pass bus-off
>>>>           0          0          0          0          0          0
>>>>     numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>>>     RX: bytes  packets  errors  dropped  overrun  mcast
>>>>     0          1000000  0       0        0        0
>>>>     TX: bytes  packets  errors  dropped  carrier  collsns
>>>>     0          0        0       0        0        0
>>>>
>>>> And after a module reload, now with the packet submission code enabled
>>>> (just a module parameter changed):
>>>>
>>>> root@raspcm3:~# ip -d -s link show can0
>>>> 12: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>>>     link/can promiscuity 0
>>>>     can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>>>>           bitrate 1000000 sample-point 0.750
>>>>           tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>>>>           mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>>>>           dbitrate 1000000 dsample-point 0.750
>>>>           dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>>>>           mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>>>>           clock 40000000
>>>>           re-started bus-errors arbit-lost error-warn error-pass bus-off
>>>>           0          0          0          0          0          0
>>>>     numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>>>     RX: bytes  packets  errors  dropped  overrun  mcast
>>>>     0          1000000  0       945334   0        0
>>>>     TX: bytes  packets  errors  dropped  carrier  collsns
>>>>     0          0        0       0        0        0
>>>
>>> What CPU load do you see, e.g. with "htop -d50"?
>>
>> htop -d 50 shows:
>> * with submission:
>>   * 1 core 100%
>>   * 1 core 25%
>> * without submission:
>>   * same measurements
>>
>> this sounds strange...
>
> Then it's no surprise that most of the packets are dropped.

Well, that was one of the reasons why I was thinking of deferring message
submission to a separate thread running on a different core, where the
network stack would have more time to process the messages - the network
backlog queues are per CPU/core if I understand correctly.

>>>
>>> The symptoms are similar to what we get with bus-errors starving the CPU
>>> on some low-end systems (where we cannot disable the creation of
>>> bus-errors, e.g. if no cable is connected).
>>
>> I cannot comment on that...
>
> It's a different issue then.

It is, but I have heard a comment that the old version of the driver shows
high CPU load when not connected to a CAN bus (bus-off, or no peer to ACK
the frames).
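
(For reference: a common driver-side mitigation for that case is to forward
bus-error frames to the stack only when the user explicitly enabled
berr-reporting - it reduces the stack load, though not the interrupt load.
A minimal sketch, with hypothetical names (struct mcp25xxfd_priv, priv->net)
for the driver context:)

#include <linux/can/dev.h>
#include <linux/can/error.h>

/* hypothetical private driver structure */
struct mcp25xxfd_priv {
	struct can_priv can;	/* must be the first member */
	struct net_device *net;
};

static void mcp25xxfd_handle_buserr(struct mcp25xxfd_priv *priv)
{
	struct can_frame *cf;
	struct sk_buff *skb;

	priv->can.can_stats.bus_error++;

	/* only bother the network stack if berr-reporting was requested */
	if (!(priv->can.ctrlmode & CAN_CTRLMODE_BERR_REPORTING))
		return;

	skb = alloc_can_err_skb(priv->net, &cf);
	if (!skb)
		return;

	cf->can_id |= CAN_ERR_BUSERROR;
	netif_rx_ni(skb);
}

That does not help with the saturation test here, of course - it is only
relevant for the unconnected-bus case mentioned above.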
>
>>>
>>>> A more realistic scenario would be with DLC=8, and looks like this
>>>> (this took 122.3s):
>>>>
>>>> root@raspcm3:~# ip -d -s link show can0
>>>> 13: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
>>>>     link/can promiscuity 0
>>>>     can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>>>>           bitrate 1000000 sample-point 0.750
>>>>           tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
>>>>           mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>>>>           dbitrate 1000000 dsample-point 0.750
>>>>           dtq 25 dprop-seg 14 dphase-seg1 15 dphase-seg2 10 dsjw 1
>>>>           mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>>>>           clock 40000000
>>>>           re-started bus-errors arbit-lost error-warn error-pass bus-off
>>>>           0          0          0          0          0          0
>>>>     numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>>>>     RX: bytes    packets  errors  dropped  overrun  mcast
>>>>     8000000      1000000  0       0        0        0
>>>>     TX: bytes    packets  errors  dropped  carrier  collsns
>>>>     0            0        0       0        0        0
>>>
>>> This means the problem does not show up with a payload of 8 bytes!?
>>
>> For all practical purposes I have not seen any drops in this case.
>> But in this case I see interrupts on the int-mcp251xfd side (498k for 1M
>> CAN frames).
>
> 498000 interrupts per second, really? Gosh!
>
>>>
>>>> So I am wondering: is there already a good idea how this worst-case
>>>> issue can be avoided in the first place?
>>>>
>>>> What I could come up with is the idea of:
>>>> * queuing packets in a ring buffer of a certain size
>>>> * having a separate submission thread that pushes messages onto the
>>>>   network stack (essentially the short code above)
>>>>
>>>> The idea is that this thread would (hopefully) get scheduled on a
>>>> different core so that the CPU resources would be better used.
>>>
>>> Well, I'm not sure if that makes it better. You have to send the queued
>>> packets in order to the network stack. Either you queue all messages or
>>> just the small ones till a bigger one arrives. Anyway, this optimization
>>> is very special.
>>
>> I agree - that was why I was asking…
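
(To make the ring-buffer idea more concrete, here is a rough sketch of what
I have in mind - reusing the hypothetical mcp25xxfd_submit_frame() and
struct mcp25xxfd_rx_obj from the sketch further up; the submission thread
would be created with kthread_run() in ndo_open:)

#include <linux/kfifo.h>
#include <linux/kthread.h>
#include <linux/wait.h>

static DEFINE_KFIFO(rx_fifo, struct mcp25xxfd_rx_obj, 256);
static DECLARE_WAIT_QUEUE_HEAD(rx_wait);

/* producer side - called right after an rx object was read via SPI */
static void mcp25xxfd_queue_rx(struct net_device *net,
			       const struct mcp25xxfd_rx_obj *rx)
{
	if (!kfifo_put(&rx_fifo, *rx)) {
		/* ring full - account the drop instead of blocking SPI */
		net->stats.rx_dropped++;
		return;
	}
	wake_up(&rx_wait);
}

/* consumer side - the deferred submission thread */
static int mcp25xxfd_rx_thread(void *arg)
{
	struct net_device *net = arg;
	struct mcp25xxfd_rx_obj rx;

	while (!kthread_should_stop()) {
		wait_event_interruptible(rx_wait,
					 !kfifo_is_empty(&rx_fifo) ||
					 kthread_should_stop());
		while (kfifo_get(&rx_fifo, &rx))
			mcp25xxfd_submit_frame(net, &rx);
	}

	return 0;
}

The single-producer/single-consumer kfifo keeps the hot SPI path lock-free;
whether the additional wakeups eat up the gain is exactly the open question.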
>>>
>>>> Logic to switch from inline to deferred queuing could be made dynamic,
>>>> based on traffic (i.e. if there is more than one FIFO filled on the
>>>> controller, or there is already something in the queue, then defer
>>>> submission to that separate thread).
>>>>
>>>> Obviously this leads to delays in submission, but at least for
>>>> medium-length bursts of messages no message gets lost or dropped.
>>>>
>>>> Is this something the driver should address (as a separate patch)?
>>>
>>> Imagine you have 2 or even 4 CAN controllers on the board. Or a system
>>> with just one CPU.
>>
>> As this is very specific to the high-load test, I doubt that a second
>> controller could be used under these circumstances.
>
> Me too!
>
>> The reason is that the statistics for the above performance test
>> basically show:
>> * the SPI clock is asserted (high) for about 25% of the time - hence the
>>   clock is running for about 50% of the time - that is at (effectively)
>>   9 cycles per byte where the last cycle is low, so the actual figure is
>>   about 12% higher...
>> * SPI-CS is asserted (low) for about 75% of the time.
>> * on top there is the spi framework overhead, which is not easily
>>   measurable, but the time between CS-deassert and CS-assert for 2
>>   back-to-back spi_sync calls is 4-6us (with minimal error handling) -
>>   this is the penalty of the SPI framework.
>>
>> Taking this into account, the 75% rises to 99.7% SPI utilization at
>> 49.4k CS-asserts/s (= 75% + (5us * 49400)/1s * 100% = 75% + 24.7%).
>>
>> This does not leave much space for additional SPI traffic under high
>> load.
>>
>> At a 500kHz CAN bus rate it should be possible - or at higher SPI clock
>> rates (I am using 12.5MHz so that I still have a chance to capture the
>> SPI traffic with my logic analyzer at a sufficiently high sample rate
>> of 50MHz).
>>
>> So there is "space" there as well when interfacing at 20MHz.
>>
>> A second controller may be possible for all but these torture tests.
>> CAN FD with an 8MHz data rate and DLC=9 (12 bytes) may be another
>> situation where this breaks - the driver reads 8 data bytes together
>> with the header and everything above that in a separate spi_sync call
>> (the driver may become more dynamic based on previous packets in the
>> future, but mis-predicting DLC=15 while the effective DLC is 8 is bad,
>> as it increases the number of bytes transferred over the SPI bus by a
>> factor of 3.5 - 78 vs 22 bytes!).
>>
>> But that will be another patch.
>
> Well, SPI is a pain (for software).

If the spi-bus driver is not heavily optimized, then yes! I am actually
quite proud that I can sustain this load test on the SPI side, and I have
some optimizations that will improve it further by reducing the number of
SPI transfers. But - as said - those will be "add-on" patches to make the
optimizations visible.

>
>>>> Or should there be something in the can framework/stack that could
>>>> handle such situations better?
>>>
>>> That would be ideal, but I'm not sure if it's feasible. Already the high
>>> interrupt load is a problem. And SPI likely as well. And I think we
>>> suffer from the overhead of the networking stack here as well.
>>>
>>>> Or should I just ignore those "dropped" packets, as this is really a
>>>> worst-case scenario?
>>>
>>> If we can do it better with reasonable effort and in a portable way, we
>>> should address it, of course. It would be interesting to know how other
>>> CAN controllers behave on that hardware.
>>
>> In the future maybe - I just wanted this fact to be on the radar/known...
>
> Keep it simple! I'm not sure if this worst-case scenario is a realistic
> one, as all other nodes on the CAN bus would moan as well. Do you have a
> chance to run such a test with a Flexcan or SJA1000?

I do not own such a device, but I am using a BeagleBone Black for this
load testing and there it works fine: it can handle those 17k messages/s
tx and rx without problems (but at 17k interrupts/s).

Thanks anyway for the input,

Martin