Hello Sebastian,

On Wednesday 29 of January 2025 11:17:09 Sebastian Andrzej Siewior wrote:
> On 2025-01-28 16:29:27 [+0100], Pavel Pisa wrote:
> > Please check if you find some problematic choices.
>
> I didn't find anything obviously wrong. Assuming your CPU is busy in
> general you could remove NO_HZ in favour of PERIODIC. This is however
> not to cause spikes you describe below.

Great, thanks very much for the review by an expert.

> > The cyclic test worked well, and we have even delivered two systems
> > to OSADL QA real-time farm
> >
> >   https://www.osadl.org/?id=4109
>
> It shows "IRQ work interrupts". Not sure what causes them.

I am not sure either. That list is from an old kernel in the long-term
testing setup at OSADL. The current one shows no IRQ work interrupts
after the last reboot and an overnight test

Linux mzapo 6.13.0-rc6-rt3-dut #1 SMP PREEMPT_RT Wed Jan 29 04:46:40 CET 2025 armv7l GNU/Linux

           CPU0       CPU1
 24:          0          0     GIC-0  27 Edge      gt
 25:     700822     327164     GIC-0  29 Edge      twd
 26:        300          0     GIC-0  59 Level     xuartps
 29:          0          0     GIC-0  45 Level     f8003000.dmac
 30:          0          0     GIC-0  46 Level     f8003000.dmac
 31:          0          0     GIC-0  47 Level     f8003000.dmac
 32:          0          0     GIC-0  48 Level     f8003000.dmac
 33:          0          0     GIC-0  49 Level     f8003000.dmac
 34:          0          0     GIC-0  72 Level     f8003000.dmac
 35:          0          0     GIC-0  73 Level     f8003000.dmac
 36:          0          0     GIC-0  74 Level     f8003000.dmac
 37:          0          0     GIC-0  75 Level     f8003000.dmac
 40:     460330          0     GIC-0  54 Level     end0
 41:          0          0     GIC-0  53 Level     e0002000.usb
 42:        356          0     GIC-0  56 Level     mmc0
 43:          0          0     GIC-0  43 Level     ttc_clockevent
 44:         25          0     GIC-0  39 Level     f8007100.adc
 45:          0          0     GIC-0  37 Level     arm-pmu
 46:          0          0     GIC-0  38 Level     arm-pmu
 47:        128          0     GIC-0  40 Level     f8007000.devcfg
 48:     314697          0     GIC-0  61 Level     can2
 49:     314597          0     GIC-0  62 Level     can3
 50:     314759          0     GIC-0  63 Level     can4
 51:     311516          0     GIC-0  64 Level     can5
IPI0:          0          0  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:      17849     292126  Rescheduling interrupts
IPI3:       5923      11315  Function call interrupts
IPI4:          0          0  CPU stop interrupts
IPI5:     271078      74040  IRQ work interrupts
IPI6:          0          0  completion interrupts
Err:          0

So this does not seem to be the cause.

> > However, the CAN/CAN FD communication latency measured on the CTU CAN FD
> > IP core is far from optimal. Some runs under load with
> > 10 msec latency. Our own CAN FD stack for RTEMS keeps with no exception
> > under 60 usec on the same hardware.
> >
> > I understand that the Linux socket layer and networking
> > stack are complex, and many optimizations are ahead.
> > We will be happy to contribute where we can and find time
> > and even some resources to engage more students etc...
> >
> > But I would like to be sure that the bad results are not
> > caused by our mistakes in configuration.
>
> You have CAN and "regular networking". My guess would be that regular
> networking blocks BH and so your CAN. You could try to have all
> interrupts serviced on CPU0 and move CAN to CPU1. If so this should
> improve then. Other than that, I would suggest to get some tracing to
> see what delays your CAN interrupts and/ or handling in general.

Yes, I think that the design which mixes regular network packet
processing with CAN is the problem. We even test with a setup
where the CAN interrupt priority is boosted to 90

  echo "-> Rise CAN irq priorities"
  PIDS=$(ps -e | grep -E irq/[0-9]+-can[3-4] | tr -s ' ' | cut -d ' ' -f2)
  TXPID=$(ps -e | grep -E irq/[0-9]+-can2 | tr -s ' ' | cut -d ' ' -f2)
  chrt -f --pid 80 $TXPID
  for pid in $PIDS ; do
    chrt -f --pid 85 $pid
  done

ps Hxa --sort rtprio -o pid,policy,rtprio,state,tname,time,command

...
   70 FF      50 S ?        00:00:00 [irq/37-f8003000.dmac]
   71 FF      50 S ?        00:00:38 [irq/40-eth%d]
...
  405 FF      50 S ?        00:00:00 [irq/26-xuartps]
  355 FF      90 S ?        00:00:06 [irq/48-can2]
  361 FF      90 S ?        00:00:13 [irq/49-can3]
  366 FF      90 S ?        00:00:07 [irq/50-can4]
  371 FF      90 S ?        00:00:06 [irq/51-can5]
   22 FF      99 S ?        00:00:00 [migration/0]
   27 FF      99 S ?        00:00:00 [migration/1]

Even this setup is problematic under load.
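
A minimal sketch of the affinity experiment suggested in the quoted
text above (keep other interrupts on CPU0, move CAN to CPU1). It has
not been run on this target. It assumes the standard
/proc/irq/<N>/smp_affinity interface and that the CAN interrupts are
named canN as in the listing above. PROC_ROOT and the function names
are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Sketch only: steer the CAN interrupts to CPU1.
# PROC_ROOT is a hypothetical override so the logic can be tried on a copy.
PROC_ROOT=${PROC_ROOT:-/proc}

# Print the IRQ numbers whose action name (last field) looks like canN.
can_irqs() {
    awk '$NF ~ /^can[0-9]+$/ { sub(":", "", $1); print $1 }' \
        "$PROC_ROOT/interrupts"
}

# Write a hexadecimal CPU mask to smp_affinity of each IRQ given.
set_affinity() {
    mask=$1; shift
    for irq in "$@"; do
        echo "$mask" > "$PROC_ROOT/irq/$irq/smp_affinity"
    done
}

# Mask 2 selects CPU1 only; the word splitting of $(can_irqs) is intended.
pin_can_to_cpu1() {
    set_affinity 2 $(can_irqs)
}
```

Calling pin_can_to_cpu1 as root would be the whole experiment. The
other interrupts in the listing above are already counted only on
CPU0, so the sketch leaves them alone.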

The situation with CAN IRQ priority 50 and 90 can be compared
by clicking on the "RT priority set" option

  https://canbus.pages.fel.cvut.cz/can-latester/inspect.html?kernel=rt&prio=1&load=1&flood=1&fd=1

The switch between the in-kernel CAN gateway and the userspace one
is controlled by the "Kernel GW" option. The user CAN gateway is run
with priority 80

  chrt -r 80 ugw -f can3 can2

I spotted an interesting trend after

  run-250103-045322-hist+6.13.0-rc1-rt1-g5374fecd2695+flood-prio-fd-load.json

In the user gateway case, a simple copy of frames from can3 to can2
has not exceeded 1.4 ms for almost one month. It could be interesting
to correlate that with kernel changes. We use the branch

  for-kbuild-bot/current-stable

from

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git

to run daily testing. We could consider something different, but this
choice was driven by the wish for something functional every day and
ahead of mainline merges, to catch some problems in advance.

It is interesting that the in-kernel gateway is significantly worse
now. It does not have the overhead of switching to userspace. But I am
not sure whether it is invoked in some kernel worker which has a lower
or the same real-time priority as Ethernet networking.

In general, I think that the problem is that incoming packets
(CAN and Ethernet) load the same per-CPU worker.

There are even backlog_napi threads per CPU

   46 TS       - S ?        00:00:00 [backlog_napi/0]
   47 TS       - S ?        00:00:00 [backlog_napi/1]

They even run with the TS (time-sharing) policy only.

If I remember well, an option has been added to allocate a separate
RX packet processing thread (instead of the default per-CPU one)
for a given interface. But I have no experience with such
a configuration. Do you or somebody else have an idea how to
achieve that, and whether it is legal to boost the priority of such
a kernel thread?

It could help, because my general experience with PREEMPT_RT,
even on this target, is very positive for tasks mapping HW directly
and doing RT control. The same holds for the latency tester:
no spikes over 250 usec (or even less) under load.
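
A minimal sketch of how such a per-interface RX thread might be
enabled and boosted. It has not been tried on this target. It assumes
the per-device /sys/class/net/<dev>/threaded attribute (threaded NAPI),
kernel threads named napi/<dev>-<id>, and a procps-style ps. The
priority value, SYS_NET and the function names are hypothetical.
Whether boosting these threads is safe is exactly the open question
above, so the sketch does not answer it.

```shell
#!/bin/sh
# Sketch only: poll one interface's NAPI in its own kthread and raise
# that thread's priority. SYS_NET is a hypothetical override for dry runs.
SYS_NET=${SYS_NET:-/sys/class/net}

# Ask the kernel to poll this device's NAPI instances in dedicated threads.
enable_threaded_napi() {
    echo 1 > "$SYS_NET/$1/threaded"
}

# Read "PID COMM" lines on stdin and print the PIDs of the threads
# named napi/<dev>-<id> for the device given as the first argument.
napi_pids() {
    awk -v pfx="napi/$1-" 'index($2, pfx) == 1 { print $1 }'
}

# Give the NAPI threads of a device SCHED_FIFO with the given priority.
boost_napi() {
    dev=$1; prio=$2
    enable_threaded_napi "$dev"
    for pid in $(ps -e -o pid=,comm= | napi_pids "$dev"); do
        chrt -f --pid "$prio" "$pid"
    done
}
```

For example, "boost_napi can3 85" run as root would mirror the IRQ
thread priorities used in the script above. The value 85 is only an
assumption.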

> > I will be happy to meet you and discuss Linux and other
> > control and real-time areas at FOSDEM 2025.
>
> I should be able to make it.

Great, I would be happy to meet at FOSDEM or to discuss these
topics later at some other event.

> > Slides in English which I want to update/correct for FOSDEM
> >
> > https://talks.openalt.cz/media/openalt-2024/submissions/3XTMDF/resources/openalt24_linux_for_rt-reduced_FbZPuS0.pdf
>
> looks good. If you want additional history points, I have some at
> https://files.speakerdeck.com/presentations/0620b5b3a00b42fc91fba6cc4092d278/KR_2024_PREEMPT_RT_over_the_years.pdf
> Slide 11 - 21.

Thanks very much for the input.

> However you have most of the pieces so.

Best wishes,

                Pavel
--
                Pavel Pisa
    phone:      +420 603531357
    e-mail:     pisa@xxxxxxxxxxxxxxxx
    Department of Control Engineering FEE CVUT
    Karlovo namesti 13, 121 35, Prague 2
    university: http://control.fel.cvut.cz/
    personal:   http://cmp.felk.cvut.cz/~pisa
    social:     https://social.kernel.org/ppisa
    projects:   https://www.openhub.net/accounts/ppisa
    CAN related:http://canbus.pages.fel.cvut.cz/
    RISC-V education: https://comparch.edu.cvut.cz/
    Open Technologies Research Education and Exchange Services
    https://gitlab.fel.cvut.cz/otrees/org/-/wikis/home