Hi list,
I spent some time evaluating different raspberrypi-kernel/preempt_rt
patch combinations and found some "interesting" results on which maybe
one of you can shed a bit of light.
I am aware that the raspberrypi-kernel is not really vanilla anymore and
so it's possible that nothing much can be said about the issue, but I'm
giving it a shot nonetheless.
As a preamble, here's my evaluation routine:
- set scaling_governor to performance for all CPUs
- pin the GPU frequency to either 250 or 500 MHz
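For reference, the governor part can be done along these lines (a minimal
sketch; the sysfs paths are the standard cpufreq ones, and the GPU pin
goes through /boot/config.txt on Raspberry Pi OS, where exact option names
depend on the firmware):

```shell
# Set the "performance" governor on every CPU via the standard cpufreq
# sysfs interface (needs root; CPUs without cpufreq are skipped).
count=0
for gov in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    if [ -w "$gov" ] && echo performance > "$gov" 2>/dev/null; then
        count=$((count + 1))
    fi
done
echo "performance governor set on $count CPU(s)"

# Pinning the GPU/core clock: on Raspberry Pi OS this is typically done
# in /boot/config.txt by forcing min == max, e.g.
#   core_freq=500
#   core_freq_min=500
# and verified after a reboot with: vcgencmd measure_clock core
```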
I use the 32-bit or 64-bit Raspberry Pi OS on the RPi 4B to build/run
32-bit and 64-bit kernels respectively.
The major kernel configuration options (besides PREEMPT_RT, found mostly
by trial and error):
- disable all kernel profiling, latency measurement and debugging options
- disable process accounting
- use a 1000 Hz timer
- use periodic timer instead of dynamic ticks
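In .config terms that corresponds roughly to the following (option names
are from mainline Kconfig; my actual list of disabled debug/profiling
options is longer than shown here):

```shell
CONFIG_PREEMPT_RT=y
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ_FULL is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_PROFILING is not set
# CONFIG_LATENCYTOP is not set
```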
I then boot the different kernels, run
sudo ./cyclictest -M -p 90 -S --mlockall
and in another terminal trigger a rebuild of a linux kernel with -j4.
Then I wait for a while and note down the maximum. Here are some results:
32 bit kernels:
4.19.71-rt24: ca. 130 us
5.10.90-rt61: ca. 140 us
5.18.0-rc7-rt9: ca. 160 us
It looks like there's a rather clear regression going forward with
kernel versions.
Here's a 64 bit kernel:
5.18.0-rc7-rt9: ca. 230 us
So that's even worse. Before I went on to the full 1000 Hz with periodic
timer, disabling all kernel debugging, etc., and pinning the GPU
frequency, I measured some more kernels:
32 bit:
4.19.71-rt24: ca. 200 us
5.10.90-rt61: ca. 200 us
5.15.40-rt43: ca. 190 us
5.18.0-rc7-rt9: ca. 170 us
64 bit:
5.15.40-rt43: ca. 220 us
5.18.0-rc7-rt9: ca. 270 us
So the regression going forward over kernel versions isn't as clear-cut
anymore, but one trend is overwhelmingly visible over all these tests:
64 bit kernels have a higher maximum latency when compared to 32 bit
kernels.
Do you happen to have an idea why that may be? Is there some additional
tweak required for the 64 bit kernels on that hardware?
Also an additional observation and question:
The high latencies are triggered by kernel compiles. Neither running
stress -c 8 nor writing zeros to the SD card triggers them. It seems to
be specific to that combined workload.
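For completeness, the two loads that did not reproduce the latencies were
of roughly this shape (a sketch: the sizes are reduced here, and the
original write went to the SD card device rather than to a file):

```shell
# Pure CPU load: 8 busy-loop workers for 5 s (skipped if stress is
# not installed).
command -v stress >/dev/null 2>&1 && stress -c 8 --timeout 5

# Pure write load: stream zeros to storage and force them out with fsync.
dd if=/dev/zero of=/tmp/zeros.img bs=1M count=8 conv=fsync 2>/dev/null
```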
I played with lowering the threaded IRQ priorities for the mmc drivers,
but that had no effect at all, as expected, since they run at priority 50
by default and cyclictest runs at 90. Do you have an idea what the
problem triggered by that particular workload might be?
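For reference, the priority check/change was done along these lines
(a sketch; the thread names and PIDs are system-specific):

```shell
# List threaded IRQ handlers with their RT priorities; on the Pi the
# mmc handler shows up with a name like "irq/NN-mmc..." (illustrative).
irq_threads=$(ps -eo pid,rtprio,comm | grep 'irq/' || true)
echo "$irq_threads"

# Lowering one handler below cyclictest's priority 90 (needs root), e.g.:
#   chrt -f -p 25 <pid-of-irq-thread>
```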
Also coming back to the point about the kernel not being really vanilla:
If I rebuild the kernel with latency measurement instrumentation, etc.,
and get some function traces for high-latency code paths, would you
people even consider looking at them? :)
Kind regards,
FPS
--
Biologische Kybernetik
Universität Bielefeld
Phone: ++49 521 106 5535
http://www.uni-bielefeld.de/biologie/Kybernetik/index.html