Hi!

I'm playing around with a Raspberry Pi 5 and a PREEMPT_RT kernel (Raspberry Pi 6.13.y branch) and wonder about the latencies I see. First I tried cyclictest and saw latencies from the upper 50s up to around 80 us under a kernel-compile load. So I built the kernel with the timerlat and osnoise tracers and ran rtla timerlat top -a 60 --dma-latency 0 for a bit, and I get traces like the following:

                                     Timer Latency
  0 00:00:22   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #21138     |        4         0         0        14 |       17         3         4        48
  1 #21138     |        6         0         0        21 |       41         3         4        50
  2 #21138     |       11         0         0        11 |       61         3         4        61
  3 #21138     |        5         0         0        10 |       40         3         4        57
rtla timerlat hit stop tracing
## CPU 2 hit stop tracing, analyzing it ##
  IRQ handler delay:                                    1.36 us (2.22 %)
  IRQ latency:                                         11.60 us
  Timerlat IRQ duration:                               21.26 us (34.63 %)
  Blocking thread:                                     19.46 us (31.70 %)
                        cc1:265581                     19.46 us
  Blocking thread stack trace
                -> timerlat_irq
                -> __hrtimer_run_queues
                -> hrtimer_interrupt
                -> arch_timer_handler_phys
                -> handle_percpu_devid_irq
                -> generic_handle_domain_irq
                -> gic_handle_irq
                -> call_on_irq_stack
                -> do_interrupt_handler
                -> el0_interrupt
                -> __el0_irq_handler_common
                -> el0t_64_irq_handler
                -> el0t_64_irq
------------------------------------------------------------------------
  Thread latency:                                      61.40 us (100%)

Max timerlat IRQ latency from idle: 4.74 us in cpu 2
  Saving trace to timerlat_trace.txt

The surprising thing was that there is nothing out of the ordinary here (at least to my non-expert eyes): no spinlocks or similar, just a syscall (these are implemented via IRQ EL0, right?) which then goes straight into the timer handling. I might be completely off here, though, as I am not an expert at all in these things.

When I contrast this with an unloaded system and a stop threshold of 10 us, there actually is "something" in the trace:

                                     Timer Latency
  0 00:00:04   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #4004      |        0         0         0         1 |        3         3         3         6
  1 #4004      |        1         0         0         1 |       10         3         3        10
  2 #4003      |        0         0         0         0 |        3         3         3         7
  3 #4003      |        0         0         0         1 |        3         3         3         7
rtla timerlat hit stop tracing
## CPU 1 hit stop tracing, analyzing it ##
  IRQ handler delay:                                    0.00 us (0.00 %)
  IRQ latency:                                          1.40 us
  Timerlat IRQ duration:                                4.67 us (45.67 %)
  Blocking thread:                                      3.17 us (30.98 %)
                       htop:262562                      3.17 us
  Blocking thread stack trace
                -> timerlat_irq
                -> __hrtimer_run_queues
                -> hrtimer_interrupt
                -> arch_timer_handler_phys
                -> handle_percpu_devid_irq
                -> generic_handle_domain_irq
                -> gic_handle_irq
                -> call_on_irq_stack
                -> do_interrupt_handler
                -> el1_interrupt
                -> el1h_64_irq_handler
                -> el1h_64_irq
                -> tracer_preempt_on
                -> preempt_count_sub
                -> migrate_enable
                -> kmem_cache_free
                -> __fput
                -> __fput_sync
                -> __arm64_sys_close
                -> invoke_syscall
                -> el0_svc_common.constprop.0
                -> do_el0_svc
                -> el0_svc
                -> el0t_64_sync_handler
                -> el0t_64_sync
------------------------------------------------------------------------
  Thread latency:                                      10.22 us (100%)

Max timerlat IRQ latency from idle: 1.43 us in cpu 1
  Saving trace to timerlat_trace.txt

In this case it is htop invoking some syscall. So I really wonder how to interpret things like the first trace above, where "nothing" is happening and there are still large-ish latencies.

At some point I wondered what kind of load triggers the longest latencies, played around with "stress -m 6" a little, and was surprised to see the latencies go even higher than 100 us.
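As far as I understand it, each "stress -m" worker basically just loops on allocating a chunk of memory, dirtying every page, and freeing it again, i.e. something like this simplified sketch (my own reconstruction from the man page, not the actual stress source; the 256 MiB default size is from memory):

/* Rough sketch of what I think one "stress -m" (--vm) worker does:
 * allocate, touch every page, free, repeat.  Freeing such a large
 * allocation ends up as an munmap(), which is presumably where the
 * TLB-flush work in the trace below comes from. */
#include <stdlib.h>
#include <string.h>

#define VM_BYTES (256UL * 1024 * 1024)   /* default --vm-bytes, if I remember correctly */

int main(void)
{
    for (;;) {
        char *p = malloc(VM_BYTES);
        if (!p)
            continue;
        memset(p, 0xa5, VM_BYTES);   /* dirty every page */
        free(p);                     /* large allocation -> munmap() under the hood */
    }
}

So every iteration ends in a big munmap(), which seems to match the blocking-thread stack I see.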
Here is an example trace:

                                     Timer Latency
  0 00:00:16   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #15654     |        1         0         2        31 |        6         3        15        75
  1 #15654     |        2         0         3        42 |       13         3        17        91
  2 #15654     |        2         0         2        38 |       17         3        16        87
  3 #15654     |        5         0         3        26 |      111         3        16       111
rtla timerlat hit stop tracing
## CPU 3 hit stop tracing, analyzing it ##
  IRQ handler delay:                                    0.00 us (0.00 %)
  IRQ latency:                                          5.16 us
  Timerlat IRQ duration:                               47.41 us (42.53 %)
  Blocking thread:                                     50.50 us (45.30 %)
                     stress:278226                     50.50 us
  Blocking thread stack trace
                -> timerlat_irq
                -> __hrtimer_run_queues
                -> hrtimer_interrupt
                -> arch_timer_handler_phys
                -> handle_percpu_devid_irq
                -> generic_handle_domain_irq
                -> gic_handle_irq
                -> call_on_irq_stack
                -> do_interrupt_handler
                -> el1_interrupt
                -> el1h_64_irq_handler
                -> el1h_64_irq
                -> folios_put_refs
                -> free_pages_and_swap_cache
                -> __tlb_batch_free_encoded_pages
                -> tlb_flush_mmu
                -> unmap_page_range
                -> unmap_vmas
                -> vms_clear_ptes
                -> vms_complete_munmap_vmas
                -> do_vmi_align_munmap
                -> do_vmi_munmap
                -> __vm_munmap
                -> __arm64_sys_munmap
                -> invoke_syscall
                -> el0_svc_common.constprop.0
                -> do_el0_svc
                -> el0_svc
                -> el0t_64_sync_handler
                -> el0t_64_sync
------------------------------------------------------------------------
  Thread latency:                                     111.47 us (100%)

Max timerlat IRQ latency from idle: 1.98 us in cpu 3
  Saving trace to timerlat_trace.txt

OK, here at least something related to virtual memory is going on. I still wonder, though: how come the latencies are so large? Is the Raspberry Pi platform simply not suited to reaching lower latencies? Is everything working as expected? Could the Raspberry Pi patches be responsible for latency sources that are invisible in the traces? Is there any hope of getting these latencies a little more under control?

Kind regards,
FPS
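P.S.: In case the user-space side of my measurements matters: my mental model of what cyclictest measures (and, roughly, what the timerlat thread latency corresponds to) is a periodic SCHED_FIFO loop like the minimal sketch below. This is just a simplified illustration with an assumed 1 ms period and priority 90, not the actual cyclictest or timerlat code; if that picture is wrong somewhere, corrections are very welcome.

#include <stdio.h>
#include <time.h>
#include <sched.h>
#include <sys/mman.h>

#define NSEC_PER_SEC 1000000000L
#define PERIOD_NS       1000000L   /* assumed 1 ms wakeup period */

int main(void)
{
    struct sched_param sp = { .sched_priority = 90 };  /* assumed RT priority */
    struct timespec next, now;
    long max_ns = 0;

    mlockall(MCL_CURRENT | MCL_FUTURE);       /* avoid page faults in the loop */
    sched_setscheduler(0, SCHED_FIFO, &sp);   /* make this an RT thread (needs privileges) */

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (;;) {
        /* program the next absolute wakeup and sleep until then */
        next.tv_nsec += PERIOD_NS;
        if (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_nsec -= NSEC_PER_SEC;
            next.tv_sec++;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

        /* wakeup latency = how late we actually woke up */
        clock_gettime(CLOCK_MONOTONIC, &now);
        long lat = (now.tv_sec - next.tv_sec) * NSEC_PER_SEC
                 + (now.tv_nsec - next.tv_nsec);
        if (lat > max_ns) {
            max_ns = lat;
            printf("new max wakeup latency: %ld us\n", max_ns / 1000);
        }
    }
}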