Hi!

I'm playing around with a Raspberry Pi 5 and a PREEMPT_RT kernel (Raspberry Pi 6.13.y branch) and wonder about the latencies I see. First I tried cyclictest and saw latencies from the upper 50s up to around 80 us under a kernel-compile load. So I built the kernel with the timerlat and osnoise tracers and ran rtla timerlat top -a 60 --dma-latency 0 for a bit, and I get traces like the following:

                                     Timer Latency
  0 00:00:22   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #21138     |        4         0         0        14 |       17         3         4        48
  1 #21138     |        6         0         0        21 |       41         3         4        50
  2 #21138     |       11         0         0        11 |       61         3         4        61
  3 #21138     |        5         0         0        10 |       40         3         4        57
rtla timerlat hit stop tracing
## CPU 2 hit stop tracing, analyzing it ##
  IRQ handler delay:                                    1.36 us (2.22 %)
  IRQ latency:                                         11.60 us
  Timerlat IRQ duration:                               21.26 us (34.63 %)
  Blocking thread:                                     19.46 us (31.70 %)
                        cc1:265581                     19.46 us
  Blocking thread stack trace
                -> timerlat_irq
                -> __hrtimer_run_queues
                -> hrtimer_interrupt
                -> arch_timer_handler_phys
                -> handle_percpu_devid_irq
                -> generic_handle_domain_irq
                -> gic_handle_irq
                -> call_on_irq_stack
                -> do_interrupt_handler
                -> el0_interrupt
                -> __el0_irq_handler_common
                -> el0t_64_irq_handler
                -> el0t_64_irq
------------------------------------------------------------------------
  Thread latency:                                      61.40 us (100%)

Max timerlat IRQ latency from idle: 4.74 us in cpu 2
  Saving trace to timerlat_trace.txt

The surprising thing was that there is nothing out of the ordinary here (at least to my non-expert eyes): no spinlocks or similar, just a syscall (these are implemented via IRQ EL0, right?) which then goes straight into the timer handling. I might be completely off here, though, as I am not an expert at all in these things.

When I contrast this with an unloaded system and a stop threshold of 10 us, there actually is "something" in the trace:

                                     Timer Latency
  0 00:00:04   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #4004      |        0         0         0         1 |        3         3         3         6
  1 #4004      |        1         0         0         1 |       10         3         3        10
  2 #4003      |        0         0         0         0 |        3         3         3         7
  3 #4003      |        0         0         0         1 |        3         3         3         7
rtla timerlat hit stop tracing
## CPU 1 hit stop tracing, analyzing it ##
  IRQ handler delay:                                    0.00 us (0.00 %)
  IRQ latency:                                          1.40 us
  Timerlat IRQ duration:                                4.67 us (45.67 %)
  Blocking thread:                                      3.17 us (30.98 %)
                       htop:262562                      3.17 us
  Blocking thread stack trace
                -> timerlat_irq
                -> __hrtimer_run_queues
                -> hrtimer_interrupt
                -> arch_timer_handler_phys
                -> handle_percpu_devid_irq
                -> generic_handle_domain_irq
                -> gic_handle_irq
                -> call_on_irq_stack
                -> do_interrupt_handler
                -> el1_interrupt
                -> el1h_64_irq_handler
                -> el1h_64_irq
                -> tracer_preempt_on
                -> preempt_count_sub
                -> migrate_enable
                -> kmem_cache_free
                -> __fput
                -> __fput_sync
                -> __arm64_sys_close
                -> invoke_syscall
                -> el0_svc_common.constprop.0
                -> do_el0_svc
                -> el0_svc
                -> el0t_64_sync_handler
                -> el0t_64_sync
------------------------------------------------------------------------
  Thread latency:                                      10.22 us (100%)

Max timerlat IRQ latency from idle: 1.43 us in cpu 1
  Saving trace to timerlat_trace.txt

In this case it is htop invoking some syscall. So I really wonder how to interpret things like the first trace above, where "nothing" is happening and there are still large-ish latencies.

At some point I wondered what kind of load triggers the longest latencies, played around with "stress -m 6" a little, and was surprised to see the latencies go even higher than 100 us.
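As far as I understand it, each "stress -m" worker basically just loops on allocating a chunk of memory, dirtying every page, and freeing it again, i.e. something like this simplified sketch (my own reconstruction from the man page, not the actual stress source; the 256 MiB default size is from memory):

/* Rough sketch of what I think one "stress -m" (--vm) worker does:
 * allocate, touch every page, free, repeat.  Freeing such a large
 * allocation ends up as an munmap(), which is presumably where the
 * TLB-flush work in the trace below comes from. */
#include <stdlib.h>
#include <string.h>

#define VM_BYTES (256UL * 1024 * 1024)   /* default --vm-bytes, if I remember correctly */

int main(void)
{
    for (;;) {
        char *p = malloc(VM_BYTES);
        if (!p)
            continue;
        memset(p, 0xa5, VM_BYTES);   /* dirty every page */
        free(p);                     /* large allocation -> munmap() under the hood */
    }
}

So every iteration ends in a big munmap(), which seems to match the blocking-thread stack I see.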
Here is an example trace:

                                     Timer Latency
  0 00:00:16   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #15654     |        1         0         2        31 |        6         3        15        75
  1 #15654     |        2         0         3        42 |       13         3        17        91
  2 #15654     |        2         0         2        38 |       17         3        16        87
  3 #15654     |        5         0         3        26 |      111         3        16       111
rtla timerlat hit stop tracing
## CPU 3 hit stop tracing, analyzing it ##
  IRQ handler delay:                                    0.00 us (0.00 %)
  IRQ latency:                                          5.16 us
  Timerlat IRQ duration:                               47.41 us (42.53 %)
  Blocking thread:                                     50.50 us (45.30 %)
                     stress:278226                     50.50 us
  Blocking thread stack trace
                -> timerlat_irq
                -> __hrtimer_run_queues
                -> hrtimer_interrupt
                -> arch_timer_handler_phys
                -> handle_percpu_devid_irq
                -> generic_handle_domain_irq
                -> gic_handle_irq
                -> call_on_irq_stack
                -> do_interrupt_handler
                -> el1_interrupt
                -> el1h_64_irq_handler
                -> el1h_64_irq
                -> folios_put_refs
                -> free_pages_and_swap_cache
                -> __tlb_batch_free_encoded_pages
                -> tlb_flush_mmu
                -> unmap_page_range
                -> unmap_vmas
                -> vms_clear_ptes
                -> vms_complete_munmap_vmas
                -> do_vmi_align_munmap
                -> do_vmi_munmap
                -> __vm_munmap
                -> __arm64_sys_munmap
                -> invoke_syscall
                -> el0_svc_common.constprop.0
                -> do_el0_svc
                -> el0_svc
                -> el0t_64_sync_handler
                -> el0t_64_sync
------------------------------------------------------------------------
  Thread latency:                                     111.47 us (100%)

Max timerlat IRQ latency from idle: 1.98 us in cpu 3
  Saving trace to timerlat_trace.txt

OK, here at least something related to virtual memory is going on. I still wonder, though: how come the latencies are so large? Is the Raspberry Pi platform simply not suited to reaching lower latencies? Is everything working as expected? Could the Raspberry Pi patches be responsible for latency sources that are invisible in the traces? Is there any hope of getting these latencies a little more under control?

Kind regards,
FPS
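P.S.: In case the user-space side of my measurements matters: my mental model of what cyclictest measures (and, roughly, what the timerlat thread latency corresponds to) is a periodic SCHED_FIFO loop like the minimal sketch below. This is just a simplified illustration with an assumed 1 ms period and priority 90, not the actual cyclictest or timerlat code; if that picture is wrong somewhere, corrections are very welcome.

#include <stdio.h>
#include <time.h>
#include <sched.h>
#include <sys/mman.h>

#define NSEC_PER_SEC 1000000000L
#define PERIOD_NS       1000000L   /* assumed 1 ms wakeup period */

int main(void)
{
    struct sched_param sp = { .sched_priority = 90 };  /* assumed RT priority */
    struct timespec next, now;
    long max_ns = 0;

    mlockall(MCL_CURRENT | MCL_FUTURE);       /* avoid page faults in the loop */
    sched_setscheduler(0, SCHED_FIFO, &sp);   /* make this an RT thread (needs privileges) */

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (;;) {
        /* program the next absolute wakeup and sleep until then */
        next.tv_nsec += PERIOD_NS;
        if (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_nsec -= NSEC_PER_SEC;
            next.tv_sec++;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

        /* wakeup latency = how late we actually woke up */
        clock_gettime(CLOCK_MONOTONIC, &now);
        long lat = (now.tv_sec - next.tv_sec) * NSEC_PER_SEC
                 + (now.tv_nsec - next.tv_nsec);
        if (lat > max_ns) {
            max_ns = lat;
            printf("new max wakeup latency: %ld us\n", max_ns / 1000);
        }
    }
}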