On Fri, 19 Aug 2022 15:36:03 +0530 Prasad Pandit <ppandit@xxxxxxxxxx> wrote: > Hello, > > * While running oslat(8) tests in a guest running on RHEL kernel-rt > host, I'm seeing occasional peculiar latency spikes as below > === > # trace-cmd record -p nop --poll -M 1E -e all > /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T > 30 Hmm you need to turn off line wraps when pasting in output like this. Because the line wraps make it very difficult to follow. /me fixes it: > === > # trace-cmd record -p nop --poll -M 1E -e all /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30 > INFO: debugfs mountpoint: /sys/kernel/debug/tracing/ > Total runtime: 43200 seconds > Thread priority: SCHED_FIFO:1 > CPU list: 1-4 > CPU for main thread: 0 > Workload: no > Workload mem: 0 (KiB) > Preheat cores: 4 > > Pre-heat for 1 seconds... > Test starts... > /home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2 > with 67 us! > Stopping the test. > ... > <idle>-0 [002] 265.041487: tlb_flush: pages=0 reason=flush on task switch (0) > oslat-16074 [002] 265.041489: print: tracing_mark_write: /home/test/rt-tests/oslat: Starting thread on CPU 2. > oslat-16074 [002] 265.041490: kfree: (__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil) > oslat-16074 [002] 265.041490: sys_exit: NR 1 = 53 > oslat-16074 [002] 265.041490: sys_exit_write: 0x35 > oslat-16074 [002] 265.041490: user_enter: > oslat-16074 [002] 300.576710: user_exit: Here's where it exits the user app. Hmm, what's the 24us delay here? > oslat-16074 [002] 300.576734: irq_handler_entry: irq=21 name=virtio3 > oslat-16074 [002] 300.576734: irq_handler_exit: irq=21 ret=handled > oslat-16074 [002] 300.576734: sched_waking: comm=irq/21-virtio3 pid=219 prio=49 target_cpu=002 > oslat-16074 [002] 300.576736: sched_migrate_task: comm=irq/21-virtio3 pid=219 prio=49 orig_cpu=2 dest_cpu=0 > oslat-16074 [002] 300.576740: write_msr: 830, value fd > oslat-16074 [002] 300.576740: sched_wakeup: irq/21-virtio3:219 [49] success=1 CPU:000 > oslat-16074 [002] 300.576742: read_msr: 819, value 4 > oslat-16074 [002] 300.576754: irq_matrix_alloc: bit=38 cpu=0 online=1 avl=195 alloc=6 managed=1 online_maps=6 global_avl=1178, global_rsvd=10, total_alloc=28 > oslat-16074 [002] 300.576755: vector_update: irq=21 vector=38 cpu=0 prev_vector=34 prev_cpu=2 > oslat-16074 [002] 300.576755: vector_alloc: irq=21 vector=38 reserved=0 ret=0 > oslat-16074 [002] 300.576756: vector_config: irq=21 vector=38 cpu=0 apicdest=0x00000000 And here's where it enters. > oslat-16074 [002] 300.576773: user_enter: > oslat-16074 [002] 300.576778: user_exit: > oslat-16074 [002] 300.576778: sys_enter: NR 1 (3, 7fcbb683c300, 65, 0, 0, 7fcbb683b816) > oslat-16074 [002] 300.576778: sys_enter_write: fd: 0x00000003, buf: 0x7fcbb683c300, count: 0x00000065 > oslat-16074 [002] 300.576780: print: tracing_mark_write: /home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2 with 67 us! Stopping the test. > oslat-16074 [002] 300.576780: kfree: (__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil) > oslat-16074 [002] 300.576781: sys_exit: NR 1 = 101 > oslat-16074 [002] 300.576781: sys_exit_write: 0x65 > oslat-16074 [002] 300.576781: user_enter: > === > > * Looking at these traces it is difficult to see/conclude the source > of the said (67us) latency spike. > * Other logs I have for the same host/guest show > _call_function_entry/exit or timer softirq entries in traces. > * Peculiar is that these spikes seem to occur when oslat(8) is invoked > after the guest boot-up. > Subsequent oslat(8) sessions in the same guest run over 12 hrs > without any sign of latency spike. > > * Is there a way to find further clues or conclusively figure out the > source of such occasional latency spikes? > Since this is run on a guest, the host can obviously cause noise. If you have the latest trace-cmd (at least 3.1.2) and install that on both the host and guest, you can do a trace to get more info. You can even start trace-cmd on the guest and kick off the host. But you need to have vsockets enabled on the guest. To get the vsocket CID id, you can run the agent: # trace-cmd agent listening on @3:823 The @3 is the CID (823 is the port). Hit Ctrl-C on to kill the agent. Then on the host, run: # trace-cmd agent -P 3 Where 3 is the CID of the guest you want to connect the host (the same number from the 'trace-cmd agent'. That is, if it was @4:823, then use '-P 4'). The host agent above should also have a: listening on @2:823 or something. You'll need the host's CID too. Then on the guest, run: [ You do not need -p nop if you have -e used ] # trace-cmd record --poll -M 1E -e all --proxy 2 --name host -e kvm -e sched -e irq /home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T 30 oslat V 2.30 And then when it ends, you'll have two files. One that is trace.dat and one that is trace-host.dat. If everything worked well, the two would also be in sync with each other. Doing: # trace-cmd report -i trace-host.dat -i trace.dat Will show the host events happening with the guest events. If the host caused the latency, you will definitely see it. If you have the latest kernelshark, you can run: kernelshark trace-host.dat -a trace.dat And then select "Plots" -> "KVM Combo Plots" and in the popup, select the box for "all". Then you'll see plots for each virtual CPU and the host thread that represents it. -- Steve