About tracing clues

Prasad Pandit <ppandit@xxxxxxxxxx> · Fri, 19 Aug 2022 15:36:03 +0530

Hello,

* While running oslat(8) tests in a guest running on RHEL kernel-rt
host, I'm seeing occasional peculiar latency spikes as below
===
# trace-cmd record -p nop --poll -M 1E -e all
/home/test/rt-tests/oslat --cpu-list 1-4 --rtprio 1 --duration 12h -T
30
oslat V 2.30
INFO: debugfs mountpoint: /sys/kernel/debug/tracing/
Total runtime:          43200 seconds
Thread priority:        SCHED_FIFO:1
CPU list:               1-4
CPU for main thread:    0
Workload:               no
Workload mem:           0 (KiB)
Preheat cores:          4

Pre-heat for 1 seconds...
Test starts...
/home/test/rt-tests/oslat: Trace threshold (30 us) triggered on cpu 2
with 67 us!
Stopping the test.
...
             <idle>-0     [002]   265.041487: tlb_flush:
pages=0 reason=flush on task switch (0)
           oslat-16074 [002]   265.041489: print:
tracing_mark_write: /home/test/rt-tests/oslat: Starting thread on CPU
2.
           oslat-16074 [002]   265.041490: kfree:
(__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil)
           oslat-16074 [002]   265.041490: sys_exit:             NR 1 = 53
           oslat-16074 [002]   265.041490: sys_exit_write:       0x35
           oslat-16074 [002]   265.041490: user_enter:
           oslat-16074 [002]   300.576710: user_exit:
           oslat-16074 [002]   300.576734: irq_handler_entry:
irq=21 name=virtio3
           oslat-16074 [002]   300.576734: irq_handler_exit:
irq=21 ret=handled
           oslat-16074 [002]   300.576734: sched_waking:
comm=irq/21-virtio3 pid=219 prio=49 target_cpu=002
           oslat-16074 [002]   300.576736: sched_migrate_task:
comm=irq/21-virtio3 pid=219 prio=49 orig_cpu=2 dest_cpu=0
           oslat-16074 [002]   300.576740: write_msr:            830, value fd
           oslat-16074 [002]   300.576740: sched_wakeup:
irq/21-virtio3:219 [49] success=1 CPU:000
           oslat-16074 [002]   300.576742: read_msr:             819, value 4
           oslat-16074 [002]   300.576754: irq_matrix_alloc:
bit=38 cpu=0 online=1 avl=195 alloc=6 managed=1 online_maps=6
global_avl=1178, global_rsvd=10, total_alloc=28
           oslat-16074 [002]   300.576755: vector_update:
irq=21 vector=38 cpu=0 prev_vector=34 prev_cpu=2
           oslat-16074 [002]   300.576755: vector_alloc:
irq=21 vector=38 reserved=0 ret=0
           oslat-16074 [002]   300.576756: vector_config:
irq=21 vector=38 cpu=0 apicdest=0x00000000
           oslat-16074 [002]   300.576773: user_enter:
           oslat-16074 [002]   300.576778: user_exit:
           oslat-16074 [002]   300.576778: sys_enter:            NR 1
(3, 7fcbb683c300, 65, 0, 0, 7fcbb683b816)
           oslat-16074 [002]   300.576778: sys_enter_write:      fd:
0x00000003, buf: 0x7fcbb683c300, count: 0x00000065
           oslat-16074 [002]   300.576780: print:
tracing_mark_write: /home/test/rt-tests/oslat: Trace threshold (30 us)
triggered on cpu 2 with 67 us!
Stopping the test.
           oslat-16074 [002]   300.576780: kfree:
(__audit_syscall_exit+0x249) call_site=ffffffff95fb8889 ptr=(nil)
           oslat-16074 [002]   300.576781: sys_exit:             NR 1 = 101
           oslat-16074 [002]   300.576781: sys_exit_write:       0x65
           oslat-16074 [002]   300.576781: user_enter:
===

* Looking at these traces it is difficult to see/conclude the source
of the said (67us) latency spike.
* Other logs I have for the same host/guest show
_call_function_entry/exit or timer softirq entries in traces.
* Peculiar is that these spikes seem to occur when oslat(8) is invoked
after the guest boot-up.
   Subsequent oslat(8) sessions in the same guest run over 12 hrs
without any sign of latency spike.

* Is there a way to find further clues or conclusively figure out the
source of such occasional latency spikes?

Thank you.
---
  - P J P