John Sigler wrote:
[ Note: cross-posted to linux-rt-users and oprofile-list ]
I'm running kernel 2.6.20.7-rt8 on x86 (1267-MHz P3) with the
following debug options:
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ENABLE_MUST_CHECK is not set
# CONFIG_MAGIC_SYSRQ is not set
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_DEBUG_FS is not set
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_KERNEL is not set
CONFIG_LOG_BUF_SHIFT=14
CONFIG_STACKTRACE=y
CONFIG_EVENT_TRACE=y
CONFIG_FUNCTION_TRACE=y
CONFIG_WAKEUP_TIMING=y
CONFIG_LATENCY_TRACE=y
# CONFIG_CRITICAL_PREEMPT_TIMING is not set
# CONFIG_CRITICAL_IRQSOFF_TIMING is not set
CONFIG_WAKEUP_LATENCY_HIST=y
CONFIG_LATENCY_TIMING=y
CONFIG_LATENCY_HIST=y
CONFIG_MCOUNT=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_FRAME_POINTER=y
CONFIG_UNWIND_INFO=y
CONFIG_STACK_UNWIND=y
CONFIG_EARLY_PRINTK=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_DOUBLEFAULT=y
http://linux.kernel.free.fr/latency/config-2.6.20.7-rt8-adlink-latency
Here's my situation:
A SCHED_RR process with priority 80 is performing blocking reads of
1316-byte chunks from a PCI device. The data comes in at ~38 Mbit/s.
Therefore, in a perfect world, my process would wake up every
1316 * 8 / 38e6 = 277 µs to handle the latest chunk.
However, my process periodically blocks for up to 400 µs instead of
277 µs. Naturally, on the next iteration, it blocks for less than 277 µs
(the PCI board has 1316 bytes ready every 277 µs regardless).
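For reference, the expected period is just a quick calculation (a sketch;
the 38 Mbit/s figure is the nominal stream rate of my setup):

```python
# Expected wakeup period for 1316-byte chunks arriving at ~38 Mbit/s.
CHUNK_BYTES = 1316
RATE_BPS = 38e6  # nominal stream rate, bits per second

period_us = CHUNK_BYTES * 8 / RATE_BPS * 1e6
print(f"expected period: {period_us:.1f} us")  # ~277 us

# Worst observed blocking time, for comparison:
worst_us = 400
print(f"extra delay: {worst_us - period_us:.0f} us")  # ~123 us
```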
I've captured a histogram of read latencies.
It is expressed in units of 512 ns.
e.g. there were 23,126,460 occurrences of LAT ≈ 278 µs (543 * 512 ns)
http://linux.kernel.free.fr/latency/histogram.txt
http://linux.kernel.free.fr/latency/read_latency.png
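In case it helps with reading the histogram, this is how I convert bin
indices to microseconds (a throwaway helper, not part of my capture code;
the 512 ns tick is the unit my board reports):

```python
# The read-latency histogram counts occurrences in units of 512 ns.
TICK_NS = 512

def bin_to_us(bin_index):
    """Convert a histogram bin index (512 ns ticks) to microseconds."""
    return bin_index * TICK_NS / 1000.0

# The dominant bin in the capture: LAT = 543 ticks
print(f"{bin_to_us(543):.1f} us")  # 278.0 us
```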
The system is otherwise completely idle. In other words, 100% of the
CPU is dedicated to the single task of waiting for an interrupt
request, servicing it, and pushing the data to user land.
I used OProfile to frequently sample the instruction pointer while my
application ran all night long.
http://linux.kernel.free.fr/latency/oprofile_report_all_night.txt
CPU: PIII, speed 1266.7 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with
a unit mask of 0x00 (No unit mask) count 60000
samples % symbol name
958075714 90.6470 default_idle
25637397 2.4256 mask_and_ack_8259A
10697948 1.0122 __copy_to_user_ll
9345347 0.8842 oprofiled
9246368 0.8748 Dta1xxIsr
6732131 0.6370 enable_8259A_irq
5153650 0.4876 __schedule
1873157 0.1772 irq_entries_start
1729897 0.1637 increment_tail
1565354 0.1481 clock_gettime
1429908 0.1353 Dta1xxDmaTransfer
1428107 0.1351 ioread16
1252095 0.1185 system_call
1081768 0.1023 try_to_wake_up
983687 0.0931 cpu_idle
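To put those numbers in perspective, I tallied the interrupt-path entries
(sample counts copied from the report above; backing out the grand total
from default_idle's share is an approximation):

```python
# Sample counts from the CPU_CLK_UNHALTED report above.
samples = {
    "default_idle":       958075714,
    "mask_and_ack_8259A":  25637397,
    "enable_8259A_irq":     6732131,
}

# default_idle was 90.6470% of all samples, so estimate the grand total:
total = samples["default_idle"] / 0.906470

# Share of unhalted cycles spent masking/unmasking the 8259A PIC:
pic_pct = (samples["mask_and_ack_8259A"]
           + samples["enable_8259A_irq"]) / total * 100
print(f"8259A mask/ack + enable: {pic_pct:.2f}% of unhalted cycles")  # ~3.06%
```

So the PIC handling alone is about a third of my non-idle time.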
I thought default_idle called HLT via safe_halt(). Since I'm only
counting UNHALTED clock cycles, I'm surprised to see so many samples
attributed to default_idle. Can someone explain that?
Is it possible that one of these functions periodically delays the
delivery of data to my SCHED_RR application?
Here's a /proc/latency_trace dump. How should I interpret it?
# cat /proc/latency_trace
preemption latency trace v1.1.5 on 2.6.20.7-rt8
--------------------------------------------------------------------
latency: 26 us, #2/2, CPU#0 | (M:rt VP:0, KP:0, SP:1 HP:1)
-----------------
| task: softirq-timer/0-4 (uid:0 nice:0 policy:1 rt_prio:50)
-----------------
_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
<...>-4 0D..1 26us : trace_stop_sched_switched
(__sched_text_start)
vim:ft=help
Shouldn't there be many lines in this output?
Do I need an additional kernel configuration option?
Is there a different diagnostic I can run to determine why data delivery
to my app is periodically delayed by ~100 µs?
I'm open to all suggestions. I'm running out of ideas.
# cat /proc/latency_hist/wakeup_latency/CPU0
#Minimum latency: 0 microseconds.
#Average latency: 0 microseconds.
#Maximum latency: 26 microseconds.
#Total samples: 14925698
#There are 0 samples greater or equal than 10240 microseconds
#usecs samples
0 3682745
1 9487737
2 1748372
3 1192
4 3689
5 50
6 14
7 14
8 9
9 46
10 995
11 799
12 6
13 8
14 4
15 6
16 3
17 0
18 1
19 0
20 0
21 0
22 3
23 1
24 0
25 1
26 3
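As a sanity check, a short script over the same /proc file reproduces the
totals (a sketch, assuming the two-column "usecs samples" format shown
above, with '#' comment lines):

```python
# Summarize a /proc/latency_hist/wakeup_latency/CPU0-style dump:
# two columns ("usecs samples"), comment lines starting with '#'.
def summarize(lines):
    hist = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        usecs, count = line.split()
        hist[int(usecs)] = int(count)
    total = sum(hist.values())
    worst = max(u for u, c in hist.items() if c > 0)
    return total, worst

# A few rows from the dump above:
sample = ["#usecs samples", "0 3682745", "1 9487737", "26 3"]
print(summarize(sample))  # (13170485, 26)
```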
I would think that I might have a clearer picture if I could request
the last N instruction pointer samples from OProfile, to figure out
what the CPU has been doing for the past several hundred µs.
I'd like to hear what the OProfile devs think of this approach.
It seems far from trivial to implement, though?
Is there some (any) additional information I could provide?
I'm open (really) to all comments and suggestions.
Regards.