Re: Kernel 4.6.7-rt13: Intel Ethernet driver igb causes huge latencies in cyclictest

On Mon, Sep 26, 2016 at 11:12:19AM +0000, Koehrer Mathias (ETAS/ESW5) wrote:
[..]
> > Nope. There is nothing that might prevent preemption (unless I missed something).
> > However, if there are a lot of posted writes to the HW, the read might
> > force them to occur *now*, which will stall the CPU until it happens.
> > Try to comment the function(s) out to see if that spike vanishes.
> > That igb_read_phy_reg_82580() function also invokes another one for the acquire
> > callback.
> 
> I tried hard to identify a single function that is causing the spikes.
> However, I had no success; I could not pin it down to a single function.
> What I did instead was use the kernel tracer to track down the issue.
> I now run cyclictest with only one thread, using the following command:
> 
> # cyclictest -a -i 100 -m -n -p 80 -t 1  -b 25 -f -O latency-format
> 
> Please see the attached fragment of the trace output.
> It contains the last couple of cycles from cyclictest before it stopped because
> the latency was larger than 25us.
> 
> I repeated that test multiple times,
> and I always noticed igb-related functions just before the final cycle of cyclictest.
> To me it looks as if the calls in the trace from line 3621 to 3641 always occur
> before the latency exceeds the threshold.
> 
> Could you please have a look at the attached trace file?
> Do you have any idea what might cause the issue here?

The trace data _seems_ to show that CPU0 is stuck with interrupts
disabled/handling hardirqs from 3702725us (presumably the hrtimer wakeup
for the cyclictest thread), to 3702765us when cyclictest actually gets
scheduled in, which is about 40us.

Right after the initial hrtimer wakeup event, there appear to be
eight or so of these similar-looking traces:

  kworker/-6071    0dN..1.. 3702729us : do_IRQ <-ret_from_intr
  kworker/-6071    0dN..1.. 3702729us : irq_enter <-do_IRQ
  kworker/-6071    0dN..1.. 3702729us : rcu_irq_enter <-irq_enter
  kworker/-6071    0dN..1.. 3702729us : preempt_count_add <-irq_enter
  kworker/-6071    0dN.h1.. 3702729us : exit_idle <-do_IRQ
  kworker/-6071    0dN.h1.. 3702729us : handle_irq <-do_IRQ
  kworker/-6071    0dN.h1.. 3702729us : handle_edge_irq <-handle_irq
  kworker/-6071    0dN.h1.. 3702729us : _raw_spin_lock <-handle_edge_irq
  kworker/-6071    0dN.h1.. 3702729us : preempt_count_add <-_raw_spin_lock
  kworker/-6071    0dN.h2.. 3702730us : irq_may_run <-handle_edge_irq
  kworker/-6071    0dN.h2.. 3702730us : irq_chip_ack_parent <-handle_edge_irq
  kworker/-6071    0dN.h2.. 3702730us : apic_ack_edge <-irq_chip_ack_parent
  kworker/-6071    0dN.h2.. 3702730us : __irq_complete_move <-apic_ack_edge
  kworker/-6071    0dN.h2.. 3702730us : irq_move_irq <-apic_ack_edge
  kworker/-6071    0dN.h2.. 3702730us : handle_irq_event <-handle_edge_irq
  kworker/-6071    0dN.h2.. 3702730us : preempt_count_sub <-handle_irq_event
  kworker/-6071    0dN.h1.. 3702730us : handle_irq_event_percpu <-handle_irq_event
  kworker/-6071    0dN.h1.. 3702731us : irq_default_primary_handler <-handle_irq_event_percpu
  kworker/-6071    0dN.h1.. 3702731us : __irq_wake_thread <-handle_irq_event_percpu
  kworker/-6071    0dN.h1.. 3702731us : wake_up_process <-__irq_wake_thread
  kworker/-6071    0dN.h1.. 3702731us : try_to_wake_up <-wake_up_process
  kworker/-6071    0dN.h1.. 3702731us : _raw_spin_lock_irqsave <-try_to_wake_up
  kworker/-6071    0dN.h1.. 3702731us : preempt_count_add <-_raw_spin_lock_irqsave
  kworker/-6071    0dN.h2.. 3702731us : _raw_spin_lock <-try_to_wake_up
  kworker/-6071    0dN.h2.. 3702731us : preempt_count_add <-_raw_spin_lock
  kworker/-6071    0dN.h3.. 3702731us : ttwu_do_activate.constprop.104 <-try_to_wake_up
  kworker/-6071    0dN.h3.. 3702731us : activate_task <-ttwu_do_activate.constprop.104
  kworker/-6071    0dN.h3.. 3702731us : update_rq_clock.part.91 <-activate_task
  kworker/-6071    0dN.h3.. 3702732us : enqueue_task_rt <-activate_task
  kworker/-6071    0dN.h3.. 3702732us : dequeue_rt_stack <-enqueue_task_rt
  kworker/-6071    0dN.h3.. 3702732us : dequeue_top_rt_rq <-dequeue_rt_stack
  kworker/-6071    0dN.h3.. 3702732us : update_rt_migration <-enqueue_task_rt
  kworker/-6071    0dN.h3.. 3702732us : _raw_spin_lock <-enqueue_task_rt
  kworker/-6071    0dN.h3.. 3702732us : preempt_count_add <-_raw_spin_lock
  kworker/-6071    0dN.h4.. 3702732us : preempt_count_sub <-enqueue_task_rt
  kworker/-6071    0dN.h3.. 3702732us : enqueue_top_rt_rq <-enqueue_task_rt
  kworker/-6071    0dN.h3.. 3702732us : ttwu_do_wakeup <-ttwu_do_activate.constprop.104
  kworker/-6071    0dN.h3.. 3702732us : check_preempt_curr <-ttwu_do_wakeup
  kworker/-6071    0dN.h3.. 3702732us : resched_curr <-check_preempt_curr
  kworker/-6071    0dN.h3.. 3702732us : task_woken_rt <-ttwu_do_wakeup
  kworker/-6071    0dN.h3.. 3702733us : preempt_count_sub <-try_to_wake_up
  kworker/-6071    0dN.h2.. 3702733us : _raw_spin_unlock_irqrestore <-try_to_wake_up
  kworker/-6071    0dN.h2.. 3702733us : preempt_count_sub <-_raw_spin_unlock_irqrestore
  kworker/-6071    0dN.h1.. 3702733us : note_interrupt <-handle_irq_event_percpu
  kworker/-6071    0dN.h1.. 3702733us : _raw_spin_lock <-handle_irq_event
  kworker/-6071    0dN.h1.. 3702733us : preempt_count_add <-_raw_spin_lock
  kworker/-6071    0dN.h2.. 3702733us : preempt_count_sub <-handle_edge_irq
  kworker/-6071    0dN.h1.. 3702733us : irq_exit <-do_IRQ
  kworker/-6071    0dN.h1.. 3702733us : preempt_count_sub <-irq_exit
  kworker/-6071    0dN..1.. 3702733us : rcu_irq_exit <-irq_exit

That looks to me like the normal "forced primary" interrupt handling
path, which simply wakes the created irqthread.

However, what isn't clear from the data is _which_ irqthread(s) are being
woken up.  Presumably, given the prior igb traces, it's one of the igb
interrupts, but it would be nice to confirm that using the sched_wakeup
event or other means.
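
One way to check (a sketch, assuming tracefs is mounted at
/sys/kernel/debug/tracing and that cyclictest's -b/-f setup leaves
separately enabled events alone) would be to turn on the scheduler
wakeup/switch events before the run, so the woken thread's comm/pid
shows up interleaved with the function trace:

  # enable sched events alongside cyclictest's function tracing
  cd /sys/kernel/debug/tracing
  echo 1 > events/sched/sched_wakeup/enable
  echo 1 > events/sched/sched_switch/enable

  # re-run the failing case; the sched_wakeup lines in the trace then
  # carry the comm/pid of the woken thread (for an igb irqthread,
  # something like "irq/52-eth2-TxRx-0" -- names/numbers will differ)
  cyclictest -a -i 100 -m -n -p 80 -t 1 -b 25 -f -O latency-format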

Similar to the PCI write-buffering cases, we've also observed that when
the PCI interconnect is bogged down with traffic from other masters,
it's possible that a read from the CPU can be stalled (in some cases,
for quite a while, but it depends on the PCI topology, the switches used,
their configurations, etc.).
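
If contention from other bus masters is a suspect, it may also be worth
dumping the PCI topology and the link status of the slot the igb port
sits in (the 01:00.0 address below is only a placeholder for the actual
BDF of the NIC):

  # which bridges/switches the NIC shares with other masters
  lspci -tv

  # link speed/width and latency-related settings of the NIC itself
  lspci -vvv -s 01:00.0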

So, one plausible narrative here: it's conceivable that the
rd32(E1000_RXSTMPH) in igb_ptp_rx_hang() is "stuck" on the bus somewhere,
presumably due to load from other masters (the trace seems to indicate
it might be as much as 20us), with CPU execution stalled awaiting its
completion.  Meanwhile, the CPU is encountering interrupts from other
sources (local APIC, etc.).  Once the read "completes", the CPU is
finally able to service all of the interrupts that have piled up, which
is why we see these 9 wakeups happening in a row in the traces.

The question is: how can we confirm or refute this, or are there other,
more plausible scenarios we could be running into?
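
One rough way to start (a sketch on my part; it assumes
CONFIG_FUNCTION_GRAPH_TRACER is available and needs no driver rebuild)
would be to point the function_graph tracer at igb_ptp_rx_hang() and
look at the reported per-call durations; if the stalled-read theory
holds, the occasional call should show up taking tens of microseconds:

  cd /sys/kernel/debug/tracing
  echo igb_ptp_rx_hang > set_graph_function
  echo function_graph > current_tracer
  echo 1 > tracing_on

  # ... reproduce the latency spike, then look at the durations
  # reported on the closing-brace lines of each igb_ptp_rx_hang() call
  grep -B1 -A5 igb_ptp_rx_hang trace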

   Julia