Re: [RFC] e1000e: Add delays after writing to registers

Henrik Austad <henrik@xxxxxxxxx> · Fri, 6 Nov 2015 06:53:49 +0100

On Tue, Nov 03, 2015 at 04:10:23PM -0600, Jonathan David wrote:
> On 11/03/2015 01:42 PM, Henrik Austad wrote:
> >On Tue, Nov 03, 2015 at 11:43:21AM -0600, Jonathan David wrote:
> >>On 10/22/2015 12:59 AM, Henrik Austad wrote:
> 
> >>>>Adding a delay after long series of writes gives them time to
> >>>>complete, and for higher priority tasks to run unimpeded.
> >>>
> >>>Aren't we running with threaded interrupts?
> >>>
> >>>What happens to the thread(s) pushing data to the network?
> >>>What about xmit-buffer once it is full? Which thread will block on send or
> >>>have its sk_buff dropped?
> >>
> >>All of this is totally irrelevant to the problem we are seeing.
> >
> >If this is irrelevant, why hack at the network-driver, hmm?
> 
> It is relevant to the network driver, as this is where the symptoms were
> discovered; however, it has no relation to the packet delivery path. This is
> related purely to link configuration.

I was under the impression that PCI link configuration/training was down
to speed etc., not how many MMIO reads/writes it could do. Then again, a
lot of this stuff is pure (black) magic.

> >>The e1000x driver itself is not responsible for the delay here.
> >
> >... then why hack the network-driver?
> 
> Lack of better known options.
> 
> >>The issue is with PCI where issuing a large number of MMIO writes
> >>followed by a read (to force said writes to execute) will stall the CPU.
> >>When the CPU is stalled, no interrupts are serviced, including the local
> >>apic timer interrupt, which was responsible for waking up cyclictest.
> >>This behavior was observed within traces gathered from cyclictest with
> >>ftrace enabled.
> >
> >So you get bogged down with interrupts disabled;
> 
> No, interrupts are entirely enabled while the PCI MMIO writes/read are
> issued; but the local apic timer still arrives late, presumably because the
> CPU is waiting to complete whatever writes remain in the buffer.

Heh, strange. Is the interrupt signal itself delivered late as well, or
just the handling of it?
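
To check that I read the claim right, here is a toy userspace model of
it. This is not driver code: read_stall(), its parameters and the "one
write drained per tick of delay" rule are all made up for illustration,
and real posted-write behaviour will differ. Writes are queued for free,
each delay lets the device drain some of them, and the final flushing
read stalls for whatever is still queued.

```c
#include <assert.h>

/*
 * Toy model only; every name and cost here is an assumption.
 *
 * nwrites:     total posted MMIO writes issued
 * burst:       writes issued back to back before each delay
 * delay_ticks: length of each delay; the device is assumed to
 *              complete one queued write per tick of delay
 *
 * Returns the number of writes still queued when the flushing read
 * is issued, i.e. the ticks the CPU is assumed to stall on it.
 */
static unsigned read_stall(unsigned nwrites, unsigned burst,
                           unsigned delay_ticks)
{
    unsigned pending = 0;
    unsigned i;

    assert(burst > 0);
    for (i = 1; i <= nwrites; i++) {
        pending++;                      /* posted write, returns at once */
        if (i % burst == 0) {           /* end of a burst: delay */
            unsigned drained = pending < delay_ticks ?
                               pending : delay_ticks;
            pending -= drained;
        }
    }
    return pending;                     /* cost of the flushing read */
}
```

In this model read_stall(1024, 16, 0) leaves all 1024 writes for the
read to wait on, while read_stall(1024, 16, 16) leaves none. Whether the
real hardware behaves anything like this is exactly what I would like a
trace to show.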

> I think this might be the root of our miscommunication. You are asking good
> questions about threaded interrupts, etc, but it isn't clear how they are
> related to the specific problem we are seeing.

Perhaps a trace of the problem could be shared?

A full function-trace with irq-events and timer-events would be
appreciated :)
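
Something along these lines should set that up. It is only a sketch:
trace_write() and trace_setup() are hypothetical helpers, and the tracefs
root is passed in because the mount point varies (commonly
/sys/kernel/debug/tracing). The files written (tracing_on,
current_tracer, events/irq/enable, events/timer/enable) are the usual
ftrace controls; the same is normally done with echo from a shell.

```c
#include <stdio.h>

/* Write one string to a control file below the tracefs root.
 * Returns 0 on success, -1 on any error. */
static int trace_write(const char *root, const char *file, const char *val)
{
    char path[512];
    FILE *f;
    int n = snprintf(path, sizeof(path), "%s/%s", root, file);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fputs(val, f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Select the function tracer, enable the irq and timer event groups,
 * then switch tracing on. Stops at the first write that fails. */
static int trace_setup(const char *root)
{
    if (trace_write(root, "tracing_on", "0") ||
        trace_write(root, "current_tracer", "function") ||
        trace_write(root, "events/irq/enable", "1") ||
        trace_write(root, "events/timer/enable", "1") ||
        trace_write(root, "tracing_on", "1"))
        return -1;
    return 0;
}
```

Call trace_setup() as root with the tracefs path, reproduce the latency
spike with cyclictest, then grab the "trace" file from the same
directory.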

-- 
Henrik Austad