Re: [PATCH] Rewrite MSI-HOWTO

Grant Grundler <grundler@xxxxxxxxxxxxxxxx> · Tue, 30 Sep 2008 16:32:47 -0600

On Tue, Sep 30, 2008 at 03:26:34PM -0600, Matthew Wilcox wrote:
...
> > > +A device indicates MSI support by implementing the MSI or the MSI-X
> > > +capability in its PCI configuration space.  It may implement both the
> > > +MSI capability structure and the MSI-X capability structure, but only
> > > +one may be enabled.
> 
> I just rewrote this section.  How about this:
> 
> 2. What are MSIs?
> 
> A Message Signalled Interrupt is a write from the device to a special
> address which causes an interrupt to be received by the CPU.
> 
> The MSI capability was first specified in PCI 2.2 and was later enhanced
> in PCI 3.0 to allow each interrupt to be masked individually.  The MSI-X
> capability was also introduced with PCI 3.0.  It supports more interrupts
> per device than MSI and allows interrupts to be independently configured.
> 
> Devices may support both MSI and MSI-X, but only one can be enabled at
> a time.

Looks good.

> 
> 
> > > +3. Why use MSIs?
> > > +
> > > +Pin-based PCI interrupts are often shared amongst several devices.
> > > +To support this, the kernel must call each interrupt handler associated
> > > +with an interrupt which leads to increased latency for the interrupt
> > > +handlers which are registered last.
> > > +
> > > +When a device performs DMA to memory and raises a pin-based interrupt, it
> > 
> > "to memory" is redundant.
> 
> No ... DMA can be performed from memory, not just to memory.

Ah....I know that as "DMA Write" (vs DMA Read).
In this case you are right in being specific about the direction.

> > Perhaps "When a device completes a DMA write operation and ..."
> 
> Then you're relying on the reader to know that 'write' means 'from the
> perspective of the device' and not 'away from the cpu'.  How about:
> 
> When a device writes to memory, then raises a pin-based interrupt, it

Ok. That would work too.

I would prefer to define DMA Read/Write and MMIO Read/Write at the
top of the document and use those terms consistently since they are
well known and correlate well with both standards and chipset specs.

> 
> > > +is possible that the interrupt may arrive before all the data has arrived
> > > +in memory (this becomes more likely with devices behind PCI-PCI bridges).
> > > +In order to assure that all DMA has arrived in memory, the interrupt
> > > +handler must read a register on the device which raised the interrupt.
> > 
> > "DMA" is an action and not an object. s/DMA/DMA'd data/
> 
> How about just 'data'?  'writes' would also work, but might be a bit
> jargon.

"data" is too vague and can be misinterpreted.
Ditto for "writes". "DMA writes" captures it correctly.

If people need to learn 4 terms to understand the issue, I'd rather
have that than try to write wishy-washy stuff that looks easy but
isn't precise. Some will be annoyed later when they finally figure out
what was meant after reading this document and several others
5 times.

> > > +PCI ordering rules require that the writes be flushed to memory before
> > > +the value can be returned from the register.
> > 
> > Be specific. s/writes/DMA writes/ and s/value/MMIO read/.
> > Or to rewrite it:
> > +PCI transaction ordering rules require DMA writes reach memory before
> > +the MMIO read operation can complete.
> 
> I think this is too jargon.

Then define the jargon at the top. Or point at some good
definitions. Getting rid of the terms will not make this
issue more accessible. If people can not understand a
definition of "DMA write", then I'm skeptical they will
understand why MSI avoids particular issues.

> Besides, it doesn't have to be an MMIO
> read, it could be a portIO or even config space read.

True - the doc should say that. Device driver maintainers/writers
should be aware of the options. In fact, because of differences
in master abort handling on "some platforms" between Config space
and MMIO space, I think it's good to include this observation.

> I do like 'reach' of 'flush' though.
> 
> PCI transaction ordering rules require that all the data reaches memory
> before the value can be returned from the register.

*sigh* This will probably work.

> > >...  MSI avoids this problem
> > > +as the interrupt-generating write cannot pass the DMA writes, so by the
> > > +time the interrupt is raised, the driver knows that the DMA has completed.
> > 
> > To be consistent, the last phrase should be:
> >   ..., the driver is certain DMA data has reached memory.
> > 
> > [ Nit: the data just has to reach the CPU cache coherency DMA so it's
> > visible to the CPUs...assuming DMA is in general cache coherent. But average
> > reader will understand "reaches memory" just fine.]
> 
> Yes, I agree.  It would be too confusing to launch into a full
> discussion of cache behaviour here.

*nod*
What about the consistent use of terminology?
I don't want people to think "DMA has completed" is something different
than "data reaches memory" (used above).

> > > +Using MSI enables the device to support more interrupts, allowing
> > > +each interrupt to be specialised to a different purpose.  This allows
> > > +infrequent conditions (such as errors) to be given their own interrupt and
> > > +not have to check for errors during the normal interrupt handling path.
> > 
> > We should note this (and previous) version of linux only supports one
> > MSI per device. Only MSI-X support allows a linux device drivers to
> > use more than one interrupt.
> 
> I've changed it to use 'MSIs' rather than just 'MSI' here, and expanded
> the section a little.  Here's the whole of the new section 3:
> 
> 3. Why use MSIs?
> 
> There are three reasons why using MSIs can give an advantage over
> traditional pin-based interrupts.
> 
> Pin-based PCI interrupts are often shared amongst several devices.
> To support this, the kernel must call each interrupt handler associated
> with an interrupt which leads to increased latency for the interrupt
> handlers which are registered last.

This is true if the system is idle.

If the CPU has just started running the last handler, it can take just
as long to get back to the first handler in the list as well.  Think of
it as a code loop with the average latency being a function of the size
of the entire loop.

This is why I was suggesting a less precise statement.

>  MSIs are never shared, so this problem cannot arise.
> 
> When a device writes data to memory, then raises a pin-based interrupt, 
> it is possible that the interrupt may arrive before all the data has
> arrived in memory (this becomes more likely with devices behind PCI-PCI
> bridges).  In order to ensure that all the data has arrived in memory,
> the interrupt handler must read a register on the device which raised
> the interrupt.  PCI transaction ordering rules require that all the data
> arrives in memory before the value can be returned from the register.
> Using MSIs avoids this problem as the interrupt-generating write cannot 
> pass the data writes, so by the time the interrupt is raised, the driver
> knows that all the data has arrived in memory.

Yeah, if you insist on avoiding certain terms, I think most people will
understand this as intended.

> PCI devices can only support a single pin-based interrupt per function.
> Often drivers have to query the device to find out what event has
> occurred, slowing down interrupt handling for the common case.  With
> MSIs, a device can support more interrupts, allowing each interrupt
> to be specialised to a different purpose.  One possible design gives
> infrequent conditions (such as errors) their own interrupt which allows
> the driver to handle the normal interrupt handling path more efficiently.
> Other possible designs include giving one interrupt to each packet queue
> in a network card or each port in a storage controller.

Looks good.

> 
> 
> > > +4.3.1 pci_enable_msix
> > > +
> > > +int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
> > > +
> > > +Calling this function asks the PCI subsystem to allocate 'nvec' MSIs.
> > > +The 'entries' argument is a pointer to an array of msix_entry structs
> > > +which should be at least 'nvec' entries in size.  On success, the
> > > +function will return 0 and the device will have been switched into
> > > +MSI-X interrupt mode.  The 'vector' elements in each entry will have
> > > +been filled in with the interrupt number.
> > > +
> > > +If this function returns a negative number, it indicates an error and
> > > +the driver should not attempt to allocate any more MSI-X interrupts for
> > > +this device.  If it returns a positive number, it indicates the maximum
> > > +number of interrupt vectors that could have been allocated.
> > > +
> > > +This function, in contrast with pci_enable_msi(), does not adjust
> > > +pdev->irq.  The device will not generate interrupts for this interrupt
> > > +number once MSI-X is enabled.  The device driver is responsible for
> > > +keeping track of the interrupts assigned to the MSI-X vectors so it can
> > > +free them again later.
> > 
> > We need to state the driver should call request_irq() to register a handler
> > for each allocated msix_entry.
> 
> How about this:
> 
> @@ -162,7 +170,8 @@ The 'entries' argument is a pointer to an array of msix_entr
>  which should be at least 'nvec' entries in size.  On success, the
>  function will return 0 and the device will have been switched into
>  MSI-X interrupt mode.  The 'vector' elements in each entry will have
> -been filled in with the interrupt number.
> +been filled in with the interrupt number.  The driver should then call
> +request_irq() for each 'vector' that it decides to use.
> 
> ?

Yup - that'll do.

If you are set on avoiding use of "DMA write" and "MMIO Read" terms, the
proposals above will probably work and can add my:
   Signed-off-by: Grant Grundler <grundler@xxxxxxxxxxxxxxxx>

thanks,
grant

> 
> -- 
> Matthew Wilcox				Intel Open Source Technology Centre
> "Bill, look, we understand that you're interested in selling us this
> operating system, but compare it to ours.  We can't possibly take such
> a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html