Re: [PATCH] Rewrite MSI-HOWTO

Matthew Wilcox <matthew@xxxxxx> · Tue, 30 Sep 2008 15:26:34 -0600

On Sat, Sep 27, 2008 at 12:41:47PM -0600, Grant Grundler wrote:
> > +2. What are MSIs?
> > +
> > +Message Signaled Interrupt (MSI) is an optional feature for devices
> > +which implement the PCI Local Bus Specification Revision 2.2 and later.
> > +MSI enables a device to generate an interrupt by sending a normal write
> > +to a special address in the host chipset that is translated into a CPU
> > +interrupt.  MSI-X (introduced in PCI 3.0) is a more flexible scheme
> > +than MSI.  It allows for greater control over what interrupts can be
> > +generated and supports a greater number of interrupts.
> 
> Suggestion to combine the last two sentences:
> MSI-X (introduced in PCI 3.0) allows for greater control over how interrupts
> are allocated, how they are directed at CPUs, and supports a greater number
> of interrupts.
> 
> > +
> > +A device indicates MSI support by implementing the MSI or the MSI-X
> > +capability in its PCI configuration space.  It may implement both the
> > +MSI capability structure and the MSI-X capability structure, but only
> > +one may be enabled.

I just rewrote this section.  How about this:

2. What are MSIs?

A Message Signalled Interrupt is a write from the device to a special
address which causes an interrupt to be received by the CPU.

The MSI capability was first specified in PCI 2.2 and was later enhanced
in PCI 3.0 to allow each interrupt to be masked individually.  The MSI-X
capability was also introduced with PCI 3.0.  It supports more interrupts
per device than MSI and allows interrupts to be independently configured.

Devices may support both MSI and MSI-X, but only one can be enabled at
a time.
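
To make "implementing the capability" concrete, here is a minimal
userspace sketch (not kernel code) that walks the capability list in a
snapshot of a device's configuration space.  The register offsets and
capability IDs are the standard PCI values; the find_cap() helper and
the 256-byte 'cfg' snapshot are assumptions made for illustration only.

```c
#include <stdint.h>

/* Standard PCI config-space offsets and capability IDs. */
#define PCI_STATUS           0x06  /* status register (low byte used here) */
#define PCI_STATUS_CAP_LIST  0x10  /* set if a capability list is present */
#define PCI_CAPABILITY_LIST  0x34  /* offset of the first capability */
#define PCI_CAP_ID_MSI       0x05
#define PCI_CAP_ID_MSIX      0x11

/*
 * Hypothetical helper: search a 256-byte snapshot of config space for
 * capability 'id'.  Returns its offset, or 0 if the device lacks it.
 */
uint8_t find_cap(const uint8_t cfg[256], uint8_t id)
{
	uint8_t pos;
	int ttl = 48;	/* guard against malformed, looping lists */

	if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
		return 0;

	pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;
	while (pos >= 0x40 && ttl-- > 0) {
		if (cfg[pos] == id)		/* byte 0: capability ID */
			return pos;
		pos = cfg[pos + 1] & 0xFC;	/* byte 1: next pointer */
	}
	return 0;
}
```

A device that supports both would return a non-zero offset for
PCI_CAP_ID_MSI and for PCI_CAP_ID_MSIX; which one is enabled is a
separate matter controlled by each capability's own control register.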


> > +3. Why use MSIs?
> > +
> > +Pin-based PCI interrupts are often shared amongst several devices.
> > +To support this, the kernel must call each interrupt handler associated
> > +with an interrupt which leads to increased latency for the interrupt
> > +handlers which are registered last.
> > +
> > +When a device performs DMA to memory and raises a pin-based interrupt, it
> 
> "to memory" is redundant.

No ... DMA can be performed from memory, not just to memory.

> Perhaps "When a device completes a DMA write operation and ..."

Then you're relying on the reader to know that 'write' means 'from the
perspective of the device' and not 'away from the cpu'.  How about:

When a device writes to memory, then raises a pin-based interrupt, it

> > +is possible that the interrupt may arrive before all the data has arrived
> > +in memory (this becomes more likely with devices behind PCI-PCI bridges).
> > +In order to assure that all DMA has arrived in memory, the interrupt
> > +handler must read a register on the device which raised the interrupt.
> 
> "DMA" is an action and not an object. s/DMA/DMA'd data/

How about just 'data'?  'writes' would also work, but might be a bit
jargony.

> > +PCI ordering rules require that the writes be flushed to memory before
> > +the value can be returned from the register.
> 
> Be specific. s/writes/DMA writes/ and s/value/MMIO read/.
> Or to rewrite it:
> +PCI transaction ordering rules require DMA writes reach memory before
> +the MMIO read operation can complete.

I think this is too jargon-heavy.  Besides, it doesn't have to be an MMIO
read, it could be a port I/O or even a config space read.  I do like 'reach'
instead of 'flush' though.

PCI transaction ordering rules require that all the data reaches memory
before the value can be returned from the register.

> >...  MSI avoids this problem
> > +as the interrupt-generating write cannot pass the DMA writes, so by the
> > +time the interrupt is raised, the driver knows that the DMA has completed.
> 
> To be consistent, the last phrase should be:
>   ..., the driver is certain DMA data has reached memory.
> 
> [ Nit: the data just has to reach the CPU cache coherency domain so it's
> visible to the CPUs...assuming DMA is in general cache coherent. But the
> average reader will understand "reaches memory" just fine.]

Yes, I agree.  It would be too confusing to launch into a full
discussion of cache behaviour here.

> > +
> > +Using MSI enables the device to support more interrupts, allowing
> > +each interrupt to be specialised to a different purpose.  This allows
> > +infrequent conditions (such as errors) to be given their own interrupt and
> > +not have to check for errors during the normal interrupt handling path.
> 
> We should note that this (and every previous) version of Linux supports only
> one MSI per device. Only MSI-X support allows a Linux device driver to
> use more than one interrupt.

I've changed it to use 'MSIs' rather than just 'MSI' here, and expanded
the section a little.  Here's the whole of the new section 3:

3. Why use MSIs?

There are three reasons why using MSIs can give an advantage over
traditional pin-based interrupts.

Pin-based PCI interrupts are often shared amongst several devices.
To support this, the kernel must call each interrupt handler associated
with an interrupt, which leads to increased latency for the interrupt
handlers which are registered last.  MSIs are never shared, so this
problem cannot arise.

When a device writes data to memory, then raises a pin-based interrupt,
it is possible that the interrupt may arrive before all the data has
arrived in memory (this becomes more likely with devices behind PCI-PCI
bridges).  In order to ensure that all the data has arrived in memory,
the interrupt handler must read a register on the device which raised
the interrupt.  PCI transaction ordering rules require that all the data
arrives in memory before the value can be returned from the register.
Using MSIs avoids this problem as the interrupt-generating write cannot
pass the data writes, so by the time the interrupt is raised, the driver
knows that all the data has arrived in memory.

PCI devices can only support a single pin-based interrupt per function.
Often drivers have to query the device to find out what event has
occurred, slowing down interrupt handling for the common case.  With
MSIs, a device can support more interrupts, allowing each interrupt
to be specialised to a different purpose.  One possible design gives
infrequent conditions (such as errors) their own interrupt, which lets
the driver handle the normal interrupt path more efficiently.
Other possible designs include giving one interrupt to each packet queue
in a network card or each port in a storage controller.
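
The third point can be sketched with a toy userspace model (not driver
code).  Everything here is made up for illustration: the toy_dev
structure, the EV_* status bits and the handler names.  It only shows
that a single shared-purpose handler has to read a status register to
find out what happened, while per-purpose handlers do not.

```c
#include <stdint.h>

/* Hypothetical event bits in a made-up device status register. */
#define EV_RX   0x1u
#define EV_TX   0x2u
#define EV_ERR  0x4u

/* Toy device model; 'status_reads' counts (slow) register reads. */
struct toy_dev {
	uint32_t status;
	unsigned int status_reads;
	unsigned int rx, tx, err;
};

/* Models a read across the bus; reading clears the pending events. */
uint32_t toy_read_status(struct toy_dev *d)
{
	uint32_t s = d->status;

	d->status = 0;
	d->status_reads++;
	return s;
}

/* Single pin-based interrupt: must ask the device what happened. */
void toy_intx_handler(struct toy_dev *d)
{
	uint32_t s = toy_read_status(d);

	if (s & EV_RX)
		d->rx++;
	if (s & EV_TX)
		d->tx++;
	if (s & EV_ERR)
		d->err++;
}

/* One interrupt per purpose: the vector itself identifies the event. */
void toy_rx_handler(struct toy_dev *d)  { d->rx++; }
void toy_tx_handler(struct toy_dev *d)  { d->tx++; }
void toy_err_handler(struct toy_dev *d) { d->err++; }
```

In the model, handling one receive event through toy_intx_handler()
costs a status read, whereas toy_rx_handler() costs none; a real device
and driver would of course differ in the details.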


> > +4.3.1 pci_enable_msix
> > +
> > +int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
> > +
> > +Calling this function asks the PCI subsystem to allocate 'nvec' MSIs.
> > +The 'entries' argument is a pointer to an array of msix_entry structs
> > +which should be at least 'nvec' entries in size.  On success, the
> > +function will return 0 and the device will have been switched into
> > +MSI-X interrupt mode.  The 'vector' elements in each entry will have
> > +been filled in with the interrupt number.
> > +
> > +If this function returns a negative number, it indicates an error and
> > +the driver should not attempt to allocate any more MSI-X interrupts for
> > +this device.  If it returns a positive number, it indicates the maximum
> > +number of interrupt vectors that could have been allocated.
> > +
> > +This function, in contrast with pci_enable_msi(), does not adjust
> > +pdev->irq.  The device will not generate interrupts for this interrupt
> > +number once MSI-X is enabled.  The device driver is responsible for
> > +keeping track of the interrupts assigned to the MSI-X vectors so it can
> > +free them again later.
> 
> We need to state the driver should call request_irq() to register a handler
> for each allocated msix_entry.

How about this:

@@ -162,7 +170,8 @@ The 'entries' argument is a pointer to an array of msix_entr
 which should be at least 'nvec' entries in size.  On success, the
 function will return 0 and the device will have been switched into
 MSI-X interrupt mode.  The 'vector' elements in each entry will have
-been filled in with the interrupt number.
+been filled in with the interrupt number.  The driver should then call
+request_irq() for each 'vector' that it decides to use.

?
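
For reference, the calling pattern this describes might look roughly
like the sketch below.  It is written so it compiles in userspace: the
types and the pci_enable_msix() and request_irq() bodies are mock
stand-ins that only mimic the return convention documented above, not
the kernel implementations, and example_setup_msix(), example_handler()
and the vector numbers are hypothetical.

```c
#include <stddef.h>

/* ---- Userspace stand-ins (assumptions, NOT the kernel definitions) ---- */
typedef int irqreturn_t;
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);

struct pci_dev { int hw_vectors; };	/* mock: vectors that can be granted */
struct msix_entry { unsigned int vector; unsigned short entry; };

/* Mock: 0 on success, negative on error, positive = how many could work. */
int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
{
	int i;

	if (dev->hw_vectors <= 0)
		return -1;
	if (nvec > dev->hw_vectors)
		return dev->hw_vectors;
	for (i = 0; i < nvec; i++)
		entries[i].vector = 100 + i;	/* made-up interrupt numbers */
	return 0;
}

/* Mock: same argument list as the kernel function, always succeeds. */
int request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
		const char *name, void *dev_id)
{
	(void)irq; (void)handler; (void)flags; (void)name; (void)dev_id;
	return 0;
}

/* Hypothetical handler; a real one would service the device. */
irqreturn_t example_handler(int irq, void *dev_id)
{
	(void)irq; (void)dev_id;
	return 1;
}

/*
 * Driver-side pattern: ask for 'want' vectors, retry with fewer if told
 * to, then call request_irq() for each 'vector' the driver will use.
 * Returns the number of vectors in use, or a negative value on failure.
 */
int example_setup_msix(struct pci_dev *pdev, struct msix_entry *entries,
		       int want, irq_handler_t handler, void *data)
{
	int i, rc, nvec = want;

	for (i = 0; i < nvec; i++)
		entries[i].entry = i;	/* which MSI-X table slot to use */

	while ((rc = pci_enable_msix(pdev, entries, nvec)) > 0)
		nvec = rc;		/* fewer available: retry with that many */
	if (rc < 0)
		return rc;		/* error: do not try MSI-X again */

	for (i = 0; i < nvec; i++) {
		rc = request_irq(entries[i].vector, handler, 0, "example", data);
		if (rc)
			return rc;	/* real code would also unwind here */
	}
	return nvec;
}
```

With the mock offering 3 vectors, a request for 8 gets the positive
return value 3, retries with 3, and then registers three handlers.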

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."