Re: [PATCH] PCI: Enable Function Level Reset(FLR) for PCI-E

"Yang, Sheng" <sheng.yang@xxxxxxxxx> · Wed, 24 Sep 2008 14:01:54 +0800

On Wednesday 24 September 2008 12:34:30 Matthew Wilcox wrote:
> On Wed, Sep 24, 2008 at 11:26:41AM +0800, Yang, Sheng wrote:
> > On Wednesday 24 September 2008 03:16:40 Matthew Wilcox wrote:
> > > /**
> > >  * pcie_do_flr() - execute Function Level Reset
> > >  * @dev: Device to reset
> > >  *
> > >  * Function Level Reset is a feature introduced by the PCIe spec
> > > version * XXX.  It allows one function of a multifunction PCI device to
> > > be * reset without affecting the other functions in the same device. *
> > > Resetting the device will make the contents of PCI configuration space
> > > * random, so any caller of this must be prepared to reinitialise the *
> > > device including MSI, bus mastering, BARs, decoding IO and memory
> > > spaces, * etc.
> > >  *
> > >  * Returns 0 if the device was successfully reset or an error code if
> > >  * the device doesn't support FLR or we failed to reset the device
> > >  * properly.
> > >  */
> >
> > Matthew, thanks to be so detail...
>
> You're welcome.  I wonder if perhaps the function would be better named
> pci_reset_function() or pci_function_reset().  I also notice that I
> didn't implement what I documented above wrt the reset failing.

OK, I will add the check for completion.

> > > > +int pcie_do_flr(struct pci_dev *dev)
> > > > +{
> > > > +     u16 status;
> > > > +     u32 cap;
> > > > +     int exppos = pci_find_capability(dev, PCI_CAP_ID_EXP);
> > > > +
> > > > +     if (!exppos)
> > > > +             return -ENOTSUPP;
> > >
> > >                 return -ENOTTY;
> > >
> > > > +     pci_read_config_dword(dev, exppos + PCI_EXP_DEVCAP, &cap);
> > > > +     if (!(cap & PCI_EXP_DEVCAP_FLR))
> > > > +             return -ENOTSUPP;
> > >
> > >                 return -ENOTTY;
> >
> > I will modify this. But I still don't understand why -ENOTTY... I think
> > there are some history about it?
>
> Yes.  ENOTTY means "This operation cannot be performed on this device".
> EOPNOTSUPP is some networking error, I forget the details now.
>
> > > > +
> > > > +     pci_block_user_cfg_access(dev);
> > > > +
> > > > +     /* Clear entire Command register */
> > > > +     pci_write_config_word(dev, PCI_COMMAND, 0);
> > > > +     pci_write_config_word(dev, exppos + PCI_EXP_DEVCTL, 0);
> > >
> > > Is this wise?  Clearing the 'disable INTx' bit (10) might cause the
> > > device to generate interrupts on its interrupt pin which could cause
> > > lockups.  I'm also not sure we need to clear any of the bits in
> > > PCI_EXP_DEVCTL -- can you explain?
> > >
> > >From the implemention note of FLR, "Software clears the entire Command
> >
> > register, disabling the Function from issuing any new Requests." So here
> > we are.
>
> Ah yes, it was on page 384 so I didn't notice it.
>
> I think this implementation note is incorrect in that clearing all the
> bits does not prevent the device from generating INTx interrupts.  Might
> be worth raising with the PCI committee.

Feel a little strange... This kind of obvious issue should be expected to 
raise quickly... I will mail them for the reason.
>
> This implementation note also says, "Software that's performing
> the FLR synchronizes with other software that might potentially access
> the Function directly, and ensures such accesses do not occur during
> this algorithm."
>
> If we have a shared interrupt and the device being reset still has a
> handler attached, it's entirely possible that another device will
> generate an interrupt and cause the interrupt handler for this device to
> try to access the device during the FLR.
>
> We could fix this by temporarily unregistering the interrupt handler for
> this device, or calling disable_irq().

I think unregister interrupt handler is better, after we set the interrupt 
disable bit in control register.

> > And from PCI spec 6.2.2. Device Control
> >
> > "The Command register provides coarse control over a device's ability to
> > generate and respond to PCI cycles. When a 0 is written to this register,
> > the device is logically disconnected from the PCI bus for all accesses
> > except configuration accesses. "
>
> To be fair, that's talking about Conventional PCI which doesn't transmit
> INTx over the PCI bus, they're on sideband signals.

Thanks for point it out...
>
> > > > +             dev_info(&dev->dev, "trasaction pending when doing FLR,
> > > > " +                     "waiting for another 5 seconds\n");
> > >
> > > I don't ever want to see 'FLR' in a message the user might see.  If you
> > > have to, spell it out as 'Function Level Reset' or just say "resetting
> > > function" or something.  We use far too many acronyms as it is.  What
> > > was wrong with the text I proposed in my first review?
> >
> > Um... I just think that message is too ambiguous to know the exactly
> > reason, e.g. "still busy", user can ask "what are you waiting for?" So I
> > want to show what the function is doing here, and didn't aware the
> > acronyms problem...
> >
> > What's about "Still busy after 100ms when resetting function. Waiting 5
> > seconds\n"?
>
> Looks fine.  Maybe we can be more pithy, how about:
> "Function reset incomplete after 100ms, sleeping for 5 seconds\n"
> ?

OK, that's enough for user. :)
>
> > > Let's talk about what else we might need to do here.  The spec says:
> > >
> > >    Any outstanding INTx interrupt asserted by the Function must be
> > >    de-asserted by sending the corresponding Deassert_INTx Message prior
> > >    to starting the FLR.
> > >
> > > So we need to make sure there are no pending interrupts for this
> > > function before issuing the FLR.  Is that the responsibility of the
> > > caller?  We should say so if it is.  If we're going to do it in this
> > > function, we should use disable_irq() to be sure that the interrupt
> > > isn't currently executing on some other CPU (having first checked that
> > > the device isn't using MSI-X, because if it is, we don't use the
> > > interrupt in pdev->irq).  If we're using a shared interrupt, other
> > > users are going to suffer.
> > >
> > > I still don't really understand how this flr fits in with every other
> > > part of the system.  Is the device supposed to be quiesced before this
> > > is called?  If so, a lot of what it does is pointless.  Or is it
> > > supposed to do the quiescing?  If so, it's missing some bits.
> >
> > I think the device didn't suppose to be completely quiesced before doing
> > flr. The spec also said
> >
> > "Note that the controls that enable the Function to initiate requests on
> > PCI Express are cleared, including Bus Master Enable, MSI Enable, and the
> > like, effectively causing the Function to *become quiescent* on the
> > Link."
>
> That paragraph is describing the effect of the FLR on the function, not
> what the software stack does first.

Yeah, that's what I means...
>
> > So I think the caller should take care of INTx, and this routine cover
> > others. If the interrupts are shared, seems we can't prevent other user
> > from suffering a little. But luckily, the most modern device which need
> > to be reset should have used MSI/MSI-X, so I think it's reasonable.
>
> They ought to use MSIs, yes.  I fear that they won't.
>
> > But what's the missing bits your mentioned?
>
> Things like the interrupts, the MSI/MSI-X state, the BARs ... looking at
> it again, there's pci_save_state() and pci_restore_state() which should
> do the trick.
>
> The use cases mentioned in the PCIe spec imply that there are different
> use cases.  Ones where the driver has detached from the device in an
> orderly manner already (the 'partitioned system' example), and error
> handling cases (where we might still have a driver attached).  I almost
> wonder if we want two functions, one that assumes the device is already
> quiesced and just goes ahead and resets it, and another that knows it
> has to do all the locking, disabling, etc.

Seems it's reasonable. I will try to update patch to two functions.

> By the way, what happens if the admin loads a module that wants to probe
> this device while it's being reset?  It seems like it might well send
> configuration cycles to the card, or even mmio cycles.  I almost wonder
> if we shouldn't remove the device from the system while its being reset,
> then re-add it after the reset has completed.

It's unbelievable that the admin can catch at that small window... Yeah, just 
kidding, this can't be a excuse. How about add a lock on probing? And seems 
something like pci_set_power_state also got the similar problem?

--
regards
Yang, Sheng
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html