On Mon, Aug 20, 2018 at 02:39:04PM +1000, Benjamin Herrenschmidt wrote:
> This partially reverts commit 7e9084b36740b2ec263ea35efb203001f755e1d8.
>
> This only reverts the Documentation/PCI/pci-error-recovery.txt changes
>
> Those changes are incorrect, they change the documentation to adapt
> to the (imho incorrect) AER implementation, and as a result making
> it no longer match the EEH implementation.
>
> I believe the policy described originally in this document is what
> should be implemented by everybody and the changes done by that commit
> would compromise, among others, the ability to recover from errors with
> storage devices.

I think we should align EEH, AER, and DPC as much as possible,
including making this documentation match the code.

Because of its name, this file *looks* like it should match the code
in the PCI core, i.e., drivers/pci/... So I think it would be
confusing to simply apply this revert without making a more direct
connection between this documentation and the powerpc-specific EEH
code.

If we can change AER & DPC to correspond to EEH, then I think it would
make sense to apply this revert along with those AER & DPC changes so
the documentation stays in step with the code.

> Signed-off-by: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
> ---
> Documentation/PCI/pci-error-recovery.txt | 35 +++++++-----------------
> 1 file changed, 10 insertions(+), 25 deletions(-)
>
> diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
> index 688b69121e82..0b6bb3ef449e 100644
> --- a/Documentation/PCI/pci-error-recovery.txt
> +++ b/Documentation/PCI/pci-error-recovery.txt
> @@ -110,7 +110,7 @@ The actual steps taken by a platform to recover from a PCI error
> event will be platform-dependent, but will follow the general
> sequence described below.
>
> -STEP 0: Error Event: ERR_NONFATAL
> +STEP 0: Error Event
> -------------------
> A PCI bus error is detected by the PCI hardware. On powerpc, the slot
> is isolated, in that all I/O is blocked: all reads return 0xffffffff,
> @@ -228,7 +228,13 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
> proceeds to STEP 4 (Slot Reset)
>
> -STEP 3: Slot Reset
> +STEP 3: Link Reset
> +------------------
> +The platform resets the link. This is a PCI-Express specific step
> +and is done whenever a fatal error has been detected that can be
> +"solved" by resetting the link.
> +
> +STEP 4: Slot Reset
> ------------------
>
> In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
> @@ -314,7 +320,7 @@ Failure).
> >>> However, it probably should.
>
>
> -STEP 4: Resume Operations
> +STEP 5: Resume Operations
> -------------------------
> The platform will call the resume() callback on all affected device
> drivers if all drivers on the segment have returned
> @@ -326,7 +332,7 @@ a result code.
> At this point, if a new error happens, the platform will restart
> a new error recovery sequence.
>
> -STEP 5: Permanent Failure
> +STEP 6: Permanent Failure
> -------------------------
> A "permanent failure" has occurred, and the platform cannot recover
> the device. The platform will call error_detected() with a
> @@ -349,27 +355,6 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
> for additional detail on real-life experience of the causes of
> software errors.
>
> -STEP 0: Error Event: ERR_FATAL
> -------------------
> -PCI bus error is detected by the PCI hardware. On powerpc, the slot is
> -isolated, in that all I/O is blocked: all reads return 0xffffffff, all
> -writes are ignored.
> -
> -STEP 1: Remove devices
> ---------------------
> -Platform removes the devices depending on the error agent, it could be
> -this port for all subordinates or upstream component (likely downstream
> -port)
> -
> -STEP 2: Reset link
> ---------------------
> -The platform resets the link. This is a PCI-Express specific step and is
> -done whenever a fatal error has been detected that can be "solved" by
> -resetting the link.
> -
> -STEP 3: Re-enumerate the devices
> ---------------------
> -Initiates the re-enumeration.
>
> Conclusion; General Remarks
> ---------------------------
>
>