Re: [PATCH] Partial revert of "PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices"

On Mon, Aug 20, 2018 at 02:39:04PM +1000, Benjamin Herrenschmidt wrote:
> This partially reverts commit 7e9084b36740b2ec263ea35efb203001f755e1d8.
> 
> This only reverts the Documentation/PCI/pci-error-recovery.txt changes
> 
> Those changes are incorrect, they change the documentation to adapt
> to the (imho incorrect) AER implementation, and as a result making
> it no longer match the EEH implementation.
> 
> I believe the policy described originally in this document is what
> should be implemented by everybody and the changes done by that commit
> would compromise, among others, the ability to recover from errors with
> storage devices.

I think we should align EEH, AER, and DPC as much as possible,
including making this documentation match the code.

Because of its name, this file *looks* like it should match the code
in the PCI core, i.e., drivers/pci/...  So I think it would be
confusing to simply apply this revert without making a more direct
connection between this documentation and the powerpc-specific EEH
code.

If we can change AER & DPC to correspond to EEH, then I think it would
make sense to apply this revert along with those AER & DPC changes so
the documentation stays in step with the code.

> Signed-off-by: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
> ---
>  Documentation/PCI/pci-error-recovery.txt | 35 +++++++-----------------
>  1 file changed, 10 insertions(+), 25 deletions(-)
> 
> diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
> index 688b69121e82..0b6bb3ef449e 100644
> --- a/Documentation/PCI/pci-error-recovery.txt
> +++ b/Documentation/PCI/pci-error-recovery.txt
> @@ -110,7 +110,7 @@ The actual steps taken by a platform to recover from a PCI error
>  event will be platform-dependent, but will follow the general
>  sequence described below.
>  
> -STEP 0: Error Event: ERR_NONFATAL
> +STEP 0: Error Event
>  -------------------
>  A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
>  is isolated, in that all I/O is blocked: all reads return 0xffffffff,
> @@ -228,7 +228,13 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
>  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
>  proceeds to STEP 4 (Slot Reset)
>  
> -STEP 3: Slot Reset
> +STEP 3: Link Reset
> +------------------
> +The platform resets the link.  This is a PCI-Express specific step
> +and is done whenever a fatal error has been detected that can be
> +"solved" by resetting the link.
> +
> +STEP 4: Slot Reset
>  ------------------
>  
>  In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
> @@ -314,7 +320,7 @@ Failure).
>  >>> However, it probably should.
>  
>  
> -STEP 4: Resume Operations
> +STEP 5: Resume Operations
>  -------------------------
>  The platform will call the resume() callback on all affected device
>  drivers if all drivers on the segment have returned
> @@ -326,7 +332,7 @@ a result code.
>  At this point, if a new error happens, the platform will restart
>  a new error recovery sequence.
>  
> -STEP 5: Permanent Failure
> +STEP 6: Permanent Failure
>  -------------------------
>  A "permanent failure" has occurred, and the platform cannot recover
>  the device.  The platform will call error_detected() with a
> @@ -349,27 +355,6 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
>  for additional detail on real-life experience of the causes of
>  software errors.
>  
> -STEP 0: Error Event: ERR_FATAL
> --------------------
> -PCI bus error is detected by the PCI hardware. On powerpc, the slot is
> -isolated, in that all I/O is blocked: all reads return 0xffffffff, all
> -writes are ignored.
> -
> -STEP 1: Remove devices
> ---------------------
> -Platform removes the devices depending on the error agent, it could be
> -this port for all subordinates or upstream component (likely downstream
> -port)
> -
> -STEP 2: Reset link
> ---------------------
> -The platform resets the link.  This is a PCI-Express specific step and is
> -done whenever a fatal error has been detected that can be "solved" by
> -resetting the link.
> -
> -STEP 3: Re-enumerate the devices
> ---------------------
> -Initiates the re-enumeration.
>  
>  Conclusion; General Remarks
>  ---------------------------
> 
> 
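
For reference, the step numbering this revert restores can be read as a
small decision function over the results returned by the affected drivers.
The following is only a rough sketch in plain C, not kernel code: the enum
names and next_step_after_mmio_enabled() are hypothetical stand-ins for the
PCI_ERS_RESULT_* values. Only the NEED_RESET to Slot Reset transition is
taken directly from the quoted text; the DISCONNECT, all-RECOVERED, and
fallback-to-Link-Reset branches are my assumptions about how the remaining
steps fit together.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver result codes (PCI_ERS_RESULT_*). */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

/* Step numbers as they read once the revert is applied. */
enum next_step {
	STEP_LINK_RESET = 3,
	STEP_SLOT_RESET = 4,
	STEP_RESUME = 5,
	STEP_PERMANENT_FAILURE = 6,
};

/*
 * Pick the next recovery step from the per-driver results.
 *
 * From the quoted text: any NEED_RESET leads to STEP 4 (Slot Reset).
 * Assumed here: any DISCONNECT leads to STEP 6, unanimous RECOVERED
 * leads to STEP 5, and anything else falls back to STEP 3 (Link Reset).
 */
enum next_step next_step_after_mmio_enabled(const enum ers_result *votes,
					    size_t n)
{
	int need_reset = 0;
	int all_recovered = 1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (votes[i] == ERS_DISCONNECT)
			return STEP_PERMANENT_FAILURE;
		if (votes[i] == ERS_NEED_RESET)
			need_reset = 1;
		if (votes[i] != ERS_RECOVERED)
			all_recovered = 0;
	}

	if (need_reset)
		return STEP_SLOT_RESET;
	if (all_recovered)
		return STEP_RESUME;
	return STEP_LINK_RESET;
}
```

In this sketch a single driver asking for a reset is enough to send the
whole segment to Slot Reset, even when every other driver reports that it
has recovered.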