Re: PCI error recovery for the Emulex LPFC

linas@xxxxxxxxxxxxxx (Linas Vepstas) · Tue, 31 Oct 2006 11:19:02 -0600

On Tue, Oct 31, 2006 at 08:51:08AM -0500, James Smart wrote:
> Linas,
> 
> I don't know of anything in this area.
> I also need a deeper understanding of what the error was, and how
> it was injected. This plays into it.

When the PCI slot is frozen, the PCI bridge blocks all writes
to the device and returns 0xffffffff for every read. All DMA
is prevented from going through.
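
To make the read behaviour concrete, here is a small user-space sketch
in C. It is not driver code: struct fake_bar, fake_read32() and
read_looks_frozen() are hypothetical names invented for illustration.
It only models the rule above, that a frozen slot answers every 32-bit
read with all ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value a frozen slot returns for every 32-bit read. */
#define ALL_ONES 0xffffffffu

/* Hypothetical stand-in for a device register (not a kernel type). */
struct fake_bar {
    bool frozen;      /* true once the bridge has isolated the slot */
    uint32_t status;  /* value the device would return when healthy */
};

/* Hypothetical stand-in for an MMIO read through the bridge. */
static uint32_t fake_read32(const struct fake_bar *bar)
{
    return bar->frozen ? ALL_ONES : bar->status;
}

/* True if a read value looks like it came from a frozen slot. */
static bool read_looks_frozen(uint32_t val)
{
    return val == ALL_ONES;
}
```

An all-ones value is only a hint, since some registers can legitimately
read back as 0xffffffff; a real driver would confirm the channel state
with the platform rather than rely on this test alone.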

> Also, PCI error recovery is not a simple task. 

I've implemented it for the ipr and symbios SCSI controllers,
and for the e100, e1000, ixgb and s2io ethernet cards.  If you
review the actual code, you will see it's fairly tiny. Mostly
I've discovered that if the device driver has clean, clear-cut
device-up/device-down routines, then recovery is straightforward.
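
As a rough illustration of why clean up/down routines keep the recovery
code small, here is a user-space model in C. Everything prefixed toy_ is
hypothetical and made up for this sketch. The three callbacks loosely
mirror the error_detected, slot_reset and resume hooks of struct
pci_error_handlers, but the signatures are simplified and this is not
how any of the drivers above is actually written.

```c
#include <stdbool.h>

/* Hypothetical result codes, loosely modelled on PCI_ERS_RESULT_*. */
enum toy_result { TOY_NEED_RESET, TOY_RECOVERED, TOY_DISCONNECT };

/* Hypothetical device state; a real driver keeps far more than this. */
struct toy_dev {
    bool up;           /* device initialised and usable */
    bool io_running;   /* I/O queues active */
    bool slot_frozen;  /* bridge has isolated the slot */
    int hard_resets;   /* number of slot resets performed */
};

/* Clear-cut device-down routine: stop I/O, mark the device unusable. */
static void toy_down(struct toy_dev *d)
{
    d->io_running = false;
    d->up = false;
}

/* Clear-cut device-up routine: fails while the slot is still frozen. */
static bool toy_up(struct toy_dev *d)
{
    if (d->slot_frozen)
        return false;
    d->up = true;
    return true;
}

/* Error reported: shut down and ask for a hard reset. */
static enum toy_result toy_error_detected(struct toy_dev *d)
{
    toy_down(d);
    return TOY_NEED_RESET;
}

/* Slot has been reset: bring the device back up from scratch. */
static enum toy_result toy_slot_reset(struct toy_dev *d)
{
    return toy_up(d) ? TOY_RECOVERED : TOY_DISCONNECT;
}

/* Recovery finished: restart I/O. */
static void toy_resume(struct toy_dev *d)
{
    d->io_running = d->up;
}

/* Stand-in for the platform: freeze, notify, reset the slot, resume. */
static bool toy_platform_recover(struct toy_dev *d)
{
    d->slot_frozen = true;
    if (toy_error_detected(d) != TOY_NEED_RESET)
        return false;
    d->hard_resets++;          /* the hard reset of the slot */
    d->slot_frozen = false;
    if (toy_slot_reset(d) != TOY_RECOVERED)
        return false;
    toy_resume(d);
    return d->up && d->io_running;
}
```

In this model each callback is a thin wrapper around toy_down() or
toy_up(), which is the point: the recovery-specific code adds very
little on top of routines the driver already needs.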

FWIW, I've run some of the kernels & devices through 48-hour runs 
with thousands of errors injected and successfully recovered from.

> There are many
> aspects to the adapter messaging interface and the effects of the
> PCI error recovery scheme that have to be closely looked at. DMA
> errors can be very fatal, even if the PCI bus survives. In many
> cases, the only safe recovery is a hard adapter reset (with little
> to no interaction with the adapter to clean up).

Currently, all of the device drivers I mention above perform the
recovery with a hard reset. The generic API does not require this,
but it seems to be the simplest, most robust and reliable route.
I experimented with non-hard-reset on the s2io, which I got "almost
working". I don't know that it's worth the trouble.

Just to be clear, I'm referring to the infrastructure documented
in Documentation/pci-error-recovery.txt

--linas
