On 09/12/16 17:24, Linas Vepstas wrote:
> I suppose I'm confused, but I recall that link resets are non-fatal. Fatal errors typically require that the PCI adapter be completely reset, that any adapter firmware be reloaded from scratch, and that the device driver kill all device state and start from scratch. It's huge.
> Is there a difference in terminology between an AER fatal error and what EEH/IBM people think of as a fatal error?
> If the fatal error is on a PCI device that is under a block device holding a file system, then (usually) there is no way to recover, because the block layer (and file system) cannot deal with a block device that disappeared and then reappeared a few seconds later. (Maybe some future ZFS or LVM or Btrfs might be able to deal with this, but not today.)
Is this still true? I'm not at all familiar with the block device side of it, but the cxlflash driver has reasonably full EEH support, including surviving a full PHB fence and complete reset.
--
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@xxxxxxxxxxx  IBM Australia Limited