On Fri, 2018-08-17 at 15:59 +0530, poza@xxxxxxxxxxxxxx wrote: > > The existing design is; framework dictating drivers what it should do > rather than driver deciding. > let me explain. The AER code design might be but it's not what was documented in Documentation/PCI/pci-error-recovery.txt before you changed it ! And it's not what about 20 years of doing PCI error recovery before there even was such a thing as PCIe or AER taught us on powerpc. The basic idea is that the "action" to be taken to recover is the "deepest" of all the ones requested by the framework and all the involved drivers (more than one device could be part of an error handling "domain" under some circumstances, for example when the failure originates from a bridge). What the PCIe spec says is rather uninteresting and can be trusted as far as policy is concern about as much as you can trust HW to be compliant with it, that is about zilch. The only interesting thing to take from the spec is that in the case of a FATAL error, you *must* at least reset the link. That's it. It doesn't mean that you shouldn't reset more (PERST for example) and it doesn't mean that you shouldn't reset the device or link for a NON_FATAL error either. And even if it did we should ignore it. There are many reasons why a driver might request a reset. For example, the error, while non-fatal to the link itself, might still have triggered a non-recoverable event in the device, either in HW or SW, and the only safe way out might well be a reset. On POWER systems, with EEH HW support, we have additional rules that say that on *any* non-corrected error (and under some circumstances if corrected errors cross a certain threshold) will isolate the device completely and the typical recovery from that is a reset (it doesn't *have* to be but that's what most drivers chose). This is rules design to prevent propagation of potentially corrupted data which is considering way more important than QoS of the device. We should see if we can make some of the AER and EEH code more common, the first thing would be to take things out of pcie/ since EEH isn't PCIe specific, and abstract the recovery process from the underlying architecture hooks to perform things like reset etc... > aer_isr > aer_isr_one_error > get_device_error_info >> this checks > { > /* Link is still healthy for IO reads */ >> Bjorn > this looks like a very strange thing to me, if link is not healthy we > are not even going for pcie_do_fatal_recovery() > pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, > &info->status); > pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, > &info->mask); > if (!(info->status & ~info->mask)) > return 0; > > { > handle_error_source > pcie_do_nonfatal_recovery or pcie_do_fatal_recovery > > now it does not seem to me that driver has some way of knowing if it has > to return PCI_ERS_RESULT_CAN_RECOVER ? or PCI_ERS_RESULT_NEED_RESET? You don't know that. A driver can look at the device state at the enable_mmio phase and decide that the device is sufficiently sane that it can continue... or not. A driver could have a policy of always resetting. Etc... > let us take an example of storage driver here e.g. NVME > nvme_error_detected() relies heavily on pci_channel_state_t which is > passed by error framework (err.c) So what ? This is one example... the IPR SCSI driver is much more complex in its error recovery afaik. > I understand your point of view and original intention of design that > let driver dictate whether it wants slot_reset ? It's both the driver *and* the framework, which ever wants the "harshest" solution wins. If either wants a reset we do a reset. > but driver relies on framework to pass that information in order to take > decision. No. Some drivers might. This is by no means the absolute rule. There are also all other type of errors that aren't covered by the AER categories that fit in that framework such as iommu errors (at least for us on powerpc). > In case of pci_channel_io_frozen (which is ERR_FATAL), most of the > drivers demand link reset. No, they demand a *device* reset. ie, hot reset or PERST. It's not about the link. It's about the device. > but in case of ERR_NONFATAL, the link is supposed to be functional and > there is no need for link reset. It depends whether the error was corrected or not. If it was not corrected, then something bad did happen and the framework has no way to know what is the appropriate remediation for a given device. > and exactly all the PCI based drivers returns > PCI_ERS_RESULT_CAN_RECOVER, which is valid. You are mixing the AER terminology of FATAL/NON_FATAL which is AER specific and the framework defined channel states, which are "functional", "frozen" and "permanently dead". These don't necessarily match 1:1. > so I think to conclude, you are referring more of your views towards > pcie_do_fatal_recovery() Both. But yes, pcie_do_fatal_recovery() should definitely NOT do an unplug/replug blindly. The whole point of the error handlers was to allow driver to recover from these things without losing the links to their respective consumers (such as filesystems for block devices). > which does > > > remove_devices from the downstream link > > reset_link() (Secondary bus reset) > > re-enumerate. > > and this is what DPC defined, and AER is just following that. And is completely and utterly broken and useless as a policy. Again, somebody cared too much about the spec rather than what makes actual sense in practice. You should *only* remove devices if the driver doesn't have the necessary callbacks. In fact, remove and re-plug device is probably the worst thing you can do, I don't remember all the details but it's been causing us endless issues with EEH, which is why as I mentioned, we are looking more toward an unbind and rebind of the driver, but that's a details. If the driver has the error callbacks it should participate in the recovery and NOT be unbound or the device unplugged. That's completely useless for Linux to do that since it means you cannot recover from an error on a storage driver. Ben. > Regards, > Oza. > > > > > Also because multiple devices, at least on power, can share a domain, > > we can get into a situation where one device requires a reset, so all > > will get reset, and their respective drivers need to be notified. > > > > Note: it's not so much about resetting the *link* than about resetting > > the *device*. > > > > > although > > > I see following note in the code as well. > > > /* > > > * TODO: Should call platform-specific > > > * functions to reset slot before calling > > > * drivers' slot_reset callbacks? > > > */ > > > > Yup :-) > > > > Cheers, > > Ben. > > > > > Regards, > > > Oza. > > > > > > > > > > > Also we should do a hot reset at least, not just a link reset. > > > > > > > > > report_resume() > > > > > > > > > > If you suggest how it is broken, it will help me to understand. > > > > > probably you might want to point out what are the calls need to be > > > > > added > > > > > or removed or differently handled, specially storage point of view. > > > > > > > > > > > > > Regards, > > > > > Oza. > > > > > > > > > > > > > > > > > Keep in mind that those callbacks were designed originally for EEH > > > > > > (which predates AER), and so was the spec written. > > > > > > > > > > > > We don't actually use the AER code on POWER today, so we didn't notice > > > > > > how broken the implementation was :-) > > > > > > > > > > > > We should fix that. > > > > > > > > > > > > Either we can sort all that out by email, or we should plan some kind > > > > > > of catch-up, either at Plumbers (provided I can go there) or maybe a > > > > > > webex call. > > > > > > > > > > > > Cheers, > > > > > > Ben.