On Tue, Jul 17, 2018 at 02:03:29PM -0500, Bjorn Helgaas wrote: > Hi Oza, > > Thanks for doing this! > > On Fri, Jun 22, 2018 at 05:58:09AM -0400, Oza Pawandeep wrote: > > pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset > > callbacks in case of ERR_NONFATAL. > > IIRC, the current strategy is: > > ERR_COR: log only > ERR_NONFATAL: call driver callbacks (pci_error_handlers) > ERR_FATAL: remove device and re-enumerate > > So these slot_reset callbacks are only used for ERR_NONFATAL, which > are all uncorrectable errors, of course. > > This patch makes it so that when the slot_reset callbacks call > pci_cleanup_aer_uncorrect_error_status(), we only clear the > bits set by ERR_NONFATAL events (this is determined by > PCI_ERR_UNCOR_SEVER). > > That makes good sense to me. All these status bits are RW1CS, so they > will be preserved across a reset but will be cleared when we > re-enumerate, in this path: > > pci_init_capabilities > pci_aer_init > pci_cleanup_aer_error_status_regs > # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits > > > AER uncorrectable error status should take severity into account in order > > to clear the bits, so that ERR_NONFATAL path does not clear the bit which > > are marked with severity fatal. > > Two comments: > > 1) Can you split this into two patches, one that changes > pci_cleanup_aer_uncorrect_error_status() so it looks like the error > clearing code in aer_error_resume(), and a second that factors out the > duplicate code? > > 2) Maybe use "sev" or "sever" instead of "mask" for the local > variable, since there is also a separate PCI_ERR_UNCOR_MASK register, > which is not involved here. Let me back up a little here: I'm not asking you to do the things below here. They're just possible future things, so we can think about them after this series. And the things above are things I can easily do myself. So no action required from you, unless you think I'm on the wrong track :) > 3) The "pci_cleanup_aer_uncorrect_error_status()" name no longer > really describes what this does. Something like > "pci_aer_clear_nonfatal_status()" would be more descriptive. But I > see you have a subsequent patch (which I haven't looked at yet) that > is related to this. > > 4) I don't think the driver slot_reset callbacks should be responsible > for clearing these AER status bits. Can we clear them somewhere in > the pcie_do_nonfatal_recovery() path and remove these calls from the > drivers? > > 5) In principle, we should only read PCI_ERR_UNCOR_STATUS *once* per > device when handling an error. We currently read it three times: > > aer_isr > aer_isr_one_error > find_source_device > find_device_iter > is_error_source > read PCI_ERR_UNCOR_STATUS # 1 > aer_process_err_devices > get_device_error_info(e_info->dev[i]) > read PCI_ERR_UNCOR_STATUS # 2 > handle_error_source > pcie_do_nonfatal_recovery > ... > report_slot_reset > driver->err_handler->slot_reset > pci_cleanup_aer_uncorrect_error_status > read PCI_ERR_UNCOR_STATUS # 3 > > OK, that was more than two comments :)