On Thu, Aug 17, 2017 at 01:06:32PM +0000, Gabriele Paoloni wrote: > Hi Bjorn > > > -----Original Message----- > > From: Bjorn Helgaas [mailto:helgaas@xxxxxxxxxx] > > Sent: 16 August 2017 15:08 > > To: liudongdong (C) > > Cc: linux-pci@xxxxxxxxxxxxxxx; Gabriele Paoloni; Chenxin (Charles); > > Linuxarm > > Subject: Re: [PATCH] PCIe AER: report non fatal errors only to the > > functions of the same device > > > > On Wed, Aug 16, 2017 at 04:50:16PM +0800, Dongdong Liu wrote: > > > Hi Bjorn > > > > > > 在 2017/8/16 6:50, Bjorn Helgaas 写道: > > > >On Fri, Aug 04, 2017 at 10:18:26AM +0800, Dongdong Liu wrote: > > > >>From: Gabriele Paoloni <gabriele.paoloni@xxxxxxxxxx> > > > >> > > > >>Currently if an uncorrectable error is reported by an EP the AER > > > >>driver walks over all the devices connected to the upstream port > > > >>bus and in turns call the report_error_detected() callback. > > > >>If any of the devices connected to the bus does not implement > > > >>dev->driver->err_handler->error_detected() do_recovery() will fail. > > > >> > > > >>However for non fatal errors the PCIe link should not be considered > > > >>compromised, therefore it makes sense to report the error only to > > > >>all the functions of a multifunction device. > > > >>This patch implements this new behaviour for non fatal errors. > > > > > > > >Why do we bother even with other functions in the same multifunction > > > >device? PCIe r3.1, sec 6.2.2.2.2, says non-fatal errors only affect > > a > > > >particular transaction, and "devices not associated with the > > > >transaction in error are not impacted." > > > > > > > >A transaction is only associated with one function, so other > > functions > > > >in the same device shouldn't be affected, should they? > > > > > > PCIe r3.1, sec 6.2.4 Error Logging > > > PCI Express errors are not Function-specific. "Software is > > > responsible for scanning all Functions in a multi-Function device > > > when it detects one of those errors" > > > > The previous text basically says that if a multi-function device > > should generate at most one error reporting message, even if several > > functions have logged an error of the same severity. I think that > > single message corresponds to a single interrupt. > > > > So when it says "software is responsible for scanning all Functions in > > a multi-Function device," I think the point is that software should > > read the error reporting registers of all functions in case several of > > them have logged errors. > > Right, looking again at the AER core it seems that find_source_device() > would look for the error sources by walking the PCIe hierarchy starting > from the RP that reported the error (however from AER_MAX_MULTI_ERR_DEVICES > max 5 devices can log an error on a single AER interrupt...). > > Anyway as it is now, and assuming that we have no more than 5 functions > in a multi-function device, AER core should call handle_error_source() > for each function that logged an error, right? > > > > > But IIUC, this patch has nothing to do with reading the error CSRs > > (which should be done in the PCI/AER core). This patch merely changes > > the set of devices for which we call the driver's error reporting > > interfaces. > > Correct. From our point of view if a fatal AER is reported by a function > then we need to call the driver callbacks also on all the function under > the same bus as the reporting device (as the link is compromised). > > We thought that for non-fatal errors this is not necessary as the bus > link should not be considered compromised, but we thought that for MF > devices maybe it would have been appropriate to call the driver callbacks > on the other functions of the same device. However after your consideration > above and after double checking the AER core it seems that also this is > not necessary (in fact for a MF device handle_error_source() will be called > for each function that logged the error) > > > > > If this is fixing a problem, maybe it would help clarify things if you > > could include the concrete details of what's going wrong. > > In Hi1620 we have some integrated controllers that appear as PCIe EPs > under the same bus. Some of these controllers (e.g. the SATA > controller) are missing the err_handler callbacks. > > If one device reports a non-fatal uncorrectable error with the current > AER core code the callbacks for all the devices under the same bus will > be called and, if any of the devices is missing the callback all the > devices in the subtree are left in error state without recovery... > This patch is needed to sort out a situation like this one. > > Anyway I think that after the considerations above on the MF device > I can modify this patch to just call the callback for the reporting > function...? Sounds like a reasonable approach. I'll compare the patch with the spec in more detail when you post it.