On Thu, Oct 15, 2020 at 11:04 AM Kuppuswamy, Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx> wrote: > > > > On 10/14/20 6:58 PM, Ethan Zhao wrote: > > On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan > > <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx> wrote: > >> > >> > >> > >> On 10/14/20 8:07 AM, Ethan Zhao wrote: > >>> On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan > >>> <sathyanarayanan.nkuppuswamy@xxxxxxxxx> wrote: > >>>> > >>>> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > >>>> merged fatal and non-fatal error recovery paths, and also made > >>>> recovery code depend on hotplug handler for "remove affected > >>>> device + rescan" support. But this change also complicated the > >>>> error recovery path and which in turn led to the following > >>>> issues. > >>>> > >>>> 1. We depend on hotplug handler for removing the affected > >>>> devices/drivers on DLLSC LINK down event (on DPC event > >>>> trigger) and DPC handler for handling the error recovery. Since > >>>> both handlers operate on same set of affected devices, it leads > >>>> to race condition, which in turn leads to NULL pointer > >>>> exceptions or error recovery failures.You can find more details > >>>> about this issue in following link. > >>>> > >>>> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.zhao@xxxxxxxxx/T/#t > >>>> > >>>> 2. For non-hotplug capable devices fatal (DPC) error recovery > >>>> is currently broken. Current fatal error recovery implementation > >>>> relies on PCIe hotplug (pciehp) handler for detaching and > >>>> re-enumerating the affected devices/drivers. So when dealing with > >>>> non-hotplug capable devices, recovery code does not restore the state > >>>> of the affected devices correctly. You can find more details about > >>>> this issue in the following links. > >>>> > >>>> https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@xxxxxxx/ > >>>> https://lore.kernel.org/linux-pci/12115.1588207324@famine/ > >>>> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx/ > >>>> > >>>> In order to fix the above two issues, we should stop relying on hotplug > >>> Yes, it doesn't rely on hotplug handler to remove and rescan the device, > >>> but it couldn't prevent hotplug drivers from doing another replicated > >>> removal/rescanning. > >>> it doesn't make sense to leave another useless removal/rescanning there. > >>> Maybe that's why these two paths were merged to one and made it rely on > >>> hotplug. > >> No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence > >> depending on it to handle some of its recovery function is in-correct and > >> would lead to issues in non-hotplug capable platforms (which is true > >> currently). > >>> > > > pci_lock_rescan_remove() is global lock for PCIe, the mal-functional > > device's port holds this lock, it prevents the whole system from doing > > hot-plug operation. > It does not prevent the hotplug operation, but it might delay it. Since both > DPC and hotplug operates on same set of devices, it must be synchronized. Think about a large system with some PCI domains, every domain has some ports and devices attached. why DPC and hotplug *must* operate on the same set of devices from different domains ? if it must be synchronized, why make the hotplug handlers threadable ? > > Though pciehp is not so hot/scalable and performance critical, but there > > is per cpu thread to handle hot-plug operation. synchronize all threads > > make them walk backwards for scalability. > DPC events does not happen in high frequency. So I don't think we should It's holding global lock, once malfunction happens to one device and it's driver, the whole system, everyone holds it, would be blocked to work. > worry about the performance here. Even hotplug handler will hold this lock > when adding/removing the devices. So adding/removing devices is a serialized You don't worry about performance, but if there is a requirement needs more scalable and reliable hotplug, the effect will be much harder. what to do then ? choose another OS ? To be honest, I don't like the global lock/ pci_lock_rescan_remove(). BTW, I didn't try the FATAL errors brute force injection on your patch, duplicated removal will work naturally because it was removed ? Thanks, Ethan > operation. > > > > >> > >>>> -- > >>>> 2.17.1 > >>>> > >> > >> -- > >> Sathyanarayanan Kuppuswamy > >> Linux Kernel Developer > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer