On Thu, Nov 16, 2017 at 03:52:47PM -0500, Sinan Kaya wrote: > >> > >> Whether the AER driver reads ~0 or not really depends on timing. The > >> link may come up from the DPC driver by the time AER driver reaches > >> here as an example. > >> > >> Bad things do happen. We have seen this with e1000e driver. > > > > I don't doubt that bad things happen. I'm just trying to understand > > exactly *what* bad things happen and how, so we can fix them cleanly. > > This was random crashes in the e1000e drivers accompanied with stack > traces coming from WARN and msi allocation routines. I didn't look in detail, but I'm not sure there's sufficient locking in the AER path to make it safe from concurrent device removal. I suspect AER could be improved both with respect to handling ~0 data and this potential concurrency issue. > > So the "stop" and "recover" commands you mention must be related to > > AER. > > I was talking about pci_stop_and_remove_bus_device() vs. error_detected() > as "stop" and "recover" Thanks for clearing that up! > > I suspect this all probably requires tighter integration between DPC > > and AER, and I'm totally fine with that. I think the current > > separation as separate "drivers" is pretty artificial anyway. > > Got it. We will try to plumb DPC error handling into AER driver's error > handling mechanism. Looking at the AER code today, I noticed it already uses "DPC" in another sense. I don't know what it stands for there (probably "deferred" something), but I don't think it's "Downstream Port Containment" :) > What do you think about the rescan following link up? The only entity > that does rescan today is hotplug after DPC recovery. There could be > a platform with DPC support but no hotplug support. > > How should we handle it? Good question. If your system does support both DPC and hotplug, I assume the link comes back up after you clear DPC Trigger Status. Does pciehp notice that "link up" event and add the device back? So I think your question is whether the DPC code should explicitly do a rescan so that if we don't have pciehp, we'll still automatically rediscover the device. I dunno, maybe. Seems like a plausible idea anyway. Bjorn