On Mon, Apr 29, 2019 at 08:07:53PM -0500, Alex G wrote: > On 4/29/19 1:56 PM, Bjorn Helgaas wrote: > > From: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> > > > > This reverts commit e8303bb7a75c113388badcc49b2a84b4121c1b3e. > > > > e8303bb7a75c added logging whenever a link changed speed or width to a > > state that is considered degraded. Unfortunately, it cannot differentiate > > signal integrity-related link changes from those intentionally initiated by > > an endpoint driver, including drivers that may live in userspace or VMs > > when making use of vfio-pci. Some GPU drivers actively manage the link > > state to save power, which generates a stream of messages like this: > > > > vfio-pci 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:02.0 (capable of 64.000 Gb/s with 5 GT/s x16 link) > > > > We really *do* want to be alerted when the link bandwidth is reduced > > because of hardware failures, but degradation by intentional link state > > management is probably far more common, so the signal-to-noise ratio is > > currently low. > > > > Until we figure out a way to identify the real problems or silence the > > intentional situations, revert the following commits, which include the > > initial implementation (e8303bb7a75c) and subsequent fixes: > > I think we're overreacting to a bit of perceived verbosity in the system > log. Intentional degradation does not seem to me to be as common as > advertised. I have not observed this with either radeon, nouveau, or amdgpu, > and the proper mechanism to save power at the link level is ASPM. I stand to > be corrected and we have on CC some very knowledgeable fellows that I am > certain will jump at the opportunity to do so. I can't quantify how common it is, but the verbosity is definitely *there*, and it seems unlikely to me that a hardware failure is more common than any intentional driver-driven degradation. If we can reliably distinguish hardware failures from benign changes, we should certainly log the failures. But in this case even the failures are fully functional, albeit at lower performance, so if the messages end up being 99% false positives, I think it'll just be confusing for users. > What it seems like to me is that a proprietary driver running in a VM is > initiating these changes. And if that is the case then it seems this is a > virtualization problem. A quick glance over GPU drivers in linux did not > reveal any obvious places where we intentionally downgrade a link. I'm not 100% on board with the idea of drivers directly manipulating the link because it seems like the PCI core might need to be at least aware of this. But some drivers certainly do manipulate it today for ASPM, gen2/gen3 retraining, etc. If we treat this as a virtualization problem, I guess you're suggesting the host kernel should prevent that sort of link manipulation? We could have a conversation about that, but it doesn't seem like the immediate solution to this problem. > I'm not convinced a revert is the best call. I have very limited options at this stage of the release, but I'd be glad to hear suggestions. My concern is that if we release v5.1 as-is, we'll spend a lot of energy on those false positives. Bjorn