Re: [PATCH v3] PCI: vmd: always enable host bridge hotplug support flags

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, May 02, 2024 at 03:38:00PM -0700, Paul M Stillwell Jr wrote:
> On 5/2/2024 3:08 PM, Bjorn Helgaas wrote:
> > On Mon, Apr 08, 2024 at 11:39:27AM -0700, Paul M Stillwell Jr wrote:
> > > Commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") added
> > > code to copy the _OSC flags from the root bridge to the host bridge for each
> > > vmd device because the AER bits were not being set correctly which was
> > > causing an AER interrupt storm for certain NVMe devices.
> > > 
> > > This works fine in bare metal environments, but causes problems when the
> > > vmd driver is run in a hypervisor environment. In a hypervisor all the
> > > _OSC bits are 0 despite what the underlying hardware indicates. This is
> > > a problem for vmd users because if vmd is enabled the user *always*
> > > wants hotplug support enabled. To solve this issue the vmd driver always
> > > enables the hotplug bits in the host bridge structure for each vmd.
> > > 
> > > Fixes: 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features")
> > > Signed-off-by: Nirmal Patel <nirmal.patel@xxxxxxxxxxxxxxx>
> > > Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@xxxxxxxxx>
> > > ---
> > >   drivers/pci/controller/vmd.c | 10 ++++++++--
> > >   1 file changed, 8 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> > > index 87b7856f375a..583b10bd5eb7 100644
> > > --- a/drivers/pci/controller/vmd.c
> > > +++ b/drivers/pci/controller/vmd.c
> > > @@ -730,8 +730,14 @@ static int vmd_alloc_irqs(struct vmd_dev *vmd)
> > >   static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
> > >   				       struct pci_host_bridge *vmd_bridge)
> > >   {
> > > -	vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug;
> > > -	vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug;
> > > +	/*
> > > +	 * there is an issue when the vmd driver is running within a hypervisor
> > > +	 * because all of the _OSC bits are 0 in that case. this disables
> > > +	 * hotplug support, but users who enable VMD in their BIOS always want
> > > +	 * hotplug suuport so always enable it.
> > > +	 */
> > > +	vmd_bridge->native_pcie_hotplug = 1;
> > > +	vmd_bridge->native_shpc_hotplug = 1;
> > 
> > Deferred for now because I think we need to figure out how to set all
> > these bits the same, or at least with a better algorithm than "here's
> > what we want in this environment."
> > 
> > Extended discussion about this at
> > https://lore.kernel.org/r/20240417201542.102-1-paul.m.stillwell.jr@xxxxxxxxx
> 
> That's ok by me. I thought where we left it was that if we could find a
> solution to the Correctable Errors from the original issue that maybe we
> could revert 04b12ef163d1.
> 
> I'm not sure I would know if a patch that fixes the Correctable Errors comes
> in... We have a test case we would like to test against that was pre
> 04b12ef163d1 (BIOS has AER disabled and we hotplug a disk which results in
> AER interrupts) so we would be curious if the issues we saw before goes away
> with a new patch for Correctable Errors.

My current theory is that there's some issue with that particular
Samsung NVMe device that causes the Correctable Error flood.  Kai-Heng
says they happen even with VMD disabled.

And there are other reports that don't seem to involve VMD but do
involve this NVMe device ([144d:a80a]):

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852420
  https://forums.unraid.net/topic/113521-constant-errors-on-logs-after-nvme-upgrade/
  https://forums.unraid.net/topic/118286-nvme-drives-throwing-errors-filling-logs-instantly-how-to-resolve/
  https://forum.proxmox.com/threads/pve-kernel-panics-on-reboots.144481/
  https://www.eevblog.com/forum/general-computing/linux-mint-21-02-clone-replace-1tb-nvme-with-a-2tb-nvme/

NVMe has weird power management stuff, so it's always possible we're
doing something wrong in a driver.

But I think we really need to handle Correctable Errors better by:

  - Possibly having drivers mask errors if they know about defects
  - Making the log messages less alarming, e.g.,  a single line report
  - Rate-limiting them so they're never overwhelming
  - Maybe automatically masking them in the PCI core to avoid interrupts

> > >   	vmd_bridge->native_aer = root_bridge->native_aer;
> > >   	vmd_bridge->native_pme = root_bridge->native_pme;
> > >   	vmd_bridge->native_ltr = root_bridge->native_ltr;
> > > -- 
> > > 2.39.1
> > > 
> > 
> 




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux