Re: Bug: Completion-Wait loop timed out with vfio

On 3/3/2023 2:06 AM, Alex Williamson wrote:
> On Thu, 2 Mar 2023 09:40:35 +0200
> Tasos Sahanidis <tasos@xxxxxxxxxxxx> wrote:
> 
>> On 2023-03-01 16:10, Alex Williamson wrote:
>>> 0000:02:00.0 is the upstream port of that switch and 0000:03:02.0 is
>>> the downstream port for the 7790.  0000:03:02.0 is the port that should
>>> also now enter D3hot.
>>>   
>>>> If so, I tested in 5.18, both before and while running the VM, with 6.2
>>>> both with and without disable_idle_d3, and in all cases they stayed at D0.  
>>>
>>> It's possible the switch has a problem with D3hot support and it may
>>> need to be disabled or augmented with a PCI quirk.  In addition to
>>> investigating what power state the downstream port is achieving and
>>> reporting lspci -vvv with and without disable_idle_d3, would you mind
>>> reporting "lspci -nns 2:00.0" and "lspci -nns 3:" to report all the
>>> vendor and device IDs of the switch.  Thanks,
>>>   
>>
>> It seems that way, especially after manually preventing the root port
>> for the graphics card from entering D3hot, however the one for the NIC
>> seems to be doing that just fine, which makes things more confusing.
> 
> Yes, the fact that the NIC works suggests there's not simply a blatant
> chip defect where we should blindly disable D3 power state support for
> this downstream port.  I'm also not seeing any difference in the
> downstream port configuration between the VM running after the port has
> resumed from D3hot and the case where the port never entered D3hot.
> 
> But it suddenly dawns on me that you're assigning a Radeon HD 7790,
> which is one of the many AMD GPUs which is plagued by reset problems.
> I wonder if that's a factor there.  This particular GPU even has
> special handling in QEMU to try to manually reset the device, and which
> likely has never been tested since adding runtime power management
> support.  In fact, I'm surprised anyone is doing regular device
> assignment with an HD 7790 and considers it a normal, acceptable
> experience even with the QEMU workarounds.
> 
> I certainly wouldn't feel comfortable proposing a quirk for the
> downstream port to disable D3hot for an issue only seen when assigning
> a device with such a nefarious background relative to device
> assignment.  It does however seem like there are sufficient options in
> place to work around the issue, either disabling power management at
> the vfio-pci driver, or specifically for the downstream port via sysfs.

  Thanks Tasos and Alex. 

  We can use udev rules to toggle the sysfs entries automatically.
  The udev properties for the downstream or upstream bridge can be
  fetched with

  # udevadm info <device_path>
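
  As a rough sketch of that lookup, assuming the downstream port is still
  0000:03:02.0 as reported earlier in this thread, the helpers below build
  the sysfs path and print the runtime PM attributes that a rule would act
  on. The function names and the SYSFS_PCI_ROOT override are hypothetical
  and exist only for illustration and testing.

```shell
#!/bin/sh
# Sketch only: inspect the runtime PM state of one PCI device through sysfs.
# SYSFS_PCI_ROOT is a hypothetical override (useful for dry runs); by default
# the real sysfs location is used.

pci_sysfs_path() {
    # $1: PCI address in domain:bus:device.function form, e.g. 0000:03:02.0
    printf '%s/%s\n' "${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}" "$1"
}

show_runtime_pm() {
    dev_path=$(pci_sysfs_path "$1")
    # power/control:        "auto" allows runtime suspend, "on" keeps the device active
    # power/runtime_status: current runtime PM state, e.g. "active" or "suspended"
    for attr in power/control power/runtime_status; do
        if [ -r "$dev_path/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dev_path/$attr")"
        else
            printf '%s=<unreadable>\n' "$attr"
        fi
    done
}

# Example invocations (not run automatically):
#   show_runtime_pm 0000:03:02.0
#   udevadm info --attribute-walk --path="$(pci_sysfs_path 0000:03:02.0)"
```

  The --attribute-walk output lists the KERNEL, SUBSYSTEM and ATTR{...}
  values for the device and its parents, which are the keys a rules file
  can match on.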

  Then create a rules file.
  For NVIDIA GPU runtime PM, the udev rules are documented at

  https://download.nvidia.com/XFree86/Linux-x86_64/525.89.02/README/dynamicpowermanagement.html#AutomatedSetup803b0

  We can create similar udev rules if we want to automatically disable
  runtime PM only for a specific device.
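
  A minimal sketch of such a rule, again assuming the downstream port is
  0000:03:02.0 and that matching on the PCI address is acceptable (the
  address can change if the topology changes). The function name and the
  rules file name are hypothetical; the rule writes "on" to power/control
  so that the port stays out of runtime suspend.

```shell
#!/bin/sh
# Sketch only: emit a udev rule that keeps one PCI device out of runtime
# suspend by writing "on" to its power/control attribute when it is added.

make_no_runtime_pm_rule() {
    # $1: PCI address used as the udev KERNEL match, e.g. 0000:03:02.0
    printf 'ACTION=="add", SUBSYSTEM=="pci", KERNEL=="%s", TEST=="power/control", ATTR{power/control}="on"\n' "$1"
}

# Example installation as root (the file name is an assumption):
#   make_no_runtime_pm_rule 0000:03:02.0 > /etc/udev/rules.d/80-pci-no-runtime-pm.rules
#   udevadm control --reload
#   udevadm trigger --action=add --subsystem-match=pci
```

  The one-off manual equivalent is to write "on" to
  /sys/bus/pci/devices/0000:03:02.0/power/control as root; the rule only
  makes that setting persistent across boots.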
 
> I don't really have any better suggestions given our limited ability to
> test and highly suspect target device.  Any other ideas, Abhishek?

 We have already tried all the possible isolation steps from the
 user-space side, so I have nothing new that can be tried easily.
 I checked the lspci dumps and did not see any issues there.
 
 When testing my runtime PM patches, I wrote standalone programs based on
 the example in https://www.kernel.org/doc/html/next/driver-api/vfio.html
 to get more coverage. For this issue too, debugging may be easier if we
 can first reproduce it with a standalone program. But that requires
 effort and access to the register manual, so it is not worth trying at
 Tasos's end.

 We can go with the delay option you suggested in the latest thread and
 see if that helps.

 Thanks,
 Abhishek


