Hi Maxim,
I can't help with the display-related issues. The best approach to get those fixed is probably to open a bug report on the FDO bug tracker.
But I'm the one who implemented the resizable BAR support, and your analysis of the problem sounds about correct to me.
This most likely works on Linux because we restore the BAR size on resume (and maybe during initial boot as well).
See this patch for reference:
commit d3252ace0bc652a1a244455556b6a549f969bf99
Author: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Date: Fri Jun 29 19:54:55 2018 -0500
PCI: Restore resized BAR state on resume
Resize BARs after resume to the expected size again.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959
Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6")
Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure")
Signed-off-by: Christian König <christian.koenig@xxxxxxx>
Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
CC: stable@xxxxxxxxxxxxxxx # v4.15+
It should be trivial to add this to the reset module as well. Most likely this can even be completely vendor independent: I'm not sure what a bus reset will do to this configuration, so restoring it every time should be the most defensive approach.
Let me know if you have any more questions on this.
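To make the restore step concrete, here is a minimal user-space sketch in Python of the register arithmetic only. It is not the kernel code from the commit above and it is not the reset module's code. The field masks follow the Resizable BAR control register layout in include/uapi/linux/pci_regs.h as of that commit; the function names and the example register values are hypothetical and chosen for illustration.

```python
# Resizable BAR control register fields (layout as in pci_regs.h at the
# time of the referenced commit; newer kernels widen the size field).
PCI_REBAR_CTRL_BAR_IDX = 0x00000007    # BAR index
PCI_REBAR_CTRL_NBAR_MASK = 0x000000E0  # number of resizable BARs
PCI_REBAR_CTRL_BAR_SIZE = 0x00001F00   # BAR size field
PCI_REBAR_CTRL_BAR_SHIFT = 8

MIB = 1 << 20


def rebar_size_code(size_bytes: int) -> int:
    """Encode a BAR size as a ReBAR size code: 0 = 1 MiB, 1 = 2 MiB, ...

    Sizes that are not a power of two are rounded up (an assumption of
    this sketch).
    """
    if size_bytes < 1:
        raise ValueError("BAR size must be positive")
    return max((size_bytes - 1).bit_length(), 20) - 20


def ctrl_size_bytes(ctrl: int) -> int:
    """Decode the BAR size currently programmed in a control register value."""
    code = (ctrl & PCI_REBAR_CTRL_BAR_SIZE) >> PCI_REBAR_CTRL_BAR_SHIFT
    return MIB << code


def restore_ctrl(ctrl: int, cached_bytes: int) -> int:
    """Return ctrl with its size field rewritten to match the cached size.

    All other fields (BAR index, number of BARs) are left untouched.
    """
    code = rebar_size_code(cached_bytes)
    ctrl &= ~PCI_REBAR_CTRL_BAR_SIZE
    ctrl |= (code << PCI_REBAR_CTRL_BAR_SHIFT) & PCI_REBAR_CTRL_BAR_SIZE
    return ctrl & 0xFFFFFFFF
```

For example, if a reset left the control register at a made-up value of 0x0820 (256 MiB) while the kernel still has an 8 GiB resource cached, restore_ctrl(0x0820, 8 << 30) gives the value to write back so that the hardware and the cached size agree again.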
Regards,
Christian.
On 02.01.21 at 23:42, Maxim Levitsky wrote:
Hi!

I have been using this card for about a year, and I would first like to say thanks for the open source driver that you made for it, for Big Navi, and for the Threadripper, which brought the fun back to computing.

I bought the card primarily to use as a host GPU in a VFIO-enabled multi-seat system I am building, and recently I was able (with a minor issue I managed to solve, more about it later) to pass that GPU to both Linux and Windows guests mostly flawlessly.

I have experience in kernel development and debugging, so I am willing to test patches, etc. Any help is welcome!

So these are the issues:

1. (the biggest issue): The amdgpu driver often crashes when plugging in an input. I tested this on purpose with 'amdgpu.dc=1' by slowly plugging and unplugging an input connector, waiting for the output to stabilize between each cycle, and the issue still reproduced after a dozen (or so) tries. (It only happens when I plug the connector in, and never when I unplug it.) Then I unloaded the amdgpu driver and loaded it again with dc=0. This does sort of work but takes a lot of time. The dmesg output is attached (amdgpu_dc1_plug_bug.txt).

I did try to increase the number of tries in dm_helpers_read_local_edid to something silly like 1000, but no luck. I also tried removing the code below the 'Abort detection for non-DP connectors if we have no EDID'. Also no luck.

This bug pretty much makes it impossible to use the card daily as is, since I connect and disconnect monitors often, especially due to VFIO usage.

2. I found out that running without the new DC framework (amdgpu.dc=0) solves issue 1 completely (but costs HDMI sound; HDMI sound only works with amdgpu.dc=1). I have been using the card like that for at least half a year and haven't had a single crash related to plugging or unplugging a connector.

Issue 2, however, is that in this mode (I haven't tried to reproduce this with amdgpu.dc=1 yet), sometimes when I unbind the amdgpu driver, it complains about a leaked connector and crashes a bit later. I haven't yet tracked down the combination of things needed to trigger this, but it has happened to me about 3 times already.

I put a WARN_ON(1) in __drm_connector_put_safe to see which caller triggers the delayed work that frees the connector when it is too late. I attached a backtrace with the above WARN_ON and the crash (connector_leak_bug.txt). For reference, I also attached the script 'amdgpu_unbind' that I use to unbind the amdgpu driver.

3. When doing VFIO passthrough of this card, I found out that it doesn't suffer that much from the reset bug. As long as I shut down the guest in a clean manner, I can start it again. The vendor_reset module, however, makes the reset work even when I shut down the guest right in the middle of a running 3D app, and I tested this many times.

_However_ this only works if I never load the amdgpu Linux driver. Otherwise a Windows guest still boots, but all 3D apps in it crash very early. I tried both the stock drivers that Windows installs automatically and the latest AMD workstation drivers from the AMD site. Linux guests do work.

I found out that the amdgpu driver resizes the device BARs (I have a TRX40 platform, so I don't know whether this platform supports AMD Smart Memory or not, but according to lspci the device does support resizable BARs).

If I patch out amdgpu's BAR resize, then the Windows guest _does_ work regardless of whether I loaded amdgpu beforehand. Linux guests also still work. I haven't measured the performance impact of this.

To debug this, I tried hiding the PCI_EXT_CAP_ID_REBAR capability from the VM, but it made no difference.

I suspect that once the GPU is reset, the BARs revert to their original sizes, but VFIO uses the sizes that are cached by the kernel, so the guest thinks that the BARs are of one size while they are actually of another. I have no idea, though, why this does work with a Linux guest.

For reference, I attached the PCI config with amdgpu running, once with my patch that stops it from resizing the BARs and once without that patch (amdgpu_pciconfig_noresize.txt, amdgpu_pciconfig_resize.txt).

4. I found out that amdgpu runtime PM sometimes breaks the card if the last output is disconnected from it. I didn't debug it much, as I just disabled it with amdgpu.runpm=0. I will do more debugging on this later.

Please let me know if you have any questions. Don't hesitate to ask me for more information.

My setup: 3 outputs, all HDMI, converted with DP->HDMI adapters, of which 2 are 1080p monitors and 1 is a 1080p TV. The issues I describe above are reproducible on all the outputs.

I am running a 5.10.0 kernel with a few patches and the kvm-queue branch merged, for my day-to-day work on KVM. You can find the exact kernel I use and its .config at
https://gitlab.com/maximlevitsky/linux/-/commits/kernel-starship-5.10

Best regards,
Maxim Levitsky
_______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx