Hi all,

I'm trying to work through this bug: https://bugs.freedesktop.org/show_bug.cgi?id=93649 . The main symptom is that the system locks up: some process tries to reset the GPU while a GPU reset is already in progress, which deadlocks. The system still works over ssh; only the graphics get stuck.

I'm trying to fix the kernel side of this first, so my GPU can reliably reset when the game triggers the GPU lockup. After that I'll try tracking down the Mesa issue which causes the lockup in the first place. I've done some preliminary investigation, but I'm running out of ideas, as public documentation on some of the AMD hardware is currently not available.

As far as I can tell, when the radeon module tries to reset the GPU, it always fails to bring up the VCE (which I haven't looked at yet, as it doesn't seem to be involved with this issue) and the UVD. The VCE failure is caught early, so the kernel module just ignores the whole thing. The UVD, however, claims to initialize properly, but when the kernel module tries to run a test IB on the UVD ring, it stalls forever. Note: before any issues occur, the UVD works on my GPU, tested with a random media file and vlc.

I asked on IRC some time ago, where Dave Airlie suggested that the UVD is really unhappy with being reset, and to try disabling that as a test. Nothing I tried yielded any improvement.

I also noticed that the SMC (I assume that is some sort of power manager? I didn't find anything on it besides the source code) fails to initialize after a reset, with the error:

[drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed

I'm wondering if this might be causing the issue instead, as the source code fiddles with the UVD after this error. Not knowing more, I can't say for sure.

Details on testing done:

For the UVD, I tried forcing it to be completely reset by setting the appropriate bit in SRBM_SOFT_RESET, but that still caused the failure to happen in the same place.
Based on the advice from IRC, I tried disabling large parts of the UVD startup and shutdown code, so that the UVD itself is never disabled. Some of the initialization code also disables parts of the UVD, which is why I disabled that as well. There was no change. Note that the initial startup was never changed, and vlc was always able to play a video using it.

Suspecting the SMC, I captured the return code from the message sent in si_set_sw_state. It always returns 0x0, which doesn't have a name in the source code. From looking at the code, I guess this means a timeout. I have no idea where to look further, as I couldn't find any documentation. If there is any I missed, I'd be happy to take a look and see what is going on.

I also captured traces of every command sent to the SMC, if that would help. I haven't checked them much, other than to note that they are different than on boot.

Also, is there a bit in either GRBM_SOFT_RESET or SRBM_SOFT_RESET to reset the SMC? I'm just curious if that might help.

I've been using vlc playing a movie while forcing a GPU reset through debugfs to speed up testing, as it quickly and reliably causes this issue. I can also reproduce this reliably with TF2; it just takes 30-60 minutes to test. For fixes I was hopeful about, I used TF2 to confirm that vlc's use of the UVD wasn't causing a reset failure different from the TF2 one.

Any help in debugging this issue would be greatly appreciated. Any documentation I can review to better understand the GPU would be helpful. I already checked the documentation linked from the fdo wiki, but it didn't mention this part.

One last thing: I can partially work around the hang by allowing the IB test of the UVD to time out. I've used a long timeout (20 seconds) for testing. Would a patch limiting the wait like this be accepted? It might allow users who run into this to recover (sometimes TF2 will recover thanks to that workaround and continue playing.
Sometimes the system still locks up due to other issues, but those don't seem to be hardware errors, so I'd rather work on that later). Right now I add a timeout to every call to radeon_fence_wait, but if that isn't a good idea I could add another similar function (radeon_fence_wait_timeout?) that takes a timeout, and update the ring tests appropriately.

Thanks for reading my wall of text,

-- Matthew
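
P.S. To make the radeon_fence_wait_timeout idea a bit more concrete, here is a rough userspace sketch of the behaviour I have in mind. It is not radeon code: fake_ring, fake_ring_tick, fake_fence_wait_timeout and fake_ring_ib_test are made-up names, and a simple poll count stands in for jiffies. The return convention (remaining time on success, 0 on timeout) is borrowed from the kernel's wait_event_timeout(); whether that is the right convention for the real function is an assumption on my part.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a ring: "hardware" bumps last_seq as work completes. */
struct fake_ring {
    uint64_t last_seq;  /* last sequence number the ring has completed */
    uint64_t stall_at;  /* the ring stops making progress at this value */
};

/* Advance the simulated ring by one step, unless it has hung. */
static void fake_ring_tick(struct fake_ring *ring)
{
    assert(ring != NULL);
    if (ring->last_seq < ring->stall_at)
        ring->last_seq++;
}

/*
 * Wait until the ring has completed sequence number seq, giving up after
 * timeout_ticks polls. Returns the remaining ticks (at least 1) if the
 * fence signaled, 0 on timeout, and -EINVAL for a non-positive timeout.
 */
static long fake_fence_wait_timeout(struct fake_ring *ring, uint64_t seq,
                                    long timeout_ticks)
{
    long remaining = timeout_ticks;

    if (timeout_ticks <= 0)
        return -EINVAL;

    while (remaining > 0) {
        if (ring->last_seq >= seq)
            return remaining;
        fake_ring_tick(ring);   /* stands in for sleeping one tick */
        remaining--;
    }
    /* Final check, in case the fence signaled on the very last tick. */
    return ring->last_seq >= seq ? 1 : 0;
}

/*
 * Ring test built on the bounded wait: a hung ring turns into an error
 * return instead of an endless stall.
 */
static int fake_ring_ib_test(struct fake_ring *ring, uint64_t seq,
                             long timeout_ticks)
{
    long r = fake_fence_wait_timeout(ring, seq, timeout_ticks);

    if (r < 0)
        return (int)r;
    if (r == 0)
        return -ETIMEDOUT;
    return 0;
}
```

With a ring that stalls before reaching the requested sequence number, fake_ring_ib_test() returns -ETIMEDOUT once the tick budget is used up, so the caller can report the failure and carry on with the rest of the reset instead of blocking forever. Existing callers that want the current unbounded behaviour would keep using the untimed wait.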