Comment # 4 on bug 107762 from Martin Peres
(In reply to Michel Dänzer from comment #2)
> (In reply to Martin Peres from comment #0)
> > [  358.292609] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> > [  358.292635] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=145, emitted seq=145
>
> (In reply to Martin Peres from comment #1)
> > [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> > [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=147, emitted seq=147
>
> Hmm, signalled and emitted sequence numbers are always the same, meaning the
> hardware hasn't actually timed out?
>
> I can think of two possibilities:
>
> * A GPU scheduler bug causing the job timeout handling to be triggered
>   spuriously. (Could something be stalling the system work queue, so the
>   items scheduled by drm_sched_job_finish_cb can't call drm_sched_job_finish
>   in time?)
>
> * A problem with the handling of the GPU's interrupts. Do the numbers on the
>   amdgpu line in /proc/interrupts still increase after these messages
>   appeared, or at least in the ten seconds before they appear?

Here is the IGT run log:

[283/301] skip: 65, pass: 218 - running: igt/amdgpu/amd_cs_nop/sync-fork-gfx0
[283/301] skip: 65, pass: 218 \ dmesg-warn: igt/amdgpu/amd_cs_nop/sync-fork-gfx0
[284/301] skip: 65, pass: 218, dmesg-warn: 1 \ running: igt/amdgpu/amd_prime/i915-to-amd
[284/301] skip: 65, pass: 218, dmesg-warn: 1 | pass: igt/amdgpu/amd_prime/i915-to-amd
[285/301] skip: 65, pass: 219, dmesg-warn: 1 | running: igt/amdgpu/amd_prime/amd-to-i915
[285/301] skip: 65, pass: 219, dmesg-warn: 1 / dmesg-fail: igt/amdgpu/amd_prime/amd-to-i915

It shows that both tests #283 and #285 generated the timeout, yet the seqno
increased by 10 between the two tests, suggesting that the GPU is not hung.
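For reference, the /proc/interrupts check Michel asks about could be scripted roughly as follows. This is only an illustrative sketch: the helper names (`irq_counts`, `amdgpu_irqs_still_firing`) are hypothetical, and the parsing assumes the usual Linux /proc/interrupts layout (IRQ number, one count column per CPU, then the chip and device names).

```python
import time

def irq_counts(interrupts_text, device="amdgpu"):
    """Sum the per-CPU interrupt counts on /proc/interrupts lines naming `device`.

    Each line looks like "131:  1000  2000  IR-PCI-MSI 327680-edge  amdgpu".
    """
    total = 0
    for line in interrupts_text.splitlines():
        if device not in line:
            continue
        for field in line.split()[1:]:  # skip the leading "131:" IRQ column
            if field.isdigit():
                total += int(field)
            else:
                break  # first non-numeric field starts the chip name
    return total

def amdgpu_irqs_still_firing(interval_s=10):
    """Return True if the amdgpu interrupt total grew over `interval_s` seconds."""
    with open("/proc/interrupts") as f:
        before = irq_counts(f.read())
    time.sleep(interval_s)
    with open("/proc/interrupts") as f:
        after = irq_counts(f.read())
    return after > before
```

Sampling the counters twice, ten seconds apart, and comparing the totals would show whether the GPU's interrupts stall around the time the timeout messages appear.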
I can't easily check whether the interrupts in /proc/interrupts are still
increasing on a machine that is part of our CI, but I guess if this is what it
takes to move this bug forward, I will try to get my hands on a KBLg platform
and help you trigger it.

However, if it is a scheduler bug, I would expect this issue to be reproducible
on any AMD GPU. Can you try running igt@amdgpu/amd_cs_nop@sync-fork-gfx0 in a
loop for an hour or so? Your second proposal would point at a KBLg-specific
bug, but let's first rule out the scheduler as being part of the problem.

In any case, thanks for your answer :)
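The suggested hour-long stress loop could be driven by a small wrapper like the one below. This is a sketch, not an established IGT workflow: `run_in_loop` is a hypothetical helper, and the commented-out invocation at the bottom assumes the common IGT convention of running a test binary with `--run-subtest`.

```python
import subprocess
import time

def run_in_loop(cmd, duration_s, stop_on_failure=True):
    """Re-run `cmd` until `duration_s` seconds elapse; return (runs, failures)."""
    deadline = time.monotonic() + duration_s
    runs = failures = 0
    while time.monotonic() < deadline:
        runs += 1
        if subprocess.run(cmd).returncode != 0:
            failures += 1
            if stop_on_failure:
                break  # stop immediately so dmesg still holds the evidence
    return runs, failures

# Hypothetical invocation (binary path and flag are assumptions):
# run_in_loop(["./amd_cs_nop", "--run-subtest", "sync-fork-gfx0"], 3600)
```

Stopping on the first failure keeps the relevant dmesg output close to the failing run, which makes correlating the timeout messages easier.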
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel