[Bug 105113] [hawaii, radeonsi, clover] Running Piglit cl/program/execute/{, tail-}calls{, -struct, -workitem-id}.cl cause GPU VM error and ring stalled GPU lockup

bugzilla-daemon@xxxxxxxxxxxxxxx · Mon, 19 Nov 2018 14:03:36 +0000

Comment #8 on bug 105113 from Maciej S. Szmigiero

(In reply to Jan Vesely from comment #7)
> (In reply to Maciej S. Szmigiero from comment #6)
> > There are really two issues at play here:
> > 1) If the LLVM-generated code cannot be run properly then it should be simply
> > rejected by whatever is actually in charge of submitting it to the GPU (I
> > guess this would be Mesa?).
> > This way an application will know it cannot use OpenCL for computation, at
> > least not with this compute kernel.
> > 
> > Instead, it currently looks like many of these tests run but give incorrect
> > results, which is obviously rather bad.
> 
> Do you have an example of this? clover should return OUT_OF_RESOURCES error
> when the compute state creation fails (like in the presence of code
> relocations).
> It does not change the content of the buffer, so it will return whatever was
> stored in the buffer on creation.

Aren't program@execute@calls-struct and program@execute@tail-calls tests
from comment 4 examples of this behavior?
These seem to run but return wrong results, or am I not parsing the piglit
test results correctly?

> > 2) Some (previous) Mesa + LLVM versions generate a command stream that
> > crashes the GPU and, as far as I can remember, sometimes even locks up the
> > whole machine.
> > 
> > It should not be possible to crash the GPU, regardless of how incorrect a
> > command stream that userspace sends to it is, because otherwise it is
> > possible for an unprivileged user with GPU access to DoS the machine.
> 
> This is a separate issue. GPU hangs are generally addressed via GPU reset,
> which should be enabled for gfx8/9 GPUs in recent amdgpu.ko [0].
> 
> [0] https://patchwork.freedesktop.org/patch/257994/

This would explain why "amdgpu" did not even seem to attempt to reset the GPU
after a crash.

However, I think I got at least one lockup when testing this issue half a
year ago with the "radeon" driver ("amdgpu" is still marked as experimental
for SI parts).
If I am able to reproduce it in the future I will report it then.


_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel