On 03.05.23 at 17:08, Felix Kuehling wrote:
On 2023-05-03 at 03:59, Christian König wrote:
On 02.05.23 at 20:41, Alex Deucher wrote:
On Tue, May 2, 2023 at 11:22 AM Timur Kristóf <timur.kristof@xxxxxxxxx> wrote:
[SNIP]
In my opinion, the correct solution to those problems would be if the
kernel could give userspace the necessary information about a GPU hang
before a GPU reset.
The fundamental problem here is that the kernel doesn't have that
information either. We know which IB timed out and can potentially do
a devcoredump when that happens, but that's it.
Is it really not possible to know such a fundamental thing as what the
GPU was doing when it hung? How are we supposed to do any kind of
debugging without knowing that?
Yes, that's indeed something I've been trying to figure out for years
as well.
Basically there are two major problems:
1. When the ASIC is hung you can't talk to the firmware engines any
more and most state is not exposed directly, but just through some
fw/hw interface.
Just take a look at how umr reads the shader state from the SQ.
When that block is hung you can't do that any more and basically have
no chance at all to figure out why it's hung.
The same goes for other engines. I remember once spending a week
figuring out why the UVD block hung during suspend. It turned out to
be a debugging nightmare because any time you touched any register of
that block the whole system would hang.
2. There are tons of things going on in a pipeline fashion or even
completely in parallel. For example the CP is just the beginning of a
rather long pipeline which at the end produces a bunch of pixels.
In almost all cases I've seen, you run into a problem somewhere
deep in the pipeline and only very rarely at the beginning.
I wonder what AMD's Windows driver team is doing with this problem,
surely they must have better tools to deal with GPU hangs?
For better or worse, most teams internally rely on scan dumps via JTAG
which sort of limits the usefulness outside of AMD, but also gives you
the exact state of the hardware when it's hung so the hardware teams
prefer it.
How does this approach scale? It's not something we can ask users to
do, and even if all of us in the radv team had a JTAG device, we
wouldn't be able to play every game that users experience random hangs
with.
It doesn't scale or lend itself particularly well to external
development, but that's the current state of affairs.
The usual approach seems to be to reproduce a problem in a lab and
have a JTAG attached to give the hw guys a scan dump and they can
then tell you why something didn't work as expected.
That's the worst-case scenario where you're debugging HW or FW issues.
Those should be pretty rare post-bringup. But are there hangs caused
by user mode driver or application bugs that are easier to debug and
probably don't even require a GPU reset? For example most VM faults
can be handled without hanging the GPU. Similarly, a shader in an
endless loop should not require a full GPU reset. In the KFD compute
case, that's still preemptible and the offending process can be killed
with Ctrl-C or debugged with rocm-gdb.
We also have an infinite-loop-in-shader abort for gfx, and page faults
are pretty rare with OpenGL (a bit more often with Vulkan) and can be
handled gracefully on modern hw (they just spam the logs).
Unfortunately, the majority of the problems are hard hangs caused by
some hw issue. Those can be triggered by unlucky timing, power
management or doing things in an order the hw doesn't expect.
Regards,
Christian.
It's more complicated for graphics because of the more complex
pipeline and the lack of CWSR. But it should still be possible to do
some debugging without JTAG if the problem is in SW and not HW or FW.
It's probably worth improving that debuggability without getting
hung up on the worst case.
Maybe user mode graphics queues will offer a better way of recovering
from these kinds of bugs, if the graphics pipeline can be unstuck
without a GPU reset, just by killing the offending user mode queue.
Regards,
Felix
And yes that absolutely doesn't scale.
Christian.
Alex