Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

Marek Olšák <maraeo@xxxxxxxxx> · Wed, 3 May 2023 21:13:26 -0400

On Wed, May 3, 2023, 14:53 André Almeida <andrealmeid@xxxxxxxxxx> wrote:
On 03/05/2023 14:08, Marek Olšák wrote:

> GPU hangs are pretty common post-bringup. They are not common per user, 

> but if we gather all hangs from all users, we can have lots and lots of 

> them.

> 

> GPU hangs are indeed not very debuggable. There are however some things 

> we can do:

> - Identify the hanging IB by its VA (the kernel should know it)

How can the kernel tell which VA range is being executed? I only found 

that information in the mmCP_IB1_BASE_ regs, but as Christian stated in 

this thread, those are not reliable to read.

The kernel receives the VA and the size via the CS ioctl. When user queues are enabled, the kernel will no longer receive them.
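
As a rough illustration of that bookkeeping (a sketch only, not amdgpu code): if the VA and size of every submitted IB are recorded at submission time, a suspect address can later be matched back to the IB that contains it. All names below (SubmittedIB, find_ib, ctx_id) are hypothetical and chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SubmittedIB:
    """Hypothetical record of one IB as seen at submission time."""
    ctx_id: int       # illustrative identifier of the submitting context
    va_start: int     # GPU virtual address of the first byte of the IB
    size_bytes: int   # size of the IB in bytes

    def contains(self, va: int) -> bool:
        # Half-open range check: [va_start, va_start + size_bytes)
        return self.va_start <= va < self.va_start + self.size_bytes


def find_ib(submitted: Iterable[SubmittedIB], va: int) -> Optional[SubmittedIB]:
    """Return the first recorded IB whose VA range covers `va`, if any."""
    for ib in submitted:
        if ib.contains(va):
            return ib
    return None


# Made-up addresses, purely for demonstration.
submitted = [
    SubmittedIB(ctx_id=1, va_start=0x100000, size_bytes=0x400),
    SubmittedIB(ctx_id=2, va_start=0x200000, size_bytes=0x1000),
]
```

With user queues this record would not exist on the kernel side, which is the limitation mentioned above.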

> - Read and parse the IB to detect memory corruption.

> - Print active waves with shader disassembly if SQ isn't hung (often 

> it's not).

> 

> Determining which packet the CP is stuck on is tricky. The CP has 2 

> engines (one frontend and one backend) that work on the same command 

> buffer. The frontend engine runs ahead, executes some packets and 

> forwards others to the backend engine. Only the frontend engine has the 

> command buffer VA somewhere. The backend engine only receives packets 

> from the frontend engine via a FIFO, so it might not be possible to tell 

> where it's stuck if it's stuck.

Do they run asynchronously, or does the front end wait for the back end 

to execute?

They run asynchronously and should run asynchronously for performance, but they can be synchronized using a special packet (PFP_SYNC_ME).
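
A toy model may help show why the frontend position says little about where the backend is stuck. This is a deliberately simplified simulation, not real CP behaviour: the packet names, the FIFO depth, and the rule that a "SYNC" packet stalls the frontend until the FIFO is empty are all assumptions made for the example.

```python
from collections import deque


def run(packets, fifo_depth=4):
    """Simulate a frontend feeding a backend through a bounded FIFO.

    Returns (frontend_index, stuck_packet). stuck_packet is None if the
    whole stream completes. Assumed packet kinds:
      "FE_*"  executed by the frontend itself
      "SYNC"  frontend waits until the FIFO has drained
      "HANG"  backend never finishes this packet
      other   forwarded to the backend
    """
    fifo = deque()
    fe = 0          # index of the next packet the frontend will fetch
    stuck = None    # packet the backend is stuck on, if any

    while fe < len(packets) or fifo:
        # Backend step: retire one packet unless it is already stuck.
        if fifo and stuck is None:
            if fifo[0] == "HANG":
                stuck = fifo[0]
            else:
                fifo.popleft()

        # Frontend step: runs ahead as far as the FIFO and syncs allow.
        progressed = False
        if fe < len(packets):
            pkt = packets[fe]
            if pkt == "SYNC":
                if not fifo:
                    fe += 1
                    progressed = True
            elif pkt.startswith("FE_"):
                fe += 1
                progressed = True
            elif len(fifo) < fifo_depth:
                fifo.append(pkt)
                fe += 1
                progressed = True

        # Backend is stuck and the frontend can no longer move: a hang.
        if stuck is not None and not progressed:
            return fe, stuck

    return fe, None


stream = ["DRAW0", "HANG", "FE_A", "DRAW1", "DRAW2", "SYNC", "DRAW3"]
result = run(stream)
```

In this model the backend hangs on the packet at index 1, but the frontend only stops at index 5, where it hits the sync. Reading the frontend's position alone would point at the wrong packet.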

Marek

> 

> When the gfx pipeline hangs outside of shaders, making a scandump seems 

> to be the only way to have a chance at finding out what's going wrong, 

> and only AMD-internal versions of hw can be scanned.

> 

> Marek

> 

> On Wed, May 3, 2023 at 11:23 AM Christian König 

> <ckoenig.leichtzumerken@xxxxxxxxx> wrote:

> 

>     On 03.05.23 at 17:08, Felix Kuehling wrote:

>      > On 2023-05-03 at 03:59, Christian König wrote:

>      >> On 02.05.23 at 20:41, Alex Deucher wrote:

>      >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf

>      >>> <timur.kristof@xxxxxxxxx> wrote:

>      >>>> [SNIP]

>      >>>>>>>> In my opinion, the correct solution to those problems would be

>      >>>>>>>> if the kernel could give userspace the necessary information

>      >>>>>>>> about a GPU hang before a GPU reset.

>      >>>>>>>>

>      >>>>>>> The fundamental problem here is that the kernel doesn't have

>      >>>>>>> that information either. We know which IB timed out and can

>      >>>>>>> potentially do a devcoredump when that happens, but that's it.

>      >>>>>>

>      >>>>>> Is it really not possible to know such a fundamental thing as

>      >>>>>> what the GPU was doing when it hung? How are we supposed to do

>      >>>>>> any kind of debugging without knowing that?

>      >>

>      >> Yes, that's indeed something I have been trying to figure out for

>      >> years as well.

>      >>

>      >> Basically there are two major problems:

>      >> 1. When the ASIC is hung you can't talk to the firmware engines any

>      >> more and most state is not exposed directly, but just through some

>      >> fw/hw interface.

>      >>     Just take a look at how umr reads the shader state from the SQ.

>      >> When that block is hung you can't do that any more and basically have

>      >> no chance at all to figure out why it's hung.

>      >>

>      >>     Same for other engines, I remember once spending a week figuring

>      >> out why the UVD block is hung during suspend. Turned out to be a

>      >> debugging nightmare because any time you touch any register of that

>      >> block the whole system would hang.

>      >>

>      >> 2. There are tons of things going on in a pipeline fashion or even

>      >> completely in parallel. For example the CP is just the beginning of a

>      >> rather long pipeline which at the end produces a bunch of pixels.

>      >>     In almost all cases I've seen, you run into a problem somewhere

>      >> deep in the pipeline and only very rarely at the beginning.

>      >>

>      >>>>>>

>      >>>>>> I wonder what AMD's Windows driver team is doing with this problem,

>      >>>>>> surely they must have better tools to deal with GPU hangs?

>      >>>>> For better or worse, most teams internally rely on scan dumps via

>      >>>>> JTAG which sort of limits the usefulness outside of AMD, but also

>      >>>>> gives you the exact state of the hardware when it's hung so the

>      >>>>> hardware teams prefer it.

>      >>>>>

>      >>>> How does this approach scale? It's not something we can ask users

>      >>>> to do, and even if all of us in the radv team had a JTAG device, we

>      >>>> wouldn't be able to play every game that users experience random

>      >>>> hangs with.

>      >>> It doesn't scale or lend itself particularly well to external

>      >>> development, but that's the current state of affairs.

>      >>

>      >> The usual approach seems to be to reproduce a problem in a lab and

>      >> have a JTAG attached to give the hw guys a scan dump and they can

>      >> then tell you why something didn't work as expected.

>      >

>      > That's the worst-case scenario where you're debugging HW or FW issues.

>      > Those should be pretty rare post-bringup. But are there hangs caused

>      > by user mode driver or application bugs that are easier to debug and

>      > probably don't even require a GPU reset? For example most VM faults

>      > can be handled without hanging the GPU. Similarly, a shader in an

>      > endless loop should not require a full GPU reset. In the KFD compute

>      > case, that's still preemptible and the offending process can be killed

>      > with Ctrl-C or debugged with rocm-gdb.

> 

>     We also have infinite-loop-in-shader abort for gfx, and page faults are

>     pretty rare with OpenGL (a bit more often with Vulkan) and can be

>     handled gracefully on modern hw (they just spam the logs).

> 

>     Unfortunately, the majority of the problems are hard hangs caused by

>     some hw issue. That can be caused by unlucky timing, power management

>     or doing things in an order the hw doesn't expect.

> 

>     Regards,

>     Christian.

> 

>      >

>      > It's more complicated for graphics because of the more complex

>      > pipeline and the lack of CWSR. But it should still be possible to do

>      > some debugging without JTAG if the problem is in SW and not HW or FW.

>      > It's probably worth improving that debuggability without getting

>      > hung up on the worst case.

>      >

>      > Maybe user mode graphics queues will offer a better way of recovering

>      > from these kinds of bugs, if the graphics pipeline can be unstuck

>      > without a GPU reset, just by killing the offending user mode queue.

>      >

>      > Regards,

>      >   Felix

>      >

>      >

>      >>

>      >> And yes that absolutely doesn't scale.

>      >>

>      >> Christian.

>      >>

>      >>>

>      >>> Alex

>      >>

>