Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

Timur Kristóf <timur.kristof@xxxxxxxxx> · Tue, 02 May 2023 17:22:46 +0200

On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote:
> On Tue, May 2, 2023 at 9:35 AM Timur Kristóf
> <timur.kristof@xxxxxxxxx> wrote:
> > 
> > Hi,
> > 
> > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote:
> > > > 
> > > > Christian König <christian.koenig@xxxxxxx> wrote (on Tue,
> > > > 2 May 2023, 9:59):
> > > > 
> > > > > On 02.05.23 at 03:26, André Almeida wrote:
> > > > >  > On 01/05/2023 16:24, Alex Deucher wrote:
> > > > >  >> On Mon, May 1, 2023 at 2:58 PM André Almeida
> > > > >  >> <andrealmeid@xxxxxxxxxx> wrote:
> > > > >  >>>
> > > > >  >>> I know that devcoredump is also used for this kind of
> > > > >  >>> information, but I believe that using an IOCTL is better for
> > > > >  >>> interfacing Mesa + Linux rather than parsing a file whose
> > > > >  >>> contents are subject to change.
> > > > >  >>
> > > > >  >> Can you elaborate a bit on that?  Isn't the whole point of
> > > > >  >> devcoredump to store this sort of information?
> > > > >  >>
> > > > >  >
> > > > >  > I think that devcoredump is something that you could use to
> > > > >  > submit to a bug report as it is, and then people can read/parse
> > > > >  > as they want, not as an interface to be read by Mesa... I'm not
> > > > >  > sure that it's something that I would call an API. But I might
> > > > >  > be wrong, if you know something that uses that as an API please
> > > > >  > share.
> > > > >  >
> > > > >  > Anyway, relying on that for Mesa would mean that we would need
> > > > >  > to ensure stability for the file content and format, making it
> > > > >  > less flexible to modify in the future and prone to bugs, while
> > > > >  > the IOCTL is well defined and extensible. Maybe the dump from
> > > > >  > Mesa + devcoredump could be complementary information to a bug
> > > > >  > report.
> > > > > 
> > > > >  Neither using an IOCTL nor devcoredump is a good approach for
> > > > >  this since the values read from the hw register are completely
> > > > >  unreliable. They could not be available because of GFXOFF or
> > > > >  they could be overwritten or not even updated by the CP in the
> > > > >  first place because of a hang etc....
> > > > > 
> > > > >  If you want to track progress inside an IB what you do instead
> > > > >  is to insert intermediate fence write commands into the IB.
> > > > >  E.g. something like write value X to location Y when this
> > > > >  executes.
> > > > > 
> > > > >  This way you can not only track how far the IB processed, but
> > > > >  also in which stages of processing we were when the hang
> > > > >  occurred. E.g. End of Pipe, End of Shaders, specific shader
> > > > >  stages etc...
> > > > > 
> > > > > 
> > > > 
> > > > Currently our biggest challenge in the userspace driver is
> > > > debugging "random" GPU hangs. We have many dozens of bug reports
> > > > from users which are like: "play the game for X hours and it will
> > > > eventually hang the GPU". With the currently available tools, it
> > > > is impossible for us to tackle these issues. André's proposal
> > > > would be a step in improving this situation.
> > > > 
> > > > We already do something like what you suggest, but there are
> > > > multiple problems with that approach:
> > > > 
> > > > 1. we can only submit 1 command buffer at a time because we
> > > > won't know which IB hung
> > > > 2. we can't use chaining because we don't know where in the IB
> > > > it hung
> > > > 3. it needs userspace to insert (a lot of) extra commands such
> > > > as extra synchronization and memory writes
> > > > 4. It doesn't work when GPU recovery is enabled because the
> > > > information is already gone when we detect the hang
> > > > 
> > >  You can still submit multiple IBs and even chain them. All you
> > > need to do is to insert into each IB commands which write to an
> > > extra memory location with the IB executed and the position inside
> > > the IB.
> > > 
> > >  The write data command allows writing as many dwords as you
> > > want (up to multiple kB). The only potential problem is when you
> > > submit the same IB multiple times.
> > > 
> > >  And yes that is of course quite some extra overhead, but I think
> > > that should be manageable.
> > 
> > Thanks, this sounds doable and would solve the limitation of how
> > many IBs are submitted at a time. However it doesn't address the
> > problem that enabling this sort of debugging will still have extra
> > overhead.
> > 
> > I don't mean the overhead from writing a couple of dwords for the
> > trace, but rather, the overhead from needing to emit flushes or top
> > of pipe events or whatever else we need so that we can tell which
> > command hung the GPU.
> > 
> > > 
> > > > In my opinion, the correct solution to those problems would be
> > > > if the kernel could give userspace the necessary information
> > > > about a GPU hang before a GPU reset.
> > > > 
> > >  The fundamental problem here is that the kernel doesn't have
> > > that information either. We know which IB timed out and can
> > > potentially do a devcoredump when that happens, but that's it.
> > 
> > 
> > Is it really not possible to know such a fundamental thing as what
> > the GPU was doing when it hung? How are we supposed to do any kind of
> > debugging without knowing that?
> > 
> > I wonder what AMD's Windows driver team is doing with this problem,
> > surely they must have better tools to deal with GPU hangs?
> 
> For better or worse, most teams internally rely on scan dumps via
> JTAG which sort of limits the usefulness outside of AMD, but also
> gives you the exact state of the hardware when it's hung so the
> hardware teams prefer it.
> 

How does this approach scale? It's not something we can ask users to
do, and even if all of us in the radv team had a JTAG device, we
wouldn't be able to play every game that users experience random hangs
with.
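
For reference, the marker approach discussed earlier in the thread (each
IB carries commands that write the IB's identity and the position inside
the IB to an extra memory location) could be sketched roughly as the
host-side simulation below. This is only an illustration under stated
assumptions: the command types, the struct and function names, and the
strictly in-order execution are all hypothetical, and none of it models
real PM4 packets or the amdgpu uAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command types for illustration; these are NOT real PM4 packets. */
enum cmd_type { CMD_WORK, CMD_WRITE_MARKER, CMD_HANG };

struct cmd {
    enum cmd_type type;
    uint32_t ib_id;    /* used by CMD_WRITE_MARKER only */
    uint32_t position; /* used by CMD_WRITE_MARKER only */
};

/* Stand-in for the extra memory location that the markers write to. */
struct trace_slot {
    uint32_t ib_id;
    uint32_t position;
};

#define NO_HANG ((size_t)-1)

/*
 * Fill `out` with n_work work items, each preceded by a marker that records
 * (ib_id, index). The work item at index hang_at is replaced by CMD_HANG to
 * simulate a hang; pass NO_HANG for none. Returns the number of commands.
 */
static size_t build_traced_ib(struct cmd *out, size_t cap, uint32_t ib_id,
                              size_t n_work, size_t hang_at)
{
    size_t n = 0;

    assert(cap >= 2 * n_work); /* one marker plus one work item per entry */
    for (size_t i = 0; i < n_work; i++) {
        out[n++] = (struct cmd){ CMD_WRITE_MARKER, ib_id, (uint32_t)i };
        out[n++] = (struct cmd){ i == hang_at ? CMD_HANG : CMD_WORK, 0, 0 };
    }
    return n;
}

/* Simulated, strictly in-order execution. Returns false if the IB hung. */
static bool execute_ib(const struct cmd *cmds, size_t n, struct trace_slot *trace)
{
    for (size_t i = 0; i < n; i++) {
        switch (cmds[i].type) {
        case CMD_WRITE_MARKER:
            /* "write value X to location Y when this executes" */
            trace->ib_id = cmds[i].ib_id;
            trace->position = cmds[i].position;
            break;
        case CMD_HANG:
            return false; /* nothing after this point runs */
        case CMD_WORK:
            break;
        }
    }
    return true;
}
```

After a simulated hang, the trace slot holds the last (ib_id, position)
pair that was written, which identifies both the IB and the work item
that was in flight. Two caveats from the thread apply: the sketch
executes in order, whereas on real hardware the markers are only accurate
if extra synchronization is emitted around them (the overhead concern
raised above), and submitting the same IB multiple times makes the ib_id
ambiguous.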