Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

Timur Kristóf <timur.kristof@xxxxxxxxx> · Tue, 02 May 2023 15:34:59 +0200

Hi,

On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote:
> > 
> > On Tue, 2 May 2023 at 9:59, Christian König <christian.koenig@xxxxxxx>
> > wrote:
> >  
> > > On 02.05.23 at 03:26, André Almeida wrote:
> > >  > On 01/05/2023 16:24, Alex Deucher wrote:
> > >  >> On Mon, May 1, 2023 at 2:58 PM André Almeida
> > >  >> <andrealmeid@xxxxxxxxxx> wrote:
> > >  >>>
> > >  >>> I know that devcoredump is also used for this kind of information,
> > >  >>> but I believe that using an IOCTL is better for interfacing Mesa +
> > >  >>> Linux than parsing a file whose contents are subject to change.
> > >  >>
> > >  >> Can you elaborate a bit on that?  Isn't the whole point of
> > >  >> devcoredump to store this sort of information?
> > >  >>
> > >  >
> > >  > I think that devcoredump is something that you could submit with
> > >  > a bug report as it is, and then people can read/parse it as they
> > >  > want, not an interface to be read by Mesa... I'm not sure that
> > >  > it's something that I would call an API. But I might be wrong; if
> > >  > you know of something that uses it as an API, please share.
> > >  >
> > >  > Anyway, relying on that for Mesa would mean that we would need to
> > >  > ensure stability of the file content and format, making it less
> > >  > flexible to modify in the future and prone to bugs, while the
> > >  > IOCTL is well defined and extensible. Maybe the dump from Mesa +
> > >  > devcoredump could be complementary information for a bug report.
> > >  
> > >  Neither using an IOCTL nor devcoredump is a good approach for this,
> > >  since the values read from the hw register are completely
> > >  unreliable. They might not be available because of GFXOFF, or they
> > >  could be overwritten or not even updated by the CP in the first
> > >  place because of a hang, etc.
> > >  
> > >  If you want to track progress inside an IB, what you do instead is
> > >  insert intermediate fence write commands into the IB, e.g.
> > >  something like "write value X to location Y when this executes".
> > >  
> > >  This way you can track not only how far the IB was processed, but
> > >  also which stage of processing we were in when the hang occurred,
> > >  e.g. End of Pipe, End of Shaders, specific shader stages, etc.
> > >  
> > > 
> >  
> > Currently our biggest challenge in the userspace driver is
> > debugging "random" GPU hangs. We have many dozens of bug reports
> > from users which are like: "play the game for X hours and it will
> > eventually hang the GPU". With the currently available tools, it is
> > impossible for us to tackle these issues. André's proposal would be
> > a step in improving this situation.
> > 
> > We already do something like what you suggest, but there are
> > multiple problems with that approach:
> >  
> > 1. We can only submit 1 command buffer at a time because we won't
> > know which IB hung
> > 2. We can't use chaining because we don't know where in the IB it
> > hung
> > 3. It needs userspace to insert (a lot of) extra commands such as
> > extra synchronization and memory writes
> > 4. It doesn't work when GPU recovery is enabled because the
> > information is already gone when we detect the hang
> > 
>  You can still submit multiple IBs and even chain them. All you need
> to do is to insert into each IB commands which write the IB being
> executed and the position inside that IB to an extra memory location.
> 
>  The write data command allows you to write as many dw as you want
> (up to multiple kb). The only potential problem is when you submit
> the same IB multiple times.
> 
>  And yes that is of course quite some extra overhead, but I think
> that should be manageable.

Thanks, this sounds doable and would solve the limitation on how many
IBs can be submitted at a time. However, it doesn't address the problem
that enabling this sort of debugging will still have extra overhead.

I don't mean the overhead from writing a couple of dwords for the
trace, but rather, the overhead from needing to emit flushes or top of
pipe events or whatever else we need so that we can tell which command
hung the GPU.
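
To make the scheme concrete, here is a minimal sketch of the marker
idea, simulated entirely on the CPU. Everything in it is hypothetical:
the command kinds, struct cmd, struct trace_slot, build_ib() and
execute_ib() are made-up names for illustration, not PM4 packets or
any existing Mesa or kernel interface. Each IB writes (IB id, position)
to a shared slot before every work item, so the slot can be inspected
after a hang even when several IBs were submitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command kinds. A real driver would emit PM4 packets. */
enum cmd_kind { CMD_TRACE, CMD_WORK, CMD_HANG };

struct cmd {
    enum cmd_kind kind;
    uint32_t ib_id;    /* meaningful for CMD_TRACE only */
    uint32_t position; /* meaningful for CMD_TRACE only */
};

/* Breadcrumb slot; assumed to stay readable after the hang. */
struct trace_slot {
    uint32_t ib_id;
    uint32_t position;
};

#define NO_HANG SIZE_MAX

/*
 * Fill `out` with n_work work items, each preceded by a trace write
 * recording (ib_id, index). The item at index hang_at never completes.
 * Returns the number of commands written.
 */
size_t build_ib(struct cmd *out, size_t cap, uint32_t ib_id,
                size_t n_work, size_t hang_at)
{
    size_t n = 0;

    assert(n_work <= cap / 2);
    for (size_t i = 0; i < n_work; i++) {
        out[n++] = (struct cmd){ CMD_TRACE, ib_id, (uint32_t)i };
        out[n++] = (struct cmd){ i == hang_at ? CMD_HANG : CMD_WORK, 0, 0 };
    }
    return n;
}

/* Simulated in-order execution. Returns 0 on completion, -1 on a hang. */
int execute_ib(const struct cmd *cmds, size_t n, struct trace_slot *trace)
{
    for (size_t i = 0; i < n; i++) {
        switch (cmds[i].kind) {
        case CMD_TRACE:
            trace->ib_id = cmds[i].ib_id;
            trace->position = cmds[i].position;
            break;
        case CMD_WORK:
            break;
        case CMD_HANG:
            return -1; /* trace still holds the last marker reached */
        }
    }
    return 0;
}
```

The simulation executes strictly in order, so the last marker always
identifies the failing item. Real hardware is pipelined, so a marker
landing in memory does not prove that earlier work has finished, which
is exactly why the extra flushes or events mentioned above are needed.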

>  
> > In my opinion, the correct solution to those problems would be if
> > the kernel could give userspace the necessary information about a
> > GPU hang before a GPU reset.
> >   
>  The fundamental problem here is that the kernel doesn't have that
> information either. We know which IB timed out and can potentially do
> a devcoredump when that happens, but that's it.

Is it really not possible to know such a fundamental thing as what the
GPU was doing when it hung? How are we supposed to do any kind of
debugging without knowing that?

I wonder how AMD's Windows driver team deals with this problem. Surely
they must have better tools for debugging GPU hangs?

Best regards,
Timur