On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote:
> On Tue, May 2, 2023 at 9:35 AM Timur Kristóf
> <timur.kristof@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote:
> > > >
> > > > Christian König <christian.koenig@xxxxxxx> wrote (on Tue,
> > > > May 2, 2023, at 9:59):
> > > >
> > > > > On 02.05.23 at 03:26, André Almeida wrote:
> > > > > > On 01/05/2023 at 16:24, Alex Deucher wrote:
> > > > > >> On Mon, May 1, 2023 at 2:58 PM André Almeida
> > > > > >> <andrealmeid@xxxxxxxxxx> wrote:
> > > > > >>>
> > > > > >>> I know that devcoredump is also used for this kind of
> > > > > >>> information, but I believe that using an IOCTL is better
> > > > > >>> for interfacing Mesa + Linux than parsing a file whose
> > > > > >>> contents are subject to change.
> > > > > >>
> > > > > >> Can you elaborate a bit on that? Isn't the whole point of
> > > > > >> devcoredump to store this sort of information?
> > > > > >>
> > > > > >
> > > > > > I think that devcoredump is something that you could
> > > > > > submit to a bug report as it is, and then people can
> > > > > > read/parse it as they want, not an interface to be read by
> > > > > > Mesa... I'm not sure that it's something that I would call
> > > > > > an API. But I might be wrong; if you know something that
> > > > > > uses it as an API, please share.
> > > > > >
> > > > > > Anyway, relying on that for Mesa would mean that we would
> > > > > > need to ensure stability for the file content and format,
> > > > > > making it less flexible to modify in the future and prone
> > > > > > to bugs, while the IOCTL is well defined and extensible.
> > > > > > Maybe the dump from Mesa + devcoredump could be
> > > > > > complementary information to a bug report.
> > > > >
> > > > > Neither using an IOCTL nor devcoredump is a good approach for
> > > > > this since the values read from the hw register are
> > > > > completely unreliable. They might not be available because of
> > > > > GFXOFF, or they could be overwritten or not even updated by
> > > > > the CP in the first place because of a hang, etc.
> > > > >
> > > > > If you want to track progress inside an IB, what you do
> > > > > instead is insert intermediate fence write commands into the
> > > > > IB. E.g. something like "write value X to location Y when
> > > > > this executes".
> > > > >
> > > > > This way you can track not only how far the IB was processed,
> > > > > but also which stage of processing we were in when the hang
> > > > > occurred, e.g. End of Pipe, End of Shaders, specific shader
> > > > > stages, etc.
> > > > >
> > > >
> > > > Currently our biggest challenge in the userspace driver is
> > > > debugging "random" GPU hangs. We have many dozens of bug reports
> > > > from users which are like: "play the game for X hours and it
> > > > will eventually hang the GPU". With the currently available
> > > > tools, it is impossible for us to tackle these issues. André's
> > > > proposal would be a step in improving this situation.
> > > >
> > > > We already do something like what you suggest, but there are
> > > > multiple problems with that approach:
> > > >
> > > > 1. We can only submit 1 command buffer at a time because we
> > > >    won't know which IB hung.
> > > > 2. We can't use chaining because we don't know where in the IB
> > > >    it hung.
> > > > 3. It needs userspace to insert (a lot of) extra commands such
> > > >    as extra synchronization and memory writes.
> > > > 4. It doesn't work when GPU recovery is enabled because the
> > > >    information is already gone when we detect the hang.
> > > >
> > > You can still submit multiple IBs and even chain them. All you
> > > need to do is insert into each IB commands which write the IB
> > > being executed and the position inside the IB to an extra memory
> > > location.
> > >
> > > The write data command allows writing as many dw as you want
> > > (up to multiple kb). The only potential problem is when you
> > > submit the same IB multiple times.
> > >
> > > And yes, that is of course quite some extra overhead, but I think
> > > that should be manageable.
> >
> > Thanks, this sounds doable and would solve the limitation of how
> > many IBs are submitted at a time. However, it doesn't address the
> > problem that enabling this sort of debugging will still have extra
> > overhead.
> >
> > I don't mean the overhead from writing a couple of dwords for the
> > trace, but rather the overhead from needing to emit flushes or top
> > of pipe events or whatever else we need so that we can tell which
> > command hung the GPU.
> >
> > > >
> > > > In my opinion, the correct solution to those problems would be
> > > > if the kernel could give userspace the necessary information
> > > > about a GPU hang before a GPU reset.
> > > >
> > > The fundamental problem here is that the kernel doesn't have that
> > > information either. We know which IB timed out and can
> > > potentially do a devcoredump when that happens, but that's it.
> > >
> >
> > Is it really not possible to know such a fundamental thing as what
> > the GPU was doing when it hung? How are we supposed to do any kind
> > of debugging without knowing that?
> >
> > I wonder what AMD's Windows driver team is doing with this problem,
> > surely they must have better tools to deal with GPU hangs?
>
> For better or worse, most teams internally rely on scan dumps via
> JTAG, which sort of limits the usefulness outside of AMD, but also
> gives you the exact state of the hardware when it's hung, so the
> hardware teams prefer it.
>

How does this approach scale? It's not something we can ask users to
do, and even if all of us in the radv team had a JTAG device, we
wouldn't be able to play every game that users experience random hangs
with.
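
For reference, the breadcrumb scheme quoted above (each IB writes its
own ID and the current position inside the IB to an extra memory
location) can be sketched as a small simulation. This is only an
illustration under stated assumptions: Cmd, TraceBuffer, build_ib and
run are hypothetical names, not Mesa or amdgpu APIs, and the "GPU" here
executes strictly in order, so it sidesteps the flush and
synchronization overhead discussed in the thread.

```python
from dataclasses import dataclass


@dataclass
class Cmd:
    """A simulated GPU command (hypothetical, not a real PM4 packet)."""
    name: str
    hangs: bool = False  # True simulates a command that never completes


@dataclass
class TraceBuffer:
    """Stands in for the extra memory location that survives the hang."""
    ib_id: int = -1
    position: int = -1


def build_ib(ib_id, cmds):
    """Interleave a breadcrumb write before every real command."""
    stream = []
    for pos, cmd in enumerate(cmds):
        stream.append(("breadcrumb", ib_id, pos))
        stream.append(("exec", cmd))
    return stream


def run(streams, trace):
    """Execute the IBs in order; return False at the first hanging command."""
    for stream in streams:
        for packet in stream:
            if packet[0] == "breadcrumb":
                # Simulated "write value X to location Y".
                trace.ib_id = packet[1]
                trace.position = packet[2]
            elif packet[1].hangs:
                return False  # hang: the trace keeps the last breadcrumb
    return True


ibs = [
    build_ib(0, [Cmd("draw"), Cmd("draw")]),
    build_ib(1, [Cmd("draw"), Cmd("dispatch", hangs=True), Cmd("draw")]),
]
trace = TraceBuffer()
completed = run(ibs, trace)
print(completed, trace.ib_id, trace.position)
```

After the simulated hang, the trace buffer names the IB and the index of
the command that was about to execute. As noted in the thread, this
breaks down if the same IB is submitted multiple times, since the IB ID
alone no longer identifies the submission.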