Re: [PATCH i-g-t] tests/initial_state: Add a test to capture the state of the GPU

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, May 16, 2017 at 10:07:41AM +0000, Lofstedt, Marta wrote:
> 
> 
> > -----Original Message-----
> > From: Chris Wilson [mailto:chris@xxxxxxxxxxxxxxxxxx]
> > Sent: Tuesday, May 16, 2017 12:48 PM
> > To: Lofstedt, Marta <marta.lofstedt@xxxxxxxxx>
> > Cc: Daniel Vetter <daniel@xxxxxxxx>; Martin Peres
> > <martin.peres@xxxxxxxxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re:  [PATCH i-g-t] tests/initial_state: Add a test to capture
> > the state of the GPU
> > 
> > On Tue, May 16, 2017 at 09:43:52AM +0000, Lofstedt, Marta wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Chris Wilson [mailto:chris@xxxxxxxxxxxxxxxxxx]
> > > > Sent: Tuesday, May 16, 2017 12:04 PM
> > > > To: Lofstedt, Marta <marta.lofstedt@xxxxxxxxx>
> > > > Cc: Daniel Vetter <daniel@xxxxxxxx>; Martin Peres
> > > > <martin.peres@xxxxxxxxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > > Subject: Re:  [PATCH i-g-t] tests/initial_state: Add a
> > > > test to capture the state of the GPU
> > > >
> > > > On Tue, May 16, 2017 at 08:54:51AM +0000, Lofstedt, Marta wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Chris Wilson [mailto:chris@xxxxxxxxxxxxxxxxxx]
> > > > > > Sent: Tuesday, May 16, 2017 11:21 AM
> > > > > > To: Lofstedt, Marta <marta.lofstedt@xxxxxxxxx>
> > > > > > Cc: Daniel Vetter <daniel@xxxxxxxx>; Martin Peres
> > > > > > <martin.peres@xxxxxxxxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > > > > Subject: Re:  [PATCH i-g-t] tests/initial_state: Add
> > > > > > a test to capture the state of the GPU
> > > > > >
> > > > > > On Tue, May 16, 2017 at 07:42:51AM +0000, Lofstedt, Marta wrote:
> > > > > > > I hereby pull-out this patch.
> > > > > > > The idea of it was to know if we were already wedged at the
> > > > > > > beginning of
> > > > > > testing, that would give us information on how to interpret
> > > > > > silly results; such that test starting to get skipped and/or we
> > > > > > got dmesg-warns/incomplete on tests that usually should be skipped.
> > > > > > > Also, we are planning to soon deploy a piglit.conf solution
> > > > > > > where testing
> > > > > > will be terminated on wedged, so I agree that my test isn't really
> > needed.
> > > > > >
> > > > > > Not everything is broken by wedged; internally we just use that
> > > > > > as an indicator that GEM is hosed. KMS should still work, we
> > > > > > must still be able to drive the displays to show the error and
> > > > > > keep the servers alive until the data is saved (and hopefully
> > > > > > gracefully degrade that we don't have to interrupt their immediate
> > session).
> > > > >
> > > > > It doesn't matter if it is broken or not, if we are terminally
> > > > > wedged the rest
> > > > of the result may be silly. Look for example at CI_DRM_2612, the
> > > > fi-elk-e7500 is wedged at igt@gem_busy@basic-hang-default, then all
> > > > test are skipped until gem_exec_reloc@basic-cpu-gtt-noreloc where
> > > > the machine hangs, but it is a gem test so it should have been
> > > > skipped, right. My conclusion from seeing this pattern multiple
> > > > times is that after terminally wedged, silly things can happen, i.e.
> > > > we can't trust the results, and since we don't want silly bugs, the CI
> > testing should be stopped.
> > > >
> > > > The machine didn't hang, it was remotely killed because the run timed
> > out.
> > > How do you know that?
> > 
> > The dmesg is a stream of flip timeouts until we run out of total BAT runtime
> > (12 minutes + some startup slack).
> > -Chris
> 
> Then look at CI_DRM_2602, wedged at igt@gem_busy@basic-hang-default, after a lot of skipping, we get incomplete result for another test, this time gem_exec_reloc@basic-gtt-cpu-noreloc
> 
> So, gem_exec_reloc@basic-cpu-gtt-noreloc and gem_exec_reloc@basic-gtt-cpu-noreloc are falsely getting blamed and my conclusion is that this is due to the permanent wedging started at gem_busy@basic-hang-default. So, to avoid bug reports for gem_exec_reloc@basic-cpu-gtt-noreloc and gem_exec_reloc@basic-gtt-cpu- noreloc the suggestion is to stop testing after we are terminally wedged. 

The conclusion is still wrong though, it doesn't derive from the wedged
state itself but from another bug that can be triggered by a recovered
gpu hang. The issue is that post-processing isn't yet smart enough to
determine the real cause and picks on the victim. We are always going to
have that problem in some form, the question is what can be done to make
the process smart to avoid false positives or rather just say this is a
general bug and be careful not to be precise on the blame.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux