[PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we successfully kick a ring

chris at chris-wilson.co.uk (Chris Wilson) · Tue, 11 Jun 2013 17:10:38 +0100

On Tue, Jun 11, 2013 at 04:37:26PM +0200, Daniel Vetter wrote:
> On Tue, Jun 11, 2013 at 4:16 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > On Tue, Jun 11, 2013 at 04:05:41PM +0200, Daniel Vetter wrote:
> >> On Tue, Jun 11, 2013 at 02:40:19PM +0100, Chris Wilson wrote:
> >> > Not sure what you mean here. The check is fairly easy and has gotten us
> >> > out of many a hole before, and makes for a good defense. So how would
> >> > you want to fine tune it?
> >>
> >> Something like the MI_WAIT hangcheck score, but like I've said as long as
> >> we don't have a real-world bug report (some poor guy disabled semaphores
> >> maybe due to the snb issue?) not worth bothering at all.
> >>
> >> I've just thought that if we're unlucky and miss the interrupt a few times
> >> in a row we don't want to accidentally declare the gpu dead.
> >
> > I regarded it as a driver bug, that a GPU reset would not help. So the
> > choice is between limping along with the hopefully occasional stall, or
> > terminating the GPU with extreme prejudice. I chose the former, hence
> > did not increment the hangcheck.
> 
> Hm, maybe I'm reading the logic wrongly, but don't we add a += HUNG
> score now for a stuck, but idle ring? So pretty short of declaring the
> thing dead?

Yeah... Didn't mean to do that, as all the time I was thinking "don't
hang here, this is our bug not userspace's".

> Ofc there's the slow decline if the gpu isn't actually
> dead, but if we have more than 1 such stall every HUNG (=20) hangcheck
> times we'll eventually declare it dead despite the limping along.
> 
> Anyway nothing to really worry about, just wanted to check my
> understanding here.
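
To make the arithmetic above concrete, here is a rough sketch of the
accumulation being described. It is not the driver code: only HUNG = 20
is taken from this thread, while the FIRE threshold of 30, the decay of
one point per healthy check, and the helper name are assumptions made
for illustration.

```c
#include <assert.h>

#define HUNG 20	/* penalty for a stuck-but-idle ring (value quoted in the thread) */
#define FIRE 30	/* ASSUMED threshold above which the GPU is declared dead */

/*
 * Hypothetical helper, not part of i915: simulate max_checks hangcheck
 * runs in which the ring stalls once every stall_period checks and
 * makes progress on all the others.
 *
 * Returns the index of the check at which the score first exceeds
 * FIRE, or -1 if that never happens within max_checks.
 */
static int checks_until_declared_dead(int stall_period, int max_checks)
{
	int score = 0;
	int i;

	assert(stall_period > 0);

	for (i = 0; i < max_checks; i++) {
		if (i % stall_period == 0)
			score += HUNG;	/* stalled: add the full HUNG penalty */
		else if (score > 0)
			score--;	/* ASSUMED slow decline of 1 per good check */

		if (score > FIRE)
			return i;
	}

	return -1;
}
```

Under these assumptions a stall every 21 checks never trips the
threshold, because the score decays back to zero in between. A stall
every 20 checks gains one net point per cycle and crosses the threshold
at check 220, which is the "eventually declare it dead despite the
limping along" case.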

Looks like my fingers mutinied; and I am the one confused.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre