Re: [PATCH 18/46] drm/i915: Replace global_seqno with a hangcheck heartbeat seqno

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Tue, 12 Feb 2019 13:36:30 +0000

Quoting Tvrtko Ursulin (2019-02-11 16:56:03)
> 
> On 11/02/2019 12:44, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-02-11 12:40:07)
> >>
> >> On 06/02/2019 13:03, Chris Wilson wrote:
> >>> To determine whether an engine has 'stuck', we simply check whether or
> >>> not it is still on the same seqno for several seconds. To keep this simple
> >>> mechanism intact over the loss of a global seqno, we can simply add a
> >>> new global heartbeat seqno instead. As we cannot know the sequence in
> >>> which requests will then be completed, we use a primitive random number
> >>> generator instead (with a cycle long enough to not matter over an
> >>> interval of a few thousand requests between hangcheck samples).
> >>
> >> We couldn't keep the global seqno just for hangcheck purposes? I mean as
> >> long as it is unique, which would be guaranteed by obtaining an
> >> increment on every submission to hw and storing it in atomic_t
> >> i915->hangcheck_global_seqno / rq->hangcheck_global_seqno, hangcheck
> >> does not care about the order of execution, no?
> > 
> > s/global_seqno/hangcheck_seqno/ ?
> 
> Yes sure, I was just trying to express the idea that a "globally" unique 
> number is all that I thought we need. Like:
> 
>      rq->hangcheck_seqno = atomic_inc_return(&i915->hangcheck_seqno);
> 
> Did I get that right then? That we don't really need the pseudo random 
> number solution? We could even avoid calling it a seqno if desired. 
> rq->unique, wait.. we possibly had this name for something in the past..

We don't need it to be random. I just picked a pseudo-random number so
that we get used to not expecting it to be sequential, and to be sure we
don't make the mistake of assuming it is.
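
To illustrate the point, here is a minimal userspace sketch, not the
i915 code. Every name in it (hc_device, hc_engine, hc_token_counter,
hc_token_prng, hc_sample) is hypothetical, and C11 atomics stand in for
the kernel's atomic_t. It assumes the sampler is only run on an engine
that has work queued. Both token sources give values that are unique
over the sampling interval; the sampler only ever tests equality, so the
order of the values is irrelevant.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
struct hc_device {
	atomic_uint next_token;	/* option A: atomically incremented counter */
	uint32_t prng_state;	/* option B: xorshift32 state, must be non-zero */
};

struct hc_engine {
	uint32_t current_token;	/* token of the request being executed */
	uint32_t last_sample;	/* token seen at the previous sample */
	unsigned int stalled;	/* consecutive samples without a change */
};

/* Option A: a counter; values are unique until the 32-bit counter wraps. */
static uint32_t hc_token_counter(struct hc_device *dev)
{
	return atomic_fetch_add(&dev->next_token, 1) + 1;
}

/*
 * Option B: xorshift32; visits every non-zero 32-bit value once per
 * cycle, in a non-sequential order. Not thread-safe as written.
 */
static uint32_t hc_token_prng(struct hc_device *dev)
{
	uint32_t x = dev->prng_state;

	assert(x != 0);	/* zero is a fixed point of xorshift */
	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	dev->prng_state = x;
	return x;
}

/*
 * One hangcheck sample. Only equality with the previous sample is
 * tested, never ordering. Returns true once the token has stayed the
 * same for 'threshold' consecutive samples.
 */
static bool hc_sample(struct hc_engine *engine, unsigned int threshold)
{
	if (engine->current_token != engine->last_sample) {
		engine->last_sample = engine->current_token;
		engine->stalled = 0;
		return false;
	}

	return ++engine->stalled >= threshold;
}
```

Since hc_sample() never compares magnitudes, swapping hc_token_counter()
for hc_token_prng() changes nothing for the sampler. The PRNG variant
just makes any leftover "is this seqno later than that one" assumption
fail loudly instead of appearing to work.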

> > (a) the goal is to kill off global_seqno entirely so we are all sure
> > there is no such seqno or ordering anymore
> > (b) this is a temporary patch and we kill off hangcheck_seqno, just as
> > soon as I can submit requests without struct_mutex
> 
> The heartbeat request solution? Is that better than the hangcheck seqno?

Yes. We don't need an extra seqno for every request, and it handles
preemptible OpenCL persistent kernels, as well as any other long-running
compute batch (thinking of some of the WebGL tests; they expect hangcheck
to fire, but also expect that it isn't too quick, afair).
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx