[benjamin.widawsky@xxxxxxxxx: intel_gpu_top broken for HSW. Ideas needed]

ben at bwidawsk.net (Ben Widawsky) · Fri, 12 Jul 2013 10:27:06 -0700



On Fri, Jul 12, 2013 at 07:16:37PM +0200, Daniel Vetter wrote:
> On Fri, Jul 12, 2013 at 7:12 PM, Ben Widawsky
> <benjamin.widawsky at intel.com> wrote:
> > FWD'd from our internal list now that we have more insight.
> > ----- Forwarded message from Ben Widawsky <benjamin.widawsky at intel.com> -----
> >
> > Date: Thu, 11 Jul 2013 10:32:03 -0700
> > From: Ben Widawsky <benjamin.widawsky at intel.com>
> > To: linux-gfx at linux.intel.com
> > Subject: intel_gpu_top broken for HSW. Ideas needed
> > Message-ID: <20130711173202.GB8802 at intel.com>
> >
> > Hi everybody.
> >
> > While investigating a hard hang on Haswell, Eero noticed that
> > intel_gpu_top helped to trigger the hang faster. I used this in my test
> > case for validation, and they suspect it is a known issue which we
> > have not yet worked around (and cannot reasonably work around).
> >
> > [internal bug sighting redacted]
> >
> > To sum up, we cannot concurrently access registers within the same
> > cacheline. It has the potential to hit a known bug.
> >
> > I see some choices:
> > 1. Don't do anything.
> > 2. Try to eliminate shared registers as much as possible. Instdone is
> >    used by the hangcheck, and we can eliminate hangcheck with a
> >    module parameter. Eero, can you try this as a workaround, btw?
> > 3. Somehow make the kernel collect the top data and serialize access
> >    there.
> >
> > Anyone else have input? I personally do not use top very much, so I
> > won't be volunteering to do any of these.
> 
> 
> For now I'd just vote for a warning on gen6+ on the intel-gpu-top
> screen that this might hang hw. If anyone cares we could add a debugfs
> interface (or finally get real approval for the performance counters
> the hw has and expose them properly). Not an intel_gpu_top user myself
> though.
> -Daniel
>
Eero: I meant to add, by the way, that ring head/tail are also used as
much as instdone, so maybe we can get rid of that for the ring fullness
check. We're *very* likely to hit that one.
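
To make the cacheline constraint above concrete, here is a rough sketch of
how one could check whether two register offsets share a cacheline. This is
not code from intel_gpu_top or the kernel: the helper name same_cacheline(),
the 64-byte line size, and any offsets passed to it are assumptions for
illustration only, since the thread does not give the real values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed cacheline size in bytes; the thread does not state the real value. */
#define ASSUMED_CACHELINE_BYTES 64u

/*
 * Hypothetical helper: return true if two MMIO register offsets fall
 * within the same cacheline of size line_bytes.
 */
static bool same_cacheline(uint32_t reg_a, uint32_t reg_b, uint32_t line_bytes)
{
    /* line_bytes must be a nonzero power of two for the mask below to work. */
    assert(line_bytes != 0 && (line_bytes & (line_bytes - 1)) == 0);

    /* Clear the low bits to get each offset's line base, then compare. */
    return (reg_a & ~(line_bytes - 1)) == (reg_b & ~(line_bytes - 1));
}
```

With a 64-byte line, two placeholder offsets such as 0x2030 and 0x2034 land
in the same line, while 0x2030 and 0x2040 do not. A check like this could be
used to list which registers polled by a userspace tool overlap with ones the
kernel reads.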

-- 
Ben Widawsky, Intel Open Source Technology Center