[benjamin.widawsky@xxxxxxxxx: intel_gpu_top broken for HSW. Ideas needed]

ben at bwidawsk.net (Ben Widawsky) · Fri, 12 Jul 2013 10:35:24 -0700

On Fri, Jul 12, 2013 at 10:12:39AM -0700, Ben Widawsky wrote:
> FWD'd from our internal list now that we have more insight.
> ----- Forwarded message from Ben Widawsky <benjamin.widawsky at intel.com> -----
> 
> Date: Thu, 11 Jul 2013 10:32:03 -0700
> From: Ben Widawsky <benjamin.widawsky at intel.com>
> To: linux-gfx at linux.intel.com
> Subject: intel_gpu_top broken for HSW. Ideas needed
> Message-ID: <20130711173202.GB8802 at intel.com>
> 
> Hi everybody.
> 
> While investigating a hard hang on Haswell, Eero noticed that
> intel_gpu_top helped to trigger the hang faster. I used this in the test
> case I gave to validation, and they suspect it is a known issue which we
> have not yet worked around (and cannot reasonably work around).
> 
> [internal bug sighting redacted]
> 
> To sum up, we cannot concurrently access registers within the same
> cacheline. It has the potential to hit a known bug.
> 
> I see some choices:
> 1. Don't do anything.
> 2. Try to eliminate shared registers as much as possible. Instdone is
>    used by the hangcheck, and we can eliminate hangcheck with a
>    module parameter. Eero, can you try this as a workaround, btw?
> 3. Somehow make the kernel collect the top data and serialize access
>    there.
> 
> Anyone else have input? I personally do not use top very much, so I
> won't be volunteering to do any of these.
> 

BTW, of course any tool which reads or writes registers is subject to
the same problem. GPU top is just the one that kind of depends upon us
not synchronizing with the kernel.
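
For illustration only, here is a rough sketch of what "serialize access"
in option 3 could mean: every register read or write goes through one
lock, so two accesses can never be in flight at the same time. This is
plain userspace C with a C11 atomic_flag spinlock. The names fake_mmio,
reg_lock, serialized_read and serialized_write are made up for the
example and are not taken from i915 or intel-gpu-tools; the array stands
in for the real MMIO space.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the MMIO register space. */
#define NUM_REGS 64
static uint32_t fake_mmio[NUM_REGS];

/* Single lock guarding every register access (assumption: one global
 * lock is acceptable; a real design might lock per cacheline). */
static atomic_flag reg_lock = ATOMIC_FLAG_INIT;

static void reg_lock_acquire(void)
{
    /* Spin until the flag was previously clear, i.e. we now own it. */
    while (atomic_flag_test_and_set_explicit(&reg_lock, memory_order_acquire))
        ;
}

static void reg_lock_release(void)
{
    atomic_flag_clear_explicit(&reg_lock, memory_order_release);
}

/* Read one register while holding the lock. */
uint32_t serialized_read(unsigned int idx)
{
    uint32_t val;

    reg_lock_acquire();
    val = fake_mmio[idx % NUM_REGS];
    reg_lock_release();
    return val;
}

/* Write one register while holding the lock. */
void serialized_write(unsigned int idx, uint32_t val)
{
    reg_lock_acquire();
    fake_mmio[idx % NUM_REGS] = val;
    reg_lock_release();
}
```

The catch is that this only helps if every accessor goes through the same
lock, which is why a userspace tool poking registers behind the kernel's
back cannot be fixed this way on its own.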

-- 
Ben Widawsky, Intel Open Source Technology Center