Re: [PATCH 15/21] drm/i915/gtt: Fill scratch page

Tomas Elf <tomas.elf@xxxxxxxxx> · Thu, 04 Jun 2015 12:08:17 +0100

On 01/06/2015 16:53, Chris Wilson wrote:
On Wed, May 27, 2015 at 07:12:02PM +0100, Tomas Elf wrote:
On 22/05/2015 18:05, Mika Kuoppala wrote:
During review of dynamic page tables series, I was able
to hit a lite restore bug with execlists. I assume that
due to incorrect pd, the batch run out of legit address space
and into the scratch page area. The ACTHD was increasing
due to scratch being all zeroes (MI_NOOPs). And as gen8
address space is quite large, the hangcheck happily waited
for a long long time, keeping the process effectively stuck.

According to Chris Wilson any modern gpu will grind to halt
if it encounters commands of all ones. This seemed to do the
trick and hang was declared promptly when the gpu wandered into
the scratch land.

v2: Use 0xffff00ff pattern (Chris)

Just for my own benefit:

1. Is there any particular reason for this pattern rather than 0xffffffff?

It is more obvious when userspace reads from the page and copies it into
its own data structures or surfaces. See below, if this does impact
userspace we should probably revert this patch anyway.

2. Someone please correct me if I'm wrong here but at least based on
my own experiences with gen9 submitting batch buffers filled with
bad instructions (0xffffffff) to the GPU does not hang it. I'm
guessing that is because there's allegedly a hardware security
parser that MI_NOOPs out invalid instructions during execution. If
that's the case here then I guess we might have to come up with
something else for gen9+ if we want to induce engine hangs once the
execution reaches the scratch page?

It's not a problem, there will be a GPU hang eventually (in theory at
least). Mika is just trying to shortcircuit that by causing an immediate
hang.

Interesting! Why do you think the execution would hang eventually? 
Simply because it would end up at an invalid opcode somewhere in memory 
at some point, which would trig a hang? If we're talking gen9 then I'm 
assuming that the security parser would MI_NOOP out all invalid opcodes 
before they get a chance to hang the GPU (that would certainly be the 
case with 0xffffffff and 0xffff00ff, based on my own experiments). Or 
are we assuming that at some point the execution would end up at a valid 
opcode that would still put the GPU in an arbitrary execution state that 
would take a practically infinite amount of time to complete? Such as 
executing a semaphore instruction that no thread would ever think of 
signalling?

Thanks,
Tomas

On the other hand, on gen9+ page faulting is supposedly not broken
anymore so maybe we don't need the scratch page to begin with there
so maybe it's all moot at that point? Again, if I'm making no sense
here feel free to set things straight, I'm very curious about how
all of this is supposed to work.

Generating a pagefault for invalid access is an ABI change and requires
opt-in (we have discussed context flags in the past). The most obvious
example is the CS prefetch, which we have to prevent generating faults by
providing guard pages (on older chipsets at least). But as we have been
historically lax on allowing userspace to access invalid pages, we have
to assume that userspace has been taking advantage of that.
-Chris

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx