On 11/05/18 16:51, Chris Wilson wrote:
Quoting Lionel Landwerlin (2018-05-11 16:43:02)
On 11/05/18 15:18, Chris Wilson wrote:
Quoting Lionel Landwerlin (2018-05-11 15:14:13)
My understanding of the virtual memory addressing from the GPU is limited...
But how can the GPU poke at the kernel's allocated data?
I thought we mapped into the GPU's address space only what is allocated
through gem.
Correct. The HW should only be accessing the pages through the GTT and
the GTT should only contain known pages (or a pointer to the scratch
page). There is maybe a hole where we are freeing the memory before
the HW has finished using it (still writing through stale TLB and
whatnot even though the system has reallocated the pages), but other
than that quite, quite scary. Hence this awooga.
-Chris
I managed to reproduce a kasan backtrace on the same test.
So it's not just the CI machine.
But I can't even start up gdm on that machine with drm-tip. So maybe
there is something much more broken...
Don't leave us in suspense...
Your first patch (check that OA is actually disabled) seems to get rid
of the issue on my machine.
Thanks a lot for finding that!
Trying to find when HSW went wrong now. Same kernel works just fine on
my SKL.
i915/perf unpins the object correctly before freeing (at which point it
could be reused).
Sure, but does perf know that the OA unit has stopped writing at that
point... That's not so clear (from my pov).
Clearly it wasn't :(
Should we ensure that in i915_vma_destroy() for i915/perf maybe?
It almost seems like this is an issue that could arise in other parts
of the driver too.
The problem of the HW continuing to access the pages after unbinding is
inherent to the system (and what actually happens if we change PTEs in
flight is usually undefined), hence the great care we take to track HW
activity and try not to release pages while the HW is still using them.
-Chris