Re: [PATCH v2 2/2] drm/i915/bxt: Fix inadvertent CPU snooping due to incorrect MOCS config

Dave Gordon <david.s.gordon@xxxxxxxxx> · Wed, 27 Apr 2016 19:42:43 +0100

On 27/04/16 15:53, Chris Wilson wrote:
On Wed, Apr 27, 2016 at 04:25:09PM +0300, Eero Tamminen wrote:
Hi,

On 26.04.2016 20:25, Frederick, Michael T wrote:
Sorry, I'm not tracking all the MOCS discussions.  I just want to indicate what coherency means in the SoC for BXT.

GTI sets the non-inclusive bit on the IDI interface based on how it treats the memory.  In the BXT case, where there is no uncore cache, "non-inclusive" just indicates whether to snoop or not.  BXT has a snoop filter in order to make the latency of snooping GT from a core roughly similar to that of snooping another core.

For BXT:
If GTI sets non-inclusive=0 (i.e. coherent): the transaction looks up in the SF and the SA snoops the cores.  The potential impact here is that for high-BW coherent traffic, the SF will become the BW limiter of the system and cap BW at 33% * 34GBps.  For writes like WCILFs, snoops to cores must be resolved before the SA requests WR data from GT.  For reads, the common case should have no impact because snoop latency is generally much less than memory data latency.  In general, snoop latency for a core is relatively small, but there is also the prospect that a core could be down (e.g. ratio change) or loaded with snooping.
If GTI sets non-inclusive=1 (i.e. non-coherent): the transaction takes the SF bypass and the SA does not snoop the cores.  This is best for high-BW traffic since it removes the SF bottleneck and doesn't require core interaction.
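
To put a number on the cap described above, here is a minimal sketch that takes the 33% and 34GBps figures from the text at face value. The function and constant names are illustrative only and do not come from any driver or hardware documentation.

```python
# Figures quoted in the message above; all names here are hypothetical.
QUOTED_BW_GBPS = 34.0   # bandwidth figure quoted above
SF_FRACTION = 0.33      # share of that figure sustainable through the snoop filter (SF)


def max_bandwidth_gbps(non_inclusive: bool) -> float:
    """Rough upper bound on GT traffic bandwidth for a given non-inclusive bit.

    non_inclusive=False (coherent): traffic looks up in the SF, so the SF cap applies.
    non_inclusive=True (non-coherent): traffic takes the SF bypass, so only the
    quoted bandwidth figure bounds it.
    """
    if non_inclusive:
        return QUOTED_BW_GBPS
    return SF_FRACTION * QUOTED_BW_GBPS


print(f"coherent cap:     {max_bandwidth_gbps(False):.2f} GBps")
print(f"non-coherent cap: {max_bandwidth_gbps(True):.2f} GBps")
```

Under these assumptions the coherent path is bounded at roughly a third of what the bypass path allows, which is why the non-coherent setting is described as best for high-BW traffic.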

Thanks for the explanation!

AFAIK:

* With regard to 3D driver operations, the CPU side doesn't modify the
buffer contents while the GPU is working on them.  The CPU side sets up
the buffers (textures, VBOs, batches etc), and then (after a flush) the
GPU is asked to act on them.

* For things like texture streaming, the driver either internally
synchronizes the data or creates a new copy of it whenever the
application says the data has been updated.  There's always some kind
of "upload" involved (the GL API needs it, as non-integrated GPUs don't
share memory with the CPU).

While it's possible that there's a case where snooping would be
needed, I cannot think of any myself.

Daniel, Chris, did you have some concrete example in mind where the 3D
driver would require the CPU to snoop the GPU?

Not mesa, but X can do concurrent rendering to a Pixmap whilst also
rendering from other parts of that Pixmap into a GPU side buffer and
presentation/compositing thereof. X uses snooping both ways (from client
memory to GPU and from GPU to client memory) as well as mixed rendering.

Mesa should be using snooping for both SubTexImage and GetTexImage. On
the SubTexImage path you can use the sampler to do format conversions,
which, even including the sync overhead needed for correctness when
using client memory, avoids the awful format conversion code in mesa.
Using the GPU to write into client memory and avoiding WC reads is
approximately an order of magnitude (8x) faster than the current code
mesa uses.
-Chris

Presumably it's useful for the CPU to snoop the h/w status page(s), and 
maybe the ring-context part of a context image (so that TAIL updates are 
coherent), but OTOH snooping the rest of the context image might add 
overhead, and AFAIK we don't normally read (or write) any of that after 
setup. So maybe we don't want vmap-whole-object after all?

.Dave.
