Re: [PATCH v2 2/2] drm/i915/bxt: Fix inadvertent CPU snooping due to incorrect MOCS config

Imre Deak <imre.deak@xxxxxxxxx> · Thu, 28 Apr 2016 20:15:24 +0300

On to, 2016-04-28 at 16:48 +0200, Daniel Vetter wrote:
> On Thu, Apr 28, 2016 at 11:38:55AM +0300, Imre Deak wrote:
> > On to, 2016-04-28 at 10:17 +0200, Daniel Vetter wrote:
> > > On Tue, Apr 26, 2016 at 07:01:06PM +0300, Imre Deak wrote:
> > > > On ti, 2016-04-26 at 15:42 +0100, Chris Wilson wrote:
> > > > > On Tue, Apr 26, 2016 at 05:26:43PM +0300, Eero Tamminen
> > > > > wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > On 26.04.2016 16:23, Chris Wilson wrote:
> > > > > > > On Tue, Apr 26, 2016 at 04:17:55PM +0300, Imre Deak
> > > > > > > wrote:
> > > > > > > > On ti, 2016-04-26 at 13:57 +0100, Chris Wilson wrote:
> > > > > > > > > On Tue, Apr 26, 2016 at 03:44:22PM +0300, Imre Deak
> > > > > > > > > wrote:
> > > > > > > > > > Setting a write-back cache policy in the MOCS entry
> > > > > > > > > > definition also
> > > > > > > > > > implies snooping, which has a considerable
> > > > > > > > > > overhead. This
> > > > > > > > > > is
> > > > > > > > > > unexpected for a few reasons:
> > > > > > > > > 
> > > > > > > > > If it is snooping, then I don't see why it is
> > > > > > > > > undesirable to
> > > > > > > > > have it
> > > > > > > > > available in a mocs setting. If it is bogus and the
> > > > > > > > > bit is
> > > > > > > > > undefined,
> > > > > > > > > then by all means remove it.
> > > > > > > > 
> > > > > > > > None of these entries are used alone for coherent
> > > > > > > > surfaces. For
> > > > > > > > that
> > > > > > > > the application would have to use entry index#1 or #2
> > > > > > > > _and_
> > > > > > > > call the
> > > > > > > > set caching IOCTL to set the corresponding buffer to be
> > > > > > > > cached.
> > > > > > > 
> > > > > > > No, the application doesn't. There are sufficent
> > > > > > > interfaces
> > > > > > > exposed that
> > > > > > > userspace can bypass the kernel if it so desired.
> > > > > > > 
> > > > > > > > The
> > > > > > > > problem is that without setting the buffer to be
> > > > > > > > cacheable the
> > > > > > > > expectation is that we won't be snooping and incur the
> > > > > > > > corresponding
> > > > > > > > overhead. This is what this patch addresses.
> > > > > > > 
> > > > > > > Not true.
> > > > > > > 
> > > > > > > > The bit is also bogus, if we wanted snooping via MOCS
> > > > > > > > we'd use
> > > > > > > > the
> > > > > > > > dedicated HW flag for that.
> > > > > > > 
> > > > > > > But you keep saying this bit *enables* snooping. So
> > > > > > > either it
> > > > > > > does or it
> > > > > > > doesn't.
> > > > > > > 
> > > > > > > > If we wanted to have a snooping MOCS entry we should
> > > > > > > > add that
> > > > > > > > separately (as a forth entry), but we'd need this
> > > > > > > > change as a
> > > > > > > > fix for
> > > > > > > > current users.
> > > > > > > 
> > > > > > > The current users who are getting what they request but
> > > > > > > don't
> > > > > > > know what
> > > > > > > they were requesting?
> > > > > > 
> > > > > > What this kernel ABI (index entry #2) has been agreed &
> > > > > > documented
> > > > > > to provide?
> > > > > 
> > > > > The ABI is what we agree makes sense between hardware /
> > > > > kernel /
> > > > > userspace, and then we stick to it.
> > > > > 
> > > > > > I thought this entry is supposed to replace the writeback
> > > > > > LLC/eLLC
> > > > > > cache MOCS setting Mesa is using on (e.g. BDW) to speed up
> > > > > > accesses
> > > > > > to a memory area which it knows always to be accessed so
> > > > > > that it
> > > > > > can
> > > > > > be cached.
> > > > > 
> > > > > Which it does... Only it speeds us writeback from the CPU,
> > > > > not the
> > > > > GPU. ;)
> > > > > 
> > > > > The confusion seems to be in mistaking !llc for llc. We have
> > > > > to come
> > > > > to
> > > > > some agreement on whether it makes sense having multiple
> > > > > entries for
> > > > > the
> > > > > same follows-PTE mocs or whether it makes more sense to
> > > > > expose the hw
> > > > > capabilties.
> > > > 
> > > > Note that we can't just say to Mesa to use index #0 on BXT
> > > > instead of
> > > > index #2, since index #0 turns off caching in GPU L3, so we'd
> > > > have to
> > > > also redefine that to be L3 cached. And I don't know what
> > > > turning on L3
> > > > caching for index #0 would break, I would rather avoid doing
> > > > that. Imo
> > > > defining the entries the following way would allow us to use
> > > > the same
> > > > indexes on GEN9 regardless of it being SKL or BXT:
> > > > #0: uncached (neither in L3 nor in (e)LLC)
> > > > #1: PTE passthrough
> > > 
> > > So our rendercpy in igt does set MOCS entry 1. Or how exactly do
> > > all the
> > > set_caching vs. rendercpy tests we currently have pass? Just not?
> > 
> > We don't have tests for coherent surfaces. The current rendercpy
> > test
> > just uses uncached buffers, so they are flushed before checking the
> > result. I could add a new subtest to rendercpy to test with a
> > cached/coherent surface.
> > 
> > > Also, you're guaranateeing that opencl/libva don't screw this up
> > > either?
> > 
> > If they don't set the given buffer to be cached via the set_caching
> > IOCTL (as a consequence making them coherent) they are already
> > screwed
> > on CHV. If they call the IOCTL they are fine on BXT too.
> 
> We do implicit set_caching when displaying something to something
> coherent. To make that work userspace should use the "use PTE" mode by
> default, except when they really know what they're doing.
> That's also the mode that's supposed to give you the most reasonable
> performance. But somehow that mode ended up in MOCS entry 1, so pretty much
> guaranteed userspace will get it wrong. Mesa just hit a perf snag, but
> might as well have been visual corruption. I think it'd be a lot safe to
> make "use PTE" entry 0.

Mesa uses entries 1 and 2. If something else like opencl or libva (or
even Mesa for that matter) uses index 0 for PTE pass-through that's a
bug on its own. I don't know if this is the case, but it's a separate
issue from what I'm trying to fix here.

This isn't about a case where a PTE pass-through entry needs to be
provided, but about the case where a cached but non-coherent one is
needed. Mesa assumes this to be entry 2 and I don't see why we couldn't
make sure that this assumption holds. Note that this entry on BXT could
be both a PTE pass-through one as in this patch or LLC-UC.

--Imre

> > > It's UABI, and we've botched it. Smells like we need to have a
> > > lot more
> > > solid story and make sure we get this right this time around.
> > > There's also
> > > the "how does dma-buf mmap support fit in" question.
> > > 
> > > Oh and nope, none of the relevant testcases are in BAT at all.
> > > 
> > > > #2: cached everywhere (L3 + (e)LLC if it exists), non-coherent
> > > > #3: cached everywhere, coherent
> > > > 
> > > > I'm not sure if there is even a user for coherent surfaces atm,
> > > > so then
> > > > we could delay adding #3 until it's really needed.
> > > 
> > > The problem with the above is that the various access paths on
> > > SoC
> > > (without LLC) and big core (with llc) are massively different.
> > > You need
> > > lots of different cases in upload/download paths for max speed
> > > anyway.
> > 
> > The current assumption everywhere in user space is the above three
> > entries with their definition above as far as I know. There are no
> > users that would use a MOCS entry to get a coherent surface, we
> > could
> > add that as a 4th entry once a user arises.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx