Re: [PATCH v3] drm/i915: Refactor PAT/object cache handling

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 20/07/2023 01:22, Matt Roper wrote:
On Wed, Jul 19, 2023 at 05:07:15PM -0700, Yang, Fei wrote:
[snip]
@@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)

The code change here looks accurate, but while we're here, I have a side
question about this function in general...it was originally introduced
in commit 48004881f693 ("drm/i915: Mark CPU cache as dirty when used for
rendering") which states that GPU rendering ends up in the CPU cache
(and thus needs a clflush later to make sure it lands in memory).  That
makes sense to me for LLC platforms, but is it really true for non-LLC
snooping platforms (like MTL) as the commit states?

For non-LLC platforms objects can be set to 1-way coherent which means
GPU rendering ending up in CPU cache as well, so for non-LLC platform
the logic here should be checking 1-way coherent flag.

That's the part that I'm questioning (and not just for MTL, but for all
of our other non-LLC platforms too).  Just because there's coherency
doesn't mean that device writes landed in the CPU cache.  Coherency is
also achieved if device writes invalidate the contents of the CPU cache.
I thought our non-LLC snooping platforms were coherent due to
write-invalidate rather than write-update, but I can't find it
specifically documented anywhere at the moment.  If write-invalidate was
used, then there shouldn't be a need for a later clflush either.

[Trying to consolidate by doing a combined reply to the discussion so far.]

On the write-invalidate vs write-update I don't know. If you did not find it in bspec then I doubt I would. I can have a browse still.

My understanding
was that snooping platforms just invalidated the CPU cache to prevent
future CPU reads from seeing stale data but didn't actually stick any
new data in there?  Am I off track or is the original logic of this
function not quite right?

Anyway, even if the logic of this function is wrong, it's a mistake that
would only hurt performance

Yes, this logic will introduce performance impact because it's missing the
checking for obj->pat_set_by_user. For objects with pat_set_by_user==true,
even if the object is snooping or 1-way coherent, we don't want to enforce
a clflush here since the coherency is supposed to be handled by user space.

What should I add you think to fix it?

Add a check for non-coherent WB in gpu_write_needs_clflush as an additional condition for returning false?

And then if Matt is correct write-invalidate is used also !HAS_LLC should just return false?

(flushing more often than we truly need to)
rather than functionality, so not something we really need to dig into
right now as part of this patch.

      if (IS_DGFX(i915))
              return false;

-    /*
-     * For objects created by userspace through GEM_CREATE with pat_index
-     * set by set_pat extension, i915_gem_object_has_cache_level() will
-     * always return true, because the coherency of such object is managed
-     * by userspace. Othereise the call here would fall back to checking
-     * whether the object is un-cached or write-through.
-     */
-    return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
-             i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
+    return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
+           i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
  }

[snip]
@@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
      if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
              return false;

-    /*
-     * For objects created by userspace through GEM_CREATE with pat_index
-     * set by set_pat extension, i915_gem_object_has_cache_level() always
-     * return true, otherwise the call would fall back to checking whether
-     * the object is un-cached.
-     */
      return (cache->has_llc ||
              obj->cache_dirty ||
-            !i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
+            i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1);

Platforms with relocations and platforms with user-specified PAT have no
overlap, right?  So a -1 return should be impossible here and this is
one case where we could just treat the return value as a boolean, right?


Hm no, or maybe. My thinking behind tri-state is to allow a safe option for "don't know". In case PAT index to cache mode table is not fully populated on some future platform.

My understanding is that the condition here means to say that, if GPU
access is uncached, don't use CPU reloc because the CPU cache might
contain stale data. This condition is sufficient for snooping platforms.
But from MTL onward, the condition show be whether the GPU access is
coherent with CPU. So, we should be checking 1-way coherent flag instead
of UC mode, because even if the GPU access is WB, it's still non-coherent,
thus CPU cache could be out-dated.

Honestly the matrix of caching decision/logic is huge.. Could we re-use one of the "coherent for read/write" here? From which point of view are those flags looking, and from which point of view do we need it to be?

You are saying question is whether GPU reads are coherent with CPU writes, will the GPU see CPU writes?

Why would the condition then be "not UC"? Hm maybe because it is slow so we want to avoid it?

What would it mean for Meteorlake? 1-way coherent WB, which way is one way? :)

In his reply elsewhere Matt seems to be suggesting WB and WB-1-way are coherent in this aspect, and that actually WT and UC are not, I'll quote:

"""
Does the has_wb_mode here matter?  I think MTL's WT (PAT 1) and UC (PAT
2) are also problematic because the GPU still won't snoop the CPU caches
and can potentially read whatever data was in the buffer before a CPU
memory clearing operation, right?  Basically anything besides PAT3 and
PAT4 on MTL can miss updates that are still sitting in the CPU cache.
"""

Also, how does this relate to i915_gem_object_can_bypass_llc. Should we use similar logic at both places - "can GPU not see some CPU update". Just that the MOCS angle and I915_BO_ALLOC_USER alloc user do not apply in use_cpu_reloc.

My point is that this is relocation code --- it should be impossible to
get here on MTL and beyond, right?  So user-provided PAT isn't a
consideration.

Ah no, it isn't, but.. one of the most important points why I wanted to clean this up is to make it generic. Like if we enable set_pat extension on any platform it just works as long as the pat-to-cache table is correctly defined.

So I wouldn't dismiss that goal and if in the process we achieve some good comments and clarifications in the code base that would be a great result I think.

Regards,

Tvrtko


[snip]
@@ -208,12 +230,6 @@ bool i915_gem_object_can_bypass_llc(struct drm_i915_gem_object *obj)
      if (!(obj->flags & I915_BO_ALLOC_USER))
              return false;

-    /*
-     * Always flush cache for UMD objects at creation time.
-     */
-    if (obj->pat_set_by_user)
-            return true;
-

I'm still worried that the removal of these lines would cause the
MESA failure seen before. I know you are checking pat index below, but
that is only about GPU access. It doesn't tell you how CPU is going to
access the memory. If user space is setting an uncached PAT, then use
copy engine to zero out the memory, but on the CPU side the mapping is
cacheable, you could still seeing garbage data.

I agree the lines don't belong here because it doesn't have anything
to do with LLC, but they need to be moved to the right location instead
of being removed.

These lines got replaced with a check for the specific PAT indices that
are problematic rather than just assuming any user-provided PAT might
cause problems.  But I had some concerns about the specific logic there
in my review as well.


Matt


      /*
       * EHL and JSL add the 'Bypass LLC' MOCS entry, which should make it
       * possible for userspace to bypass the GTT caching bits set by the
@@ -226,7 +242,21 @@ bool i915_gem_object_can_bypass_llc(struct drm_i915_gem_object *obj)
       * it, but since i915 takes the stance of always zeroing memory before
       * handing it to userspace, we need to prevent this.
       */
-    return IS_JSL_EHL(i915);
+    if (IS_JSL_EHL(i915))
+            return true;
+




[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux