On 21/07/2023 05:28, Yang, Fei wrote:
[snip]
@@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct
drm_i915_gem_object *obj)
The code change here looks accurate, but while we're here, I have a
side question about this function in general...it was originally
introduced in commit 48004881f693 ("drm/i915: Mark CPU cache as
dirty when used for
rendering") which states that GPU rendering ends up in the CPU cache
(and thus needs a clflush later to make sure it lands in memory).
That makes sense to me for LLC platforms, but is it really true for
non-LLC snooping platforms (like MTL) as the commit states?
For non-LLC platforms objects can be set to 1-way coherent which
means GPU rendering ending up in CPU cache as well, so for non-LLC
platform the logic here should be checking 1-way coherent flag.
That's the part that I'm questioning (and not just for MTL, but for
all of our other non-LLC platforms too). Just because there's
coherency doesn't mean that device writes landed in the CPU cache.
Coherency is also achieved if device writes invalidate the contents of the CPU cache.
I thought our non-LLC snooping platforms were coherent due to
write-invalidate rather than write-update, but I can't find it
specifically documented anywhere at the moment. If write-invalidate
was used, then there shouldn't be a need for a later clflush either.
[Trying to consolidate by doing a combined reply to the discussion so far.]
On the write-invalidate vs write-update I don't know. If you did not
find it in bspec then I doubt I would. I can have a browse still.
Matt was correct. Quote Ron Silvas from SW ARCH, "MTL GPU doesn't write to
CPU cache, it simply snoop CPU cache on its way to RAM."
Does it apply to all snooping platforms?
And for the cache level/mode based condition, how about replacing it with this:
/*
* Fully coherent cached access may end up with data in the CPU cache
* which hasn't hit memory yet.
*/
return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WB) &&
i915_gem_object_has_cache_flag(obj, I915_CACHE_FLAG_COH2W);
?
Although that would mean old I915_CACHE_LLC on old platforms is actually 2-way coherent.
I am struggling to find a comprehensive explanation in bspec, but for instance 605 makes it sounds like it is fully coherent. Perhaps it really is and I should fix the legacy and Gen12 table..
And if the write-invalidate applies to all snooping platforms then we extend it to:
/*
* Fully coherent cached access may end up with data in the CPU cache
* which hasn't hit memory yet.
*
* But not on snooping platforms, where it is impossible due
* write-invalidate.
*/
return !HAS_SNOOP(i915) &&
(i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WB) &&
i915_gem_object_has_cache_flag(obj, I915_CACHE_FLAG_COH2W));
That would prevent any flushing on MTL and make you happy from that aspect.
In fact, the snooping check could be before the cache mode check.
For i915_gem_object_can_bypass_llc it would be ideal if a condition based on the absence of I915_BO_CACHE_COHERENT_FOR_WRITE would work. At least according to the kerneldoc for @cache_coherent:
* I915_BO_CACHE_COHERENT_FOR_WRITE:
*
* When writing through the CPU cache, the GPU is still coherent. Note
* that this also implies I915_BO_CACHE_COHERENT_FOR_READ.
So for objects without it set, we need to force a flush.
And make __i915_gem_object_update_coherency not set it for WB without 1-way coherency set.
According to bspec that would seem correct, because with 1-way snooping on MTL, GPU snoops the IA until first GPU access. So anything the CPU writes before the first GPU access would be coherent and so no need to flush in set pages. But if non-coherent WB is set then we need to flush.
I'll trybot it is and see what happens.
My understanding
was that snooping platforms just invalidated the CPU cache to
prevent future CPU reads from seeing stale data but didn't actually
stick any new data in there? Am I off track or is the original
logic of this function not quite right?
Anyway, even if the logic of this function is wrong, it's a mistake
that would only hurt performance
Yes, this logic will introduce performance impact because it's
missing the checking for obj->pat_set_by_user. For objects with
pat_set_by_user==true, even if the object is snooping or 1-way
coherent, we don't want to enforce a clflush here since the
coherency is supposed to be handled by user space.
What should I add you think to fix it?
I think the simplest would be
if (obj->pat_set_by_user)
return false;
because even checking for incoherent WB is unnecessary, simply no
need for the KMD to initiate a flush if PAT is set by user.
Add a check for non-coherent WB in gpu_write_needs_clflush as an additional condition for returning false?
And then if Matt is correct write-invalidate is used also !HAS_LLC should just return false?
(flushing more often than we truly need to) rather than
functionality, so not something we really need to dig into right now
as part of this patch.
if (IS_DGFX(i915))
return false;
- /*
- * For objects created by userspace through GEM_CREATE with pat_index
- * set by set_pat extension, i915_gem_object_has_cache_level() will
- * always return true, because the coherency of such object is managed
- * by userspace. Othereise the call here would fall back to checking
- * whether the object is un-cached or write-through.
- */
- return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
- i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
+ return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
+ i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
}
[snip]
@@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
return false;
- /*
- * For objects created by userspace through GEM_CREATE with pat_index
- * set by set_pat extension, i915_gem_object_has_cache_level() always
- * return true, otherwise the call would fall back to checking whether
- * the object is un-cached.
- */
return (cache->has_llc ||
obj->cache_dirty ||
- !i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
+ i915_gem_object_has_cache_mode(obj,
+ I915_CACHE_MODE_UC) != 1);
Platforms with relocations and platforms with user-specified PAT
have no overlap, right? So a -1 return should be impossible here
and this is one case where we could just treat the return value as
a boolean, right?
Hm no, or maybe. My thinking behind tri-state is to allow a safe option
for "don't know". In case PAT index to cache mode table is not fully
populated on some future platform.
That would be a problem in the cache mode table. At least max_pat_index
should have guaranteed the PAT index is sane.
Yeah I dropped it locally already.
Regards,
Tvrtko