On Thu, Jul 20, 2023 at 09:28:56PM -0700, Yang, Fei wrote: > >>> [snip] > >>>>> @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct > >>>>> drm_i915_gem_object *obj) > >>>> > >>>> The code change here looks accurate, but while we're here, I have a > >>>> side question about this function in general...it was originally > >>>> introduced in commit 48004881f693 ("drm/i915: Mark CPU cache as > >>>> dirty when used for > >>>> rendering") which states that GPU rendering ends up in the CPU cache > >>>> (and thus needs a clflush later to make sure it lands in memory). > >>>> That makes sense to me for LLC platforms, but is it really true for > >>>> non-LLC snooping platforms (like MTL) as the commit states? > >>> > >>> For non-LLC platforms objects can be set to 1-way coherent which > >>> means GPU rendering ending up in CPU cache as well, so for non-LLC > >>> platform the logic here should be checking 1-way coherent flag. > >> > >> That's the part that I'm questioning (and not just for MTL, but for > >> all of our other non-LLC platforms too). Just because there's > >> coherency doesn't mean that device writes landed in the CPU cache. > >> Coherency is also achieved if device writes invalidate the contents of the CPU cache. > >> I thought our non-LLC snooping platforms were coherent due to > >> write-invalidate rather than write-update, but I can't find it > >> specifically documented anywhere at the moment. If write-invalidate > >> was used, then there shouldn't be a need for a later clflush either. > > > > [Trying to consolidate by doing a combined reply to the discussion so far.] > > > > On the write-invalidate vs write-update I don't know. If you did not > > find it in bspec then I doubt I would. I can have a browse still. > > Matt was correct. Quote Ron Silvas from SW ARCH, "MTL GPU doesn't write to > CPU cache, it simply snoop CPU cache on its way to RAM." That's good to know. We probably also want to find out if the same is true for old snooping platforms (i.e., things like VLV/CHV/BXT) and/or dgpu's (where I think the behavior is probably defined by the pcie spec, but I'm not sure where to look for that). > > >>>> My understanding > >>>> was that snooping platforms just invalidated the CPU cache to > >>>> prevent future CPU reads from seeing stale data but didn't actually > >>>> stick any new data in there? Am I off track or is the original > >>>> logic of this function not quite right? > >>>> > >>>> Anyway, even if the logic of this function is wrong, it's a mistake > >>>> that would only hurt performance > >>> > >>> Yes, this logic will introduce performance impact because it's > >>> missing the checking for obj->pat_set_by_user. For objects with > >>> pat_set_by_user==true, even if the object is snooping or 1-way > >>> coherent, we don't want to enforce a clflush here since the > >>> coherency is supposed to be handled by user space. > > > > What should I add you think to fix it? > > I think the simplest would be > > if (obj->pat_set_by_user) > return false; > > because even checking for incoherent WB is unnecessary, simply no > need for the KMD to initiate a flush if PAT is set by user. I guess one thing we're missing today is a well-documented explanation of the expectations for i915_gem_set_domain_ioctl behavior when we're using a user-defined PAT? Matt > > > Add a check for non-coherent WB in gpu_write_needs_clflush as an additional condition for returning false? > > > > And then if Matt is correct write-invalidate is used also !HAS_LLC should just return false? > > > >>>> (flushing more often than we truly need to) rather than > >>>> functionality, so not something we really need to dig into right now > >>>> as part of this patch. > >>>> > >>>>> if (IS_DGFX(i915)) > >>>>> return false; > >>>>> > >>>>> - /* > >>>>> - * For objects created by userspace through GEM_CREATE with pat_index > >>>>> - * set by set_pat extension, i915_gem_object_has_cache_level() will > >>>>> - * always return true, because the coherency of such object is managed > >>>>> - * by userspace. Othereise the call here would fall back to checking > >>>>> - * whether the object is un-cached or write-through. > >>>>> - */ > >>>>> - return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) || > >>>>> - i915_gem_object_has_cache_level(obj, I915_CACHE_WT)); > >>>>> + return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 && > >>>>> + i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1; > >>>>> } > >>> > >>> [snip] > >>>>> @@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache, > >>>>> if (DBG_FORCE_RELOC == FORCE_GTT_RELOC) > >>>>> return false; > >>>>> > >>>>> - /* > >>>>> - * For objects created by userspace through GEM_CREATE with pat_index > >>>>> - * set by set_pat extension, i915_gem_object_has_cache_level() always > >>>>> - * return true, otherwise the call would fall back to checking whether > >>>>> - * the object is un-cached. > >>>>> - */ > >>>>> return (cache->has_llc || > >>>>> obj->cache_dirty || > >>>>> - !i915_gem_object_has_cache_level(obj, I915_CACHE_NONE)); > >>>>> + i915_gem_object_has_cache_mode(obj, > >>>>> + I915_CACHE_MODE_UC) != 1); > >>>> > >>>> Platforms with relocations and platforms with user-specified PAT > >>>> have no overlap, right? So a -1 return should be impossible here > >>>> and this is one case where we could just treat the return value as > >>>> a boolean, right? > >>> > > > > Hm no, or maybe. My thinking behind tri-state is to allow a safe option > > for "don't know". In case PAT index to cache mode table is not fully > > populated on some future platform. > > That would be a problem in the cache mode table. At least max_pat_index > should have guaranteed the PAT index is sane. > > -Fei -- Matt Roper Graphics Software Engineer Linux GPU Platform Enablement Intel Corporation