Re: [PATCH v2] drm/i915: Refactor PAT/cache handling

"Yang, Fei" <fei.yang@xxxxxxxxx> · Fri, 7 Jul 2023 00:25:39 +0000

>>>>> @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct  drm_i915_gem_object *obj)
>>>>>          if (IS_DGFX(i915))
>>>>>              return false;
>>>>> -       /*
>>>>> -        * For objects created by userspace through GEM_CREATE with pat_index
>>>>> -        * set by set_pat extension, i915_gem_object_has_cache_level() will
>>>>> -        * always return true, because the coherency of such object is managed
>>>>> -        * by userspace. Othereise the call here would fall back to checking
>>>>> -        * whether the object is un-cached or write-through.
>>>>> -        */
>>>>> -       return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
>>>>> -                i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
>>>>> +       return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
>>>>> +              i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
>>>>
>>>> This logic was changed for objects with pat index set by user here. It
>>>> used to return false regardless of the cache mode. But now it returns true
>>>> if its cache mode is neither UC nor WT.
>>>
>>> Yes, that was half of the motivation of the refactoring. Before the PAT
>>> index series code was like this:
>>>
>>>        return !(obj->cache_level == I915_CACHE_NONE ||
>>>                 obj->cache_level == I915_CACHE_WT);
>>> So kernel knew it needs to flush only if caching mode is neither UC nor WT.
>>> With the PAT index series you changed it to:
>>>
>>>        return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
>>>                 i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
>>> And as i915_gem_object_has_cache_level was changed to always return true
>>> if PAT was set by user, that made the check meaningless for such objects.
>>
>> But the point is that the KMD should not flush the cache for objects
>> with PAT set by user because UMD is handling the cache coherency.
>> That is the design. Well, doing so wouldn't cause a functional issue, but
>> it's unnecessary and might have a performance impact.
>
> Not all i915_gem_object_has_cache_level() are even about flushing the cache
> and if the kernel doesn't know what is behind a PAT index
> (i915_gem_object_has_cache_level lies by always returning true) are we 100%
> sure everything is functionally correct?
>
> flush_write_domain() for instance uses it to determine whether to set
> obj->cache_dirty after GPU activity. How does the UMD manage that?
>
> Then use_cpu_reloc(). Another pointless/misleading question.
>
> Finally vm_fault_gtt() rejects access based on it.
>
> Perhaps the question is moot since the set pat extension is restricted to
> MTL so some other conditions used in the above checks, like HAS_LLC and such,
> make for no practical difference. Even if so, what if the extension was allowed
> on other platforms as it was the plan until it was discovered there is no
> userspace code for other platforms. Would the plan work on all platforms? And
> even if it would I think the implementation is very non-obvious.
>

I understand your point. Perhaps we should let i915_gem_object_has_cache_mode()
do what it is supposed to do, and add a separate check for obj->pat_set_by_user
in functions like gpu_write_needs_clflush(), use_cpu_reloc(), etc. Anyway,
the design is to let UMD handle coherency for objects with pat set by user.
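
Roughly what I have in mind is sketched below. This is only an illustration
with made-up stand-in types (fake_obj, cache_mode, has_cache_mode are all
hypothetical, not the real i915 definitions): the cache mode query reports
the real mode, and the pat_set_by_user policy is an explicit, separate check.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the i915 types; not the real kernel definitions. */
enum cache_mode { MODE_UC, MODE_WT, MODE_WB };

struct fake_obj {
	bool pat_set_by_user;	/* PAT index chosen by userspace via set_pat */
	enum cache_mode mode;	/* what the object's PAT index actually means */
	bool is_dgfx;		/* stand-in for IS_DGFX(i915) */
};

/* Reports the real caching mode, with no special case for userspace PAT. */
static bool has_cache_mode(const struct fake_obj *obj, enum cache_mode mode)
{
	return obj->mode == mode;
}

static bool gpu_write_needs_clflush(const struct fake_obj *obj)
{
	if (obj->is_dgfx)
		return false;

	/* Assumed policy: UMD manages coherency for these, so KMD skips the flush. */
	if (obj->pat_set_by_user)
		return false;

	return !(has_cache_mode(obj, MODE_UC) || has_cache_mode(obj, MODE_WT));
}
```

With this shape the policy decision is visible at each call site instead of
being hidden inside the query helper.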

>>> With my refactoring it becomes meaningful again and restores to the
>>> original behaviour. That's the intent at least.
>>>
>>>>>  bool i915_gem_cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>>>>> @@ -255,9 +249,9 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
>>>>>  }
>>>>>
>>>>>  /**
>>>>> - * i915_gem_object_set_cache_level - Changes the cache-level of an object across all VMA.
>>>>> + * i915_gem_object_set_cache - Changes the cache-level of an object across all VMA.

[...]

>>>>> -       if (i915_gem_object_has_cache_level(obj, cache_level))  
>>>>> +       ret = i915_cache_find_pat(i915, cache);
>>>>> +       if (ret < 0) {
>>>>> +           struct drm_printer p =
>>>>> +                drm_err_printer("Attempting to use unknown caching mode ");
>>>>> +  
>>>>> +           i915_cache_print(&p, cache);
>>>>> +           drm_puts(&p, "!\n");
>>>>> +
>>>>> +           return -EINVAL;
>>>>> +       } else if (ret == obj->pat_index) {
>>>>>             return 0;
>>>> We will have to do this conversion and checking again later in this
>>>> function when calling i915_gem_object_set_cache_coherency().
>>>
>>> Yes the double lookup part is a bit naff. It is not super required  
>>> apart for validating internal callers (could be a debug build only
>>> feature perhaps) and to see if PAT index has changed and so whether  
>>> it needs to call i915_gem_object_wait before proceeding to
>>> i915_gem_object_set_cache_coherency...
>>>
>>>> My thought was to simply remove the use of cache_level/cache and replace
>>>> it with pat_index. Conversion from cache modes to pat index should be done
>>>> before calling any function used to consume cache_level/cache.
>>>
>>> ... I could probably call the setter which takes PAT index instead of
>>> i915_gem_object_set_cache_coherency few lines below. That would skip the
>>> double lookup and make you happy(-ier)?
>> 
>> Do you see any problem just letting these functions take pat_index as
>> the second argument? These functions are currently called with a
>> constant cache_level/mode, if we have INTEL_INFO(i915)->pat_uc/wt/wb
>> set properly, using pat index makes no difference, right?
>
> Which ones?

i915_gem_object_set_cache_level() and i915_gem_object_set_cache_coherency()
are both being called with cache_level as of now. That is not necessary if
platform specific INTEL_INFO(i915)->pat_uc/wt/wb are there. We can simply

s/I915_CACHE_NONE/INTEL_INFO(i915)->pat_uc
s/I915_CACHE_WT/INTEL_INFO(i915)->pat_wt
s/I915_CACHE_LLC/INTEL_INFO(i915)->pat_wb
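
To illustrate the substitution, a minimal sketch follows. The types and
helpers (fake_device_info, fake_obj, fake_set_cache, fake_set_uncached) are
hypothetical, and the index values are simply read off the tables quoted
further down in this thread, so treat them as illustrative rather than
authoritative.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-ins; not the real i915 structures. */
struct fake_device_info {
	unsigned int pat_uc;
	unsigned int pat_wt;
	unsigned int pat_wb;
};

struct fake_obj {
	unsigned int pat_index;
};

/* Illustrative per-platform values, taken from the tables in this thread. */
static const struct fake_device_info legacy_info = {
	.pat_uc = 0, .pat_wt = 3, .pat_wb = 1,
};

static const struct fake_device_info mtl_info = {
	.pat_uc = 2, .pat_wt = 1, .pat_wb = 3,	/* pat_wb means WB + 1-way coherent */
};

/* The setter consumes a PAT index directly; returns true if the object changed. */
static bool fake_set_cache(struct fake_obj *obj, unsigned int pat_index)
{
	if (obj->pat_index == pat_index)
		return false;	/* nothing to do, no cache mode lookup needed */

	obj->pat_index = pat_index;
	return true;
}

/* A call site that used to pass I915_CACHE_NONE now passes the platform's pat_uc. */
static bool fake_set_uncached(struct fake_obj *obj,
			      const struct fake_device_info *info)
{
	return fake_set_cache(obj, info->pat_uc);
}
```

The caller never needs a cache-mode-to-PAT conversion inside the setter,
which is what avoids the double lookup discussed above.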

[...]

>>>>> if (i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WB))
>>>> This looks wrong, at least for MTL. Prior to MTL the GPU automatically
>>>> snoops the CPU cache, but from MTL onward you have to specify if WB or
>>>> WB + 1-WAY COH is needed. And for KMD, cacheable mode means WB +
>>>> 1-WAY COH for MTL to keep the behavior consistent.
>>>>
>>>> This used to be taken care of by i915_gem_get_pat_index() call.
>>>> With that being replaced by i915_cache_find_pat(), you would need
>>>> to do something there.
>>>> But, without cachelevel_to_pat[], you might end up hard coding
>>>> something directly in the function, and that is platform
>>>> dependent. hmm..., I don't really like this idea.
>>>>
>>>> That's why I commented in v1 that we should use
>>>> INTEL_INFO(i915)->pat_uc/wb/wt instead of enum i915_cache_level or
>>>> i915_cache_t.
>>>
>>> I think I get it. I hope so.. So if I made the tables like this:
>>>
>>> #define LEGACY_CACHE_MODES \
>>>        .cache_modes = { \
>>>                [0] = I915_CACHE(UC), \
>>>                [1] = _I915_CACHE(WB, COH1W), \
>>>                [2] = _I915_CACHE(WB, COH1W | L3), \ // 2way??
>>>                [3] = I915_CACHE(WT), \
>>>         }
>>> #define GEN12_CACHE_MODES \
>>>        .cache_modes = { \
>>>                [0] = _I915_CACHE(WB, COH1W), \
>>>                [1] = I915_CACHE(WC), \  
>>>                [2] = I915_CACHE(WT), \
>>>                [3] = I915_CACHE(UC), \
>>>         }
>>> #define MTL_CACHE_MODES \
>>>        .cache_modes = { \
>>>                [0] = _I915_CACHE(WB, COH1W), \
>
> This was a brain fart, should have just been WB.
>
>>>                [1] = I915_CACHE(WT), \
>>>                [2] = I915_CACHE(UC), \
>>>                [3] = _I915_CACHE(WB, COH1W), \
>>>                [4] = _I915_CACHE(WB, COH2W), \
>>> And made i915->pat_wc look up _I915_CACHE(WB, COH1W) would that work?
>>> Hmm and also all "has cache level" call sites would need to look
>>> not just for WB but WB+COH1W.
>>>
>>> Would it work? Too ugly?
>> 
>> I don't think this would work. The cache_modes needs to be aligned
>> with BSpec, otherwise checks for
>> INTEL_INFO(i915)->cache_modes[obj->pat_index] might become invalid.
>> Also, COH1W/2W were not even there for platforms prior to MTL.
>
> Not sure what would become invalid?

What if we want to check, for a particular pat_index, whether it means
cached or uncached, or whether it's 1-way coherent or not? If cache_modes[]
is misaligned with BSpec, then such a check would fail.

> COH1W/2W are perhaps names associated
> with MTL - but is Gen12 PAT 0 identical in behaviour to PAT 3 or PAT 4 on
> MTL? If yes then we can introduce an i915 specific name for that coherency
> mode and apply it to both platforms.
>
>> I still think setting INTEL_INFO(i915)->pat_uc/wt/wb is the best solution.
>> With that we can also eliminate the use of I915_CACHE({UC|WT|WB}).
>
> How for the call sites which are asking about caching mode characteristics?
> We can't ask if something has PAT index X from the source code since that is
> platform dependent.

We can compare the pat index directly for an exact match. Even when we just
want to distinguish cached from uncached, we can check the bit field of
INTEL_INFO(i915)->cache_modes[obj->pat_index].
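
As a rough sketch of such a bit field check (the encoding below is invented
for illustration and is not the patch's actual i915_cache_t layout; the MTL
table follows the entries quoted above, with index 0 as plain WB):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: low 3 bits hold the mode, higher bits hold flags. */
typedef uint16_t fake_cache_t;

enum { FAKE_MODE_INVALID = 0, FAKE_MODE_UC, FAKE_MODE_WB, FAKE_MODE_WT };

#define FAKE_MODE_MASK	0x7u
#define FAKE_COH1W	(1u << 3)
#define FAKE_COH2W	(1u << 4)
#define FAKE_NUM_PAT	16	/* 4 PAT index bits in the PTE */

/* Table indexed by PAT index; unused entries stay FAKE_MODE_INVALID. */
static const fake_cache_t fake_mtl_cache_modes[FAKE_NUM_PAT] = {
	[0] = FAKE_MODE_WB,
	[1] = FAKE_MODE_WT,
	[2] = FAKE_MODE_UC,
	[3] = FAKE_MODE_WB | FAKE_COH1W,
	[4] = FAKE_MODE_WB | FAKE_COH2W,
};

static fake_cache_t fake_lookup(unsigned int pat_index)
{
	return pat_index < FAKE_NUM_PAT ?
	       fake_mtl_cache_modes[pat_index] : FAKE_MODE_INVALID;
}

/* True if the PAT index maps to an uncached mode. */
static bool fake_pat_is_uc(unsigned int pat_index)
{
	return (fake_lookup(pat_index) & FAKE_MODE_MASK) == FAKE_MODE_UC;
}

/* True if the PAT index maps to write-back, regardless of coherency flags. */
static bool fake_pat_is_wb(unsigned int pat_index)
{
	return (fake_lookup(pat_index) & FAKE_MODE_MASK) == FAKE_MODE_WB;
}

/* True if the PAT index carries the 1-way coherency flag. */
static bool fake_pat_is_coh1w(unsigned int pat_index)
{
	return (fake_lookup(pat_index) & FAKE_COH1W) != 0;
}
```

This only works if the table matches BSpec index for index, which is the
alignment concern raised above.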

>>>>> +357,7 @@ int i915_gem_set_caching_ioctl(struct drm_device *dev, void *data,
>>>>>          switch (args->caching) {
>>>>>          case I915_CACHING_NONE:
>>>>> -               level = I915_CACHE_NONE;
>>>>> +               cache = I915_CACHE(UC);
>>>>
>>>> For code like this, my thought was 
>>>> -               level = I915_CACHE_NONE;
>>>> +               pat_index = INTEL_INFO(i915)->pat_uc;
>>>> And later the set_cache call should take pat_index as argument instead
>>>> of cache mode.
>>>>
>>>>>                  break;
>>>>>          case I915_CACHING_CACHED:
>>>>>                  /*
>>>>> @@ -367,10 +369,10 @@ int i915_gem_set_caching_ioctl(struct drm_device *dev, void *data,
>>>>>                 if (!HAS_LLC(i915) && !HAS_SNOOP(i915))
>>>>>                     return -ENODEV;  
>>>>> -               level = I915_CACHE_LLC;
>>>>> +               cache = I915_CACHE(WB);
>>>>
>>>> -               level = I915_CACHE_LLC;
>>>> +               pat_index = INTEL_INFO(i915)->pat_wb;
>>>> This should take care of the WB + 1-WAY COH issue for MTL mentioned above,
>>>> assuming the i915_cache_init() set pat_wb properly, and the
>>>> i915_gem_object_set_cache() consumes pat_index instead of cache mode.
>>>
>>> That works too yes.
>>>
>>>>
>>>>>                  break;
>>>>>          case I915_CACHING_DISPLAY:
>>>>> -               level = HAS_WT(i915) ? I915_CACHE_WT : I915_CACHE_NONE;
>>>>> +               cache = HAS_WT(i915) ? I915_CACHE(WT) : I915_CACHE(UC);
>>>>>                 break;
>>>>>          default:
>>>>>                 return -EINVAL;
>> 
>> [...]
>> 
>>>>>
>>>>> bool i915_gem_object_can_bypass_llc(struct drm_i915_gem_object *obj)
>>>>> @@ -215,6 +222,7 @@ bool i915_gem_object_can_bypass_llc(struct drm_i915_gem_object *obj)
>>>>>          /*
>>>>>           * Always flush cache for UMD objects at creation time.
>>>>>           */
>>>>> +       /* QQQ/FIXME why? avoidable performance penalty? */
>> 
>> This is needed because UMD's assume zero-initialized BO's are really
>> zero'ed out before getting the handles to the BO's (See VLK-46522).
>> Otherwise UMD's could read stale data, and thus cause security issues.
>
> Hah this comes exactly to my point from above. So it looks like my proposal
> would exactly solve this. Because i915 would know the caching mode and
> know to flush if not coherent. And it would be better than flushing for
> every obj->pat_set_by_user because that approach pessimistically flushes
> even when it is not needed.

Hmm, this is only called at BO creation time. We do need to clflush all
objects with pat_set_by_user, otherwise the user would get access to stale
data.

-Fei

> Regards,
>
> Tvrtko
>
>> 
>>>>>          if (obj->pat_set_by_user)
>>>>>              return true;
>>>>>
>> 
>> [...]
>> 
>>>>> diff --git a/drivers/gpu/drm/i915/intel_device_info.h b/drivers/gpu/drm/i915/intel_device_info.h
>>>>> index dbfe6443457b..f48a21895a85 100644
>>>>> --- a/drivers/gpu/drm/i915/intel_device_info.h
>>>>> +++ b/drivers/gpu/drm/i915/intel_device_info.h
>>>>> @@ -27,6 +27,8 @@
>>>>>
>>>>>  #include <uapi/drm/i915_drm.h>
>>>>>
>>>>> +#include "i915_cache.h"
>>>>> +
>>>>>  #include "intel_step.h"
>>>>>
>>>>>  #include "gt/intel_engine_types.h"
>>>>> @@ -243,8 +245,8 @@ struct intel_device_info {
>>>>>           */
>>>>>          const struct intel_runtime_info __runtime;
>>>>> -        u32 cachelevel_to_pat[I915_MAX_CACHE_LEVEL];
>>>>> -        u32 max_pat_index;
>>>>> +        i915_cache_t cache_modes[9];
>>>> I was commenting on the array size here. It's probably better to make
>>>> it 16 because there are 4 PAT index bits defined in the PTE. Indices
>>>> above max_pat_index are not used, but theoretically new mode could be
>>>> added. Well, it's up to you, not likely to happen anyway.
>>>
>>> Ah okay. I am not too concerned. Compiler will let us know if it happens.
>>>
>>> Unrelated to this comment - what about i915_gem_object_can_bypass_llc?
>>> Could we do better (less pessimistic) with something like my approach and
>>> so maybe penalize MTL less?
>> 
>> The problem is that, for the BO's managed by UMD's, the KMD doesn't
>> know whether they are going to be mapped as cacheable or uncacheable
>> on the CPU side. The PAT index controls GPU access only. That's why we
>> make sure all BO's with PAT set by UMD (which means UMD will take
>> control of managing the coherency) are clflush'ed.
>> 
>> -Fei
>> 
>>> Regards,
>>>
>>> Tvrtko