Re: [PATCH 02/21] drm/i915/tgl: Fix macros for TGL SOC based WA

Aditya Swarup <aditya.swarup@xxxxxxxxx> · Tue, 24 Nov 2020 16:48:02 -0800

On 11/24/20 12:11 PM, Lucas De Marchi wrote:
> +Matt Roper, see question in item (3) below
> 
> On Tue, Nov 24, 2020 at 04:20:40PM +0200, Jani Nikula wrote:
>> On Tue, 24 Nov 2020, Lucas De Marchi <lucas.demarchi@xxxxxxxxx> wrote:
>>> On Mon, Nov 23, 2020 at 05:32:22PM -0800, Aditya Swarup wrote:
>>>> On 11/18/20 1:18 AM, Jani Nikula wrote:
>>>>> On Tue, 17 Nov 2020, Lucas De Marchi <lucas.demarchi@xxxxxxxxx> wrote:
>>>>>> On Tue, Nov 17, 2020 at 10:50:10AM -0800, Aditya Swarup wrote:
>>>>>>> @@ -1579,9 +1579,9 @@ static inline const struct i915_rev_steppings *
>>>>>>> tgl_revids_get(struct drm_i915_private *dev_priv)
>>>>>>> {
>>>>>>>     if (IS_TGL_U(dev_priv) || IS_TGL_Y(dev_priv))
>>>>>>> -        return tgl_uy_revids;
>>>>>>> +        return tgl_uy_revids + INTEL_REVID(dev_priv);
>>>>>>
>>>>>> oohh, no. You have to at least check you are not accessing out of
>>>>>> bounds. New HW running on old kernel should not access create invalid
>>>>>> accesses like this.
>>>>>
>>>>> And this is just one reason why exposing arrays directly as an interface
>>>>> to the rest of the driver is a bad idea. Basically I look at *all*
>>>>> externs in the driver with suspicion, and they're all exceptions that
>>>>> should not be repeated. The revid arrays are a direct invitation to keep
>>>>> adding more and more extern arrays. And more ways to go out of bounds.
>>>>
>>>> We definitely need an array table for the SOC -> Display, GT stepping mapping.
>>>
>>> the mapping could be very well in the define iff you don't have
>>> different mappings per sku as is the case with TGL. Example:
>>>
>>> #define ADLS_REVID_A0        0
>>> #define ADLS_REVID_A1        5
>>>
>>> #define ADLS_DISP_REVID_A0    0
>>> #define ADLS_DISP_REVID_B0    5
>>>
>>> The actual value is actually the *SoC* revid, regardless the name of the
>>> macro. Since we already have to use a different macro -
>>> IS_DISP_REVID() - I don't think this is much worse and would allow us to
>>> get rid of the table *for ADL-S*, at the expense of having to pass as
>>> argument the ADLS_DISP_REVID_*.  However this doesn't apply to TGL as TGL
>>> has a different mapping per sku.
>>>
>>>
>>>> SOC steppings were usually the same as display steppings/GT steppings until TGL and therefore
>>>> didn't require special mapping cases. But from TGL onwards, we have different combinations of
>>>> Disp and GT steppings per SOC stepping. Alderlake-S makes this direct mapping even more difficult
>>>> without the array requiring more macros to deal with SOC -> DISP/GT stepping differences.
>>>>
>>>> Will fix the array bound checks but the possibility of SOC revision id from drm struct going
>>>> out of bounds is minimal. Can only happen if we don't have support for latest SOC -> Disp/GT table
>>>
>>> this is very common. It's just a matter of trying to run a slightly old
>>> kernel in a slightly newer rev of the hardware.
>>
>> Indeed. All kernels released with the arrays are simply bust for any new
>> hardware revisions. They'll need a minimal Cc: stable fix.
>>
>> Here's something I drafted [1] to fix the situation more
>> generally. There are still some issues to overcome, though they exist
>> already in the current code.
>>
>> This could be followed up with converting *all* platforms to the scheme,
>> making it universal, regardless of whether the revids in the hardware
>> are consecutive or not.
>>
>> BR,
>> Jani.
>>
>>
>> [1] https://cgit.freedesktop.org/~jani/drm/log/?h=revid-stepping-scheme
> 
> That is looking good.  Some feedback I can give before this series being
> sent for review:

I like this approach as well and we were discussing this in the ADLS rev ID thread. With the tables it 
makes it simpler to manage rather than worrying about individual macros. Jani do you want me to rebase ADLS
changes on top of your patches and resubmit your patches for review or you will be submitting this series yourself?

> 
> 1) You need to call the init function from somewhere
> 2) For the FIXMEs:
> 
> +    /*
> +     * FIXME: We should be able to take into account new revids not
> +     * recognized by this kernel version.
> +     */
> 
> +    /*
> +     * FIXME: We should be able to handle gaps in revid arrays gracefully,
> +     * and in a way that works sensibly for the range checks. This is true
> +     * for the existing revid range checks; it's fine if a new id pops up in
> +     * the middle.
> +     *
> +     * It's okay for the display stepping to be zero, though in an array all
> +     * or none should be set to non-zero, not a mix.
> +     */
> 
> Maybe consider that gt_stepping will never be 0 and in the case it is (or
> size > ARRAY_SIZE), just backtrack to use the first one we find with
> gt_stepping != 0?  then we probably should add a warning that we are not
> actually using the correct one, but it's the best we can do.
> 
> 3) REVID_BXT_B_LAST

I couldn't spot REVID_NONE as well in the patches.

> 
> what is that? The only thing that comes to mind is for "matching all B
> steps". Matt Roper had a patch to change the way we interpret the WA
> ranges so the bounds are [lower, upper) rather than [lower, upper].
> Matt, any problem you faced with that patch? It makes  more sense
> because we know the stepping in which it's fixed, but we may have
> additional revids before that
> 
> But I don't see any trace of REVID_BXT_B_LAST in the tree, so not sure
> what's this about.
> 
> 4)
> 
> Lastly, I'd still like the simple fix for TGL without all the noise and
> without the refactor.  It's the simplest fix we can do for the 5.10
> timeframe.

I just submitted the fix.

Aditya 

> 
> 
> Lucas De Marchi
> 
>>
>>
>>
>>
>>>
>>>> for TGL from Bspec and if we are picking up wrong revision id from drm struct that means the platform
>>>> information obtained itself is wrong which will be a general platform problem unrelated to Gfx driver.
>>>
>>> Nothing else should really be a problem. We don't really use the revid
>>> much, mostly for WAs. And if other parts of the kernel are trying to use
>>> the SoC revid, then they are reading that info themselves, not using
>>> something we read.
>>>
>>> We are simply reading the revid from hardware and using that value
>>> without checking and that needs to change.
>>>
>>>
>>>>
>>>>>
>>>>> I'd rather we seek for ways to either nuke the revid arrays altogether,
>>>>> or encapsulate them within a .c file with static scope.
>>>>
>>>> I don't think we should nuke the revid arrays but I agree with finding a more appropriate place to
>>>> parse the gt/display stepping info. This should be an exercise for a later patch that takes
>>>> care of kbl,tgl and adl-s mappings.
>>>>
>>>>>
>>>>> And for that .c file... the arrays are now in gt/intel_workarounds.c
>>>>> which is a really weird place for stuff that's used for generic stepping
>>>>> info, and particularly for *display* stepping info.
>>>>
>>>> I agree and we can change the approach with a different patch later.
>>>>
>>>>>
>>>>> BR,
>>>>> Jani.
>>>>>
>>>>>
>>>>
>>> _______________________________________________
>>> Intel-gfx mailing list
>>> Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>
>> -- 
>> Jani Nikula, Intel Open Source Graphics Center

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx