Re: [PATCH 1/4] drm/i915/gt: Remove platform comments from workarounds

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Mon, 16 Jan 2023 13:01:05 +0000

On 12/01/2023 21:42, Lucas De Marchi wrote:
On Thu, Jan 05, 2023 at 01:35:52PM +0000, Tvrtko Ursulin wrote:

Okay to sum it up below with some final notes..

On 04/01/2023 19:34, Matt Roper wrote:
On Wed, Jan 04, 2023 at 09:58:13AM +0000, Tvrtko Ursulin wrote:

On 23/12/2022 18:28, Lucas De Marchi wrote:
On Fri, Dec 23, 2022 at 09:02:35AM +0000, Tvrtko Ursulin wrote:

On 22/12/2022 15:55, Lucas De Marchi wrote:
On Thu, Dec 22, 2022 at 10:27:00AM +0000, Tvrtko Ursulin wrote:

On 22/12/2022 08:25, Lucas De Marchi wrote:
The comments are redundant to the checks being done to apply the
workarounds and very often get outdated as workarounds need to be
extended to new platforms or steppings.  Remove them altogether 
with
the following matches (platforms extracted from 
intel_workarounds.c):

    find drivers/gpu/drm/i915/gt/ -name '*.c' | xargs sed -i -E \
's/(Wa.*):(bdw|chv|bxt|glk|skl|kbl|cfl|cfl|whl|cml|aml|chv|cl|bw|ctg|elk|ilk|snb|dg|pvc|g4x|ilk|gen|glk|kbl|cml|glk|kbl|cml|hsw|icl|ehl|ivb|hsw|ivb|vlv|kbl|pvc|rkl|dg|adl|skl|skl|bxt|blk|cfl|cnl|glk|snb|tgl|vlv|xehpsdv).*/\1/'
    find drivers/gpu/drm/i915/gt/ -name '*.c' | xargs sed -i -E \
's/(Wa.*):(bdw|chv|bxt|glk|skl|kbl|cfl|cfl|whl|cml|aml|chv|cl|bw|ctg|elk|ilk|snb|dg|pvc|g4x|ilk|gen|glk|kbl|cml|glk|kbl|cml|hsw|icl|ehl|ivb|hsw|ivb|vlv|kbl|pvc|rkl|dg|adl|skl|skl|bxt|blk|cfl|cnl|glk|snb|tgl|vlv|xehpsdv).*\*\//\1

Same things was executed in the gem directory, omitted
here for brevity.

There were a few false positives that included the workaround
description. Those were manually patched.

sed -E 's/(Wa[a-zA-Z0-9_]+)[:,]([a-zA-Z0-9,-_\+\[]{2,})/\1/'

then there are false negatives. We have Was in the form
"Wa_xxx:tgl,dg2, mtl". False positives we can fixup, false negatives
we simply don't see. After running that in gt/:

$ git grep ": mtl" -- drivers/gpu/drm/i915/
drivers/gpu/drm/i915/gt/intel_gt_pm.c:  /* Wa_14017073508: mtl */
drivers/gpu/drm/i915/gt/intel_gt_pm.c:  /* Wa_14017073508: mtl */
drivers/gpu/drm/i915/gt/intel_gt_pm.c:  /* Wa_14017073508: mtl */
drivers/gpu/drm/i915/gt/intel_gt_pm.c:  /* Wa_14017073508: mtl */
drivers/gpu/drm/i915/gt/uc/intel_guc_rc.c:       * 
Wa_14017073508: mtl
drivers/gpu/drm/i915/i915_reg.h:/* Wa_14017210380: mtl */

I was going with the platform names to avoid the false
negatives and because I was entertaining the idea of only doing 
this for
latest platforms where we do have the "Wa_[[:number:]]" form

Maybe..

Matt recently said he has this worked planned, but more
importantly - I gather then that the WA lookup tool
definitely does not output these strings?

Whatever it does it's true only at the time it's called. It
simply tells what
are the platforms and steppings the Wa applies to. We can change the
output to whatever we want, but that is not the point.
Those comments get stale and bring no real value as they match 1:1
what the code is supposed to be doing. Several times a patch has to
update just that comment to "extend a workaround" to a next 
platform.
This is not always done, so we get a comment that doesn't match 
what is
supposed to be there.

Tl;dr; version - lets park this until January and discuss once
everyone is back.

I'll leave my comment here since I will be out until mid January.

Longer version. I've been trying to get us talking about this a
couple times before and I'd really like to close with an explicit
consensus, discussion points addressed instead of skipped and just
moving ahead with patches.

References:
 3fcf71b9-337f-6186-7b00-27cbfd116743@xxxxxxxxxxxxxxx
 Y5j0b/bykbitCa4Q@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

So point I wanted to discuss is whether these comments are truly
useless or maybe they can help during review. If the tool can
actually output them then I am leaning towards that they can be.

I consider "can the tool output xyz?" asking the wrong question.
"The tool", which is our own small python script querying a 
database can
output anything like that if we want to. The database has 
information of
what are the platforms/steppings for each the WA is known to be 
applied
*today*. And that can change and do change often, particularly for 
early
steppings and recent platforms.

Thought is, when a patch comes for review adding a new platform,
stepping, whatever, to an existing if condition, if it contains the
comments reviewer can more easily spot a hyphotetical logic
inversion error or similar. It is also trivial to check that both
condition and comment have been updated. (So lets not be rash and
remove something maybe useful just because it can go stale *only if*
reviewers are not giving sufficient attention that changes are made
in tandem.)

I can argue to the other side too. We don't have comments in the 
kernel
like

     /* Add 1 to i */
     i += 1;

This is exactly what these comments are doing. And they are misleading

I'll file this under "Reductio ad absurdum", kind of. :)

which actually proves my point?

I don't think so, but lets move on.

and may introduce bugs rather than helping reviewing:

     Wa_12345:tgl[a0,c0)
     if (IS_TGL_GRAPHICS_STEP(STEP_A0, STEP_B0)

One might read the comment, skipping over the condition and thinking
"ok, we already extended this WA to B* steppings, which doesn't match
the code.

That would be reviewer error to assume B0 is the last B stepping, 
without
actually checking. Equally as reviewer error would be to assume any WA
adding patch is adding the correct conditions, again, without actually
checking. Which leads me to ...

From a slightly different angle - do we expect anyone reviewing
workaround patches will cross-check against the tool? Would it be
simpler and more efficient that they could just cross-check against
the comment output from the tool and put into the patch by the
author?

I think this is the source of the confusion; the comment lines in i915
are not something the 'wa' tool outputs directly; the comments are
written manually by the developer at the same time the code is written;
the wa tool is just a quick python script I wrote one day to dump
database information from the command line and avoid needing to fire up
a web browser and click through a series of slow website links.  Also,
since the wa tool queries internal databases, it spits out a bunch of
Intel-internal terminology that doesn't match the terminology used in
i915 code, and it also includes a bunch of garbage data that needs to be
filtered out (duplicated records, mangled/incomplete records, etc.).
Exactly how things are expressed (platform name like "DG2" or IP name
like "Xe_HPG" also varies from platform to platform according to how the
hardware guys decided to categorize things).

Right, I was going with what AFAIR Lucas wrote earlier in the thread, 
to paraprhase "tool can be made to output anything we want". Maybe 
wasn't confusion but misleading would be more obvious. :)

I stand by what I said.... the tool is just something that queries a
database and spits data in a better format. It can print anything you
want. If someone wants to print the exact code and comment to be added
in i915, it could be turned into that with a fair amount of look up
tables to translate platform names/steppings/etc, adapt to terminology
changes as they come, etc. The fact that would make someone's life 
miserable
and they would probably quit their job is implicit here. Next time I 
will be
explicit about "possible" vs having a good ROI.
>
Since the code and the comments are both something written by hand by
the developer, there's no reason to believe the comments will be more
accurate than the code.  They'll likely be far less accurate since they
can't be tested like the code is, and because the existing codebase is
wildly inconsistent with how we even format/represent them.

see above. Someone cross-checking the comment is cross-checking the
wrong thing. As I said, it happens more on early enabling of a 
platform.

... my point which seems to have been missed by both, well question 
really,
do you expect every reviewer to cross-check against the WA database 
when
reviewing WA changes? I don't see that was answered.

I guarantee that it won't happen and people will rubber stamp. So my
argument was that we could make it both easier for reviewers *and* 
decrease
the likelyhood of misses, if we kept platforms designators in comments.

Yes, reviewers are absolutely supposed to be checking the stepping
bounds of every workarounds change they review.  That's one of the most
important parts of reviewing a workaround and it should be very quick
and easy to do.  If people are giving rubber stamps without doing that,
then they're not really doing a full review.  But I'm also not aware of
any cases where we're getting rubber stamps; most of the non-comment
review misses we have seem to either come from misunderstanding the
semantics of platforms (especially cases like DG2 where different
G10/G11/G12 variants have different stepping schemes) or technical
misunderstanding of the implementation details (register reset
characteristics, masked vs non-masked registers, context membership,
etc.).

Yeah it is much easier to rip them out that to find and fix the ones 
which
went out of sync but that shouldn't be high on the list of criteria.

Argument that it is easy to overlook during review that comments and 
code do
not match I don't think holds. That describes a very sloppy review. 
And if
review is assumed to be that sloppy, do you really trust review to 
check
against the WA database?

It's the same reason people who write prose can't find their own
spelling/grammer mistakes.  The mistakes are "obvious," but since their
brain already knows what it's "supposed" to say, they just can't see the
error themselves.  Once you've reviewed the code, it just gets really
hard to see where the comment doesn't align, especially for the
workarounds that apply to a bunch of platforms.

To be pedantic here - that's a writer's handicap - shouldn't be 
reviewers. But anyway, carrying on.

it's also a reviewer's handicap. At some point, as illustrated by the
example below, the comment is harder to review than the code itself and
the reviewer is already biased by their review of the code itself.

I find this:

/* Wa_12345:tgl,dg1[a0],rkl,adls,dg2_g10,dg2_g12[a0..c0) */

Much easier to mentally parse that this:

     if (IS_TIGERLAKE(i915) || IS_DG1(i915) || IS_ROCKETLAKE(i915) ||
         IS_ALDERLAKE_P(i915) || IS_DG2_G10(i915) ||
         IS_DG2_GRAPHICS_STEP(i915, G12, STEP_A0, STEP_C0))

Let's call it a difference then.

And I really think writer's handicap cannot be used to discharge 
responsibility from reviewers becuase then you kind of nerf the whole 
concept of proof readers and reviewers. Yes mistakes can happen, but 
it's those people trained mindset to minimize that. Not to say of lets 
not bother becuase writers handicap.. Anyway, irrelevant, moving on still..

For example, if I'm reviewing a patch that adds:

    /* Wa_12345:tgl,dg1[a0],rkl,adls,dg2_g10,dg2_g12[a0..c0) */
    if (IS_TIGERLAKE(i915) || IS_DG1(i915) || IS_ROCKETLAKE(i915) ||
        IS_ALDERLAKE_P(i915) || IS_DG2_G10(i915) ||
        IS_DG2_GRAPHICS_STEP(i915, G12, STEP_A0, STEP_C0))

I'm always looking at the code first and comparing that to the
workaround database.  If I then review the comment second, I'm much less
likely to catch the subtle errors (there are two of them in this example
where the code and comment don't match!) because I mentally already know
what the bounds are "supposed" to be and the comment all just kind of
blends together.

So my argument is that it is trivial for reviewers to spot comments 
and code
do not match. Trivial and fast. And it's trivial (I hope) for the WA 
tool to
output the right format for pasting in comments.

Given how much terminology mismatch there is between the internal
database and how we refer to things in i915 code, it's not trivial.
It's also not super easy to even figure out which platforms to include
in the list.  The workaround database is going to include platforms for
which there is no i915 support (e.g., LKF) stuff like CNL (support
already removed from i915), and future/potential platforms we can't talk
about yet, etc.  Finally, when there are duplicated/conflicting records
(because the people inputting the information are just human too), it
takes a bit of human intelligence to read between the lines and figure
out what the reality is supposed to be.

Sure, these problems could probably all be solved with enough effort,
but given how often the internal formatting and behavior of the database
itself changes, it would be painful to keep it always working properly.

Those are the points I would like to have explicitly discounted before
proceeding. Maybe to be even clearer the workflow would be like this:

Patch author:

1. Runs the WA tool for a WA number. Tool outputs text.
2. Pastes text verbatim in the comment.

This one isn't correct today (and as noted above, not terribly easy to
accomplish).  It's

   2.  Developer manually writes code comment based on interpreting wa's
       output.

Given everything above lets say we concluded it is too costly to make 
the tool output the exact format wanted by i915 and we decide to 
proceed removing the platform designations. Two final questions.

agreed

Question 1)

Is everyone okay to remove from _all_, including old legacy platforms? 
ROI vs churn considerations? And yes it wouldn't be ideal either to 
have separate rules with some cutover point like Gen8 maybe.

maybe we could land first a patch changing all the Wa_<number> ones and
later if we find the lack of consistency is bad we extend to all of
them?

At this point, since that plan would affects the whole driver, I think 
you need to get acks from other maintainers. My view is that on balance, 
every potential option has pros and cons, and my main thought is that 
focus should be on any solution which minimizes the amount of missed or 
misplaces workarounds per platform. Be it during initial enablement or 
code base maintenance.

Question 2)

For Xe you still intend to have it a manual process and not have the 
tool output the macro section which could be directly pasted in the 
code? Would ROI of extending the tool be any better with the 
data-driven design like there?

I don't have any plans yet in that regard as the source for all the
workarounds in xe actually came from i915, not from the database.
We may play with the tool in future to output something close to the
macros.

Cool - so you'd agree the thought about someone quitting their job in 
this scenario was really a hyperbole, yes? :)

Regards,

Tvrtko