Re: [Intel-gfx] [PATCH 1/2] drm/i915/gt: Add GT oriented dmesg output

John Harrison <john.c.harrison@xxxxxxxxx> · Wed, 9 Nov 2022 09:46:18 -0800

On 11/9/2022 03:05, Tvrtko Ursulin wrote:
On 08/11/2022 20:15, John Harrison wrote:
On 11/8/2022 01:01, Tvrtko Ursulin wrote:
On 07/11/2022 19:14, John Harrison wrote:
On 11/7/2022 08:17, Tvrtko Ursulin wrote:
On 07/11/2022 09:33, Tvrtko Ursulin wrote:
On 05/11/2022 01:03, Ceraolo Spurio, Daniele wrote:
On 11/4/2022 10:25 AM, John.C.Harrison@xxxxxxxxx wrote:
From: John Harrison <John.C.Harrison@xxxxxxxxx>

When trying to analyse bug reports from CI, customers, etc. it 
can be
difficult to work out exactly what is happening on which GT in a
multi-GT system. So add GT oriented debug/error message 
wrappers. If
used instead of the drm_ equivalents, you get the same output 
but with
a GT# prefix on it.

Signed-off-by: John Harrison <John.C.Harrison@xxxxxxxxx>

The only downside to this is that we'll print "GT0: " even on 
single-GT devices. We could introduce a gt->info.name and print 
that, so we could have it different per-platform, but IMO it's 
not worth the effort.

Reviewed-by: Daniele Ceraolo Spurio 
<daniele.ceraolospurio@xxxxxxxxx>

I think it might be worth getting an ack from one of the 
maintainers to make sure we're all aligned on transitioning to 
these new logging macro for gt code.

Idea is I think a very good one. First I would suggest 
standardising to lowercase GT in logs because:

$ grep "GT%" i915/ -r
$ grep "gt%" i915/ -r
i915/gt/intel_gt_sysfs.c: gt->i915->sysfs_gt, "gt%d", gt->info.id))
i915/gt/intel_gt_sysfs.c:                "failed to initialize 
gt%d sysfs root\n", gt->info.id);
i915/gt/intel_gt_sysfs_pm.c:                     "failed to 
create gt%u RC6 sysfs files (%pe)\n",
i915/gt/intel_gt_sysfs_pm.c: "failed to create gt%u RC6p sysfs 
files (%pe)\n",
i915/gt/intel_gt_sysfs_pm.c:                     "failed to 
create gt%u RPS sysfs files (%pe)",
i915/gt/intel_gt_sysfs_pm.c:                     "failed to 
create gt%u punit_req_freq_mhz sysfs (%pe)",
i915/gt/intel_gt_sysfs_pm.c: "failed to create gt%u throttle 
sysfs files (%pe)",
i915/gt/intel_gt_sysfs_pm.c: "failed to create gt%u 
media_perf_power_attrs sysfs (%pe)\n",
i915/gt/intel_gt_sysfs_pm.c:                     "failed to add 
gt%u rps defaults (%pe)\n",
i915/i915_driver.c: drm_err(&gt->i915->drm, "gt%d: 
intel_pcode_init failed %d\n", id, ret);
i915/i915_hwmon.c: snprintf(ddat_gt->name, sizeof(ddat_gt->name), 
"i915_gt%u", i);

Just because there are 11 existing instances of one form doesn't 
mean that the 275 instances that are waiting to be converted should 
be done incorrectly. GT is an acronym and should be capitalised.

Okay just make it consistent then.

Besides:
grep -r "GT " i915 | grep '"'
i915/vlv_suspend.c:             drm_err(&i915->drm, "timeout 
disabling GT waking\n");
i915/vlv_suspend.c:                     "timeout waiting for GT 
wells to go %s\n",
i915/vlv_suspend.c:     drm_dbg(&i915->drm, "GT register access 
while GT waking disabled\n");
i915/i915_gpu_error.c:  err_printf(m, "GT awake: %s\n", 
str_yes_no(gt->awake));
i915/i915_debugfs.c:    seq_printf(m, "GT awake? %s [%d], %llums\n",
i915/selftests/i915_gem_evict.c: pr_err("Failed to idle GT (on 
%s)", engine->name);
i915/intel_uncore.c:                  "GT thread status wait timed 
out\n");
i915/gt/uc/selftest_guc_multi_lrc.c: drm_err(&gt->i915->drm, "GT 
failed to idle: %d\n", ret);
i915/gt/uc/selftest_guc.c: drm_err(&gt->i915->drm, "GT failed to 
idle: %d\n", ret);
i915/gt/uc/selftest_guc.c: drm_err(&gt->i915->drm, "GT failed to 
idle: %d\n", ret);
i915/gt/intel_gt_mcr.c: * Some GT registers are designed as 
"multicast" or "replicated" registers:
i915/gt/selftest_rps.c:                 pr_info("%s: rps counted %d 
C0 cycles [%lldns] in %lldns [%d cycles], using GT clock frequency 
of %uKHz\n",
i915/gt/selftest_hangcheck.c:                   pr_err("[%s] GT is 
wedged!\n", engine->name);
i915/gt/selftest_hangcheck.c:           pr_err("GT is wedged!\n");
i915/gt/intel_gt_clock_utils.c:                 "GT clock frequency 
changed, was %uHz, now %uHz!\n",
i915/gt/selftest_engine_pm.c:           pr_err("Unable to flush GT 
pm before test\n");
i915/gt/selftest_engine_pm.c: pr_err("GT failed to idle\n");
i915/i915_sysfs.c:                       "failed to register GT 
sysfs directory\n");
i915/intel_uncore.h:     * of the basic non-engine GT registers 
(referred to as "GSI" on
i915/intel_uncore.h:     * newer platforms, or "GT block" on older 
platforms)?  If so, we'll

Then there is a question of naming. Are we okay with GT_XXX or, 
do we want intel_gt_, or something completely different. I don't 
have a strong opinion at the moment so I'll add some more folks 
to Cc.

You mean GT_ERR("msg") vs intel_gt_err("msg")? Personally, I would 
prefer just gt_err("msg") to keep it as close to the official drm_* 
versions as possible. Print lines tend to be excessively long 
already. Taking a 'gt' parameter instead of a '&gt->i915->drm' 
parameter does help with that but it seems like calling the wrapper 
intel_gt_* is shooting ourselves in the foot on that one. And 
GT_ERR vs gt_err just comes down to the fact that it is a macro 
wrapper and therefore is required to be in upper case.

There was a maintainer level mini-discussion on this topic which I 
will try to summarise.

Main contention point was the maintenance cost and generally an 
undesirable pattern of needing to add many 
subsystem/component/directory specific macros. Which then 
typically need extra flavours and so on. But over verbosity of the 
How many versions are you expecting to add? Beyond the tile 
instance, what further addressing requirements are there? The card 
instance is already printed as part of the PCI address. The only 
other reason to add per component wrappers would be to wrap the 
mechanism for getting from some random per component object back to 
the intel_gt structure. But that is hardware a new issue being 
added by this wrapper. It is also not a requirement. Much of the 
code has a gt pointer already. For the parts that don't, some of it 
would be a trivial engine->gt type dereference, some of it is a 
more complex container_of type construction. But for those, the 
given file will already have multiple instances of that already 
(usually as the first or second line of the function - 'intel_gt 
*gt = fancy_access_method(my_obj)' so adding one or two more of 
those as necessary is not making the code harder to read.

code is obviously also bad, so one compromise idea was to add a 
macro which builds the GT string and use drm logging helpers 
directly. This would be something like:

 drm_err(GT_LOG("something went wrong ret=%d\n", gt), ret);
 drm_info(GT_LOG(...same...));
Seriously? As above, some of these lines are already way too long, 
this version makes them even longer with no obvious benefit. Worse, 
it makes it harder to read what is going on. It is much less 
intuitive to read than just replacing the drm_err itself. And 
having two sets of parenthesis with some parameters inside the 
first and some only inside the second is really horrid! Also, 
putting the 'gt' parameter in the middle just confuses it with the 
rest of the printf arguments even though there is no %d in the 
string for it. So now a quick glances tells you that your code is 
wrong because you have three format specifiers but four parameters.

Whereas, just replacing drm_err with gt_err (or GT_ERR or 
intel_gt_err) keeps everything else consistent. The first parameter 
changes from 'drm' to 'gt' but is still the master object parameter 
and it matches the function/macro prefix so inherently looks 
correct. Then you have your message plus parameters. No confusing 
orders, no confusing parenthesis, no excessive macro levels, no 
confusion at all. Just nice simple, easy to read, easy to maintain 
code.

I am personally okay with gt_err/GT_ERR some other folks might 
object though. And I can also understand the argument why it is 
better to not have to define gt_err, gt_warn, gt_info, gt_notice, 
gt_debug, gt_err_ratelimited, gt_warn_once.. and instead have only 
one macro.
A small set of trivial macro definitions vs a complicated and 
unreadable construct on every single print? Erm, isn't that the very 
definition of abstracting to helpers as generally required by every 
code review ever?

And what 'other folks might object'? People already CC'd? People 
outside of i915?

Because of that I was passing on to you the compromise option.

It maybe still has net space savings since we wouldn't have to be 
repeating the gt->i915->drm whatever and gt->info.id on every line.

You are free to try the most compact one and see how hard those 
objections will be.
Um. I already did. This patch. And you are the only person to have 
objected in any manner at all.

Where I have objected?
Only in everything you have written. Or are you saying that actually I 
could have just taken the r-b from Daniele and merged it while 
completely ignoring everything you have said because you didn't say 
'NACK' at the top of your email?

I was a) asking to convert all gt/ within one kernel release, b) 
transferring the maintainer discussion from IRC to this email chain to 
outlay one alternative, for which I said I could see the pros and cons 
of both, and c) raised the naming question early since that can 
usually become a churn point later on when we have large scale code 
transformations.

As said, FWIW you have my ack for GT_XXX naming and approach, but 
please do convert the whole of gt/ so we don't ship with a mish-mash 
of log messages.

That sounds like a quite a lot of objections to me - don't do it that 
way, do it this way; fine do it that way but expect lots of complaints 
when you do; oh all right, do it that way but re-write the entire driver 
all at once.

I really don't get the 'we must not ship with a mish-mash of log 
messages' argument. It's hardly like the current total mish-mash of 
DRM_DEBUG, DRM_DEBUG_DRIVER, pr_dbg, drm_dbg, ... is even remotely 
consistent. It's also not like the patch is massively changing the 
format of the output messages. So some prints get a useful extra three 
characters in advance of others. Does anyone care? Would anyone except 
us even notice?

Is it really a problem to merge this patch now to get the process 
started? And other sub-components get updated as and when people get the 
time to do them? You could maybe even help rather than posting 
completely conflicting patch sets that basically duplicate all the 
effort for no actual benefit.

John.

Regards,

Tvrtko