Re: [PATCH 1/1] drm/i915/guc: Fix GuC error capture sizing estimation and reporting

John Harrison <john.c.harrison@xxxxxxxxx> · Mon, 3 Oct 2022 12:47:19 -0700

On 10/3/2022 11:28, Teres Alexis, Alan Previn wrote:
On Fri, 2022-09-30 at 15:35 -0700, Harrison, John C wrote:
On 9/30/2022 14:08, Teres Alexis, Alan Previn wrote:
I disagree because its unlikely that all engines can reset all at once (we probably have bigger problems at the at
point) and if they all occurred within the same G2H handler scheduled worker run, our current gpu_coredump framework
would just discard the ones after the first one and so it wouldnt even matter if we did catch it.
So min_size is not actually the minimal size for a meaningful capture?
So what is? And remember that for compute class engines, there is
dependent engine reset. So a reset of CCS2 also means a reset of RCS,
CCS0, CCS1 and CCS3. So having at least four engines per capture is not
unreasonable.

Alan: min_size is a meaningful size for the worst case scenario of collecting the guc-report. However due to gpu-core-
dump, its not meaningful in terms of reporting that information out to the user. Thats not a limitation of this
subsystem.
I'm not following what you mean about gpu-core-dump. The point of the 
warning is to let the user know that the interface they want to use 
(error capture via sysfs) might not be complete in the majority of 
cases. If there is some catastrophic worst case scenario that overflows 
but 99% of instances are fine then that's what the drm_notice about 
'spare_size' is for. But if the 99% of instances of a capture are going 
to overflow then that is what the drm_warn about 'min_size' should be for.

It seems pointless to go through a lot of effort to calculate the
minimum and recommend sizes only to basically ignore them by just
whispering very, very quietly that there might be a problem.

Alan: I already responded that i will re-rev as per your recommendation and switch back to drm_notice.

It also
seems pointless to complain about a minimum size that actually isn't the
minimum size. That's sort of worse - now you are telling the user there
is a problem when really there isn't.

Alan: the min size is accurate - but whether we report to the use about them possibly facing a problem is where it gets
a bit unclear because of the limitation in the gpu-core-dump framework. We could drop the message completely if you like
- but then we'd have to remember to re-add it if we fix gpu-core-dump in future. For now, as i mentioned in the last
What limitation in the 'gpu-core-dump framework'? What is the fix required?

reply, i have already changed it back to "notice" as per your last comment. Perhaps you should have looked at rev2 which
was posted before your responses above. As mentioned in last reply, i disagreed but i am committing to your request
which was fixed in rev2 as per your request.
The point of a code review is not "I say X so you must do X, immediately 
post a new revision now". I asked some questions. I stated my opinion 
about what the end user should see. If you think your implementation 
already matches that or you disagree because you think I am basing my 
comments on incorrect information, or even just that you disagree with 
my reasoning, then you should not blindly post a new revision saying 
"here are your changes, I don't like it because Y but just r-b it and 
I'll merge". You should reply to the comments with your thoughts and 
suggestions. Maybe the reviewer is wrong!

IMHO, the min_size check should be meaningful and should be visible to
the user if it fails.

Also, are we still hitting the minimum size failure message? Now that
the calculation has been fixed, what sizes does it come up with for min
and spare? Are they within the allocation now or not?

Yes - we are within the allocation with this patch (the fix of removing the redundant register-struct-size
multiplication brought the number down significantly).
Bringing in comment from other thread:
> Some context: with the calculation fix we are allocating 4MB but we 
only need 78k as min-est.

Wow! That is quite a big smaller. And much more plausible! So if 
min_size is 78KB, what is spare_size now?

Sounds like we can drop it down to a 1MB allocation. And even if 
min_size is not the absolute minimum but quite a bit more worst case 
kind of size, it still seems reasonable to keep it as a warning rather 
than a notice. Given that it is so much smaller than the allocation 
size, if it does somehow overflow on some platform/configuration then 
that sounds like something we should know about because it likely means 
something is broken.

John.
So how would you like to proceed? Could we reply on rev2 btw?
I would like to answer the questions/concerns before jumping in to 
writing/re-writing code.

Why split the email thread? The discussion is already happening here. 
There is no point splitting a single discussion across multiple patch 
sets just because a new patch has been posted if that patch does not 
actually change the discussion.

John.