Re: [PATCH 1/1] drm/i915/guc: Fix GuC error capture sizing estimation and reporting

On 10/3/2022 14:10, Teres Alexis, Alan Previn wrote:
On Mon, 2022-10-03 at 12:47 -0700, Harrison, John C wrote:
On 10/3/2022 11:28, Teres Alexis, Alan Previn wrote:
On Fri, 2022-09-30 at 15:35 -0700, Harrison, John C wrote:
On 9/30/2022 14:08, Teres Alexis, Alan Previn wrote:
I disagree because it's unlikely that all engines can reset all at once (we probably have bigger problems at that
point), and if they all occurred within the same G2H handler scheduled worker run, our current gpu_coredump framework
would just discard the ones after the first one, so it wouldn't even matter if we did catch it.
So min_size is not actually the minimal size for a meaningful capture?
So what is? And remember that for compute class engines, there is
dependent engine reset. So a reset of CCS2 also means a reset of RCS,
CCS0, CCS1 and CCS3. So having at least four engines per capture is not
unreasonable.
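To make that arithmetic concrete, here is a rough sketch (purely illustrative -- the per-engine byte cost and engine names are invented, not taken from the real i915 code) of why a per-capture estimate has to cover the whole dependent-reset group rather than a single engine:

```python
# Illustrative only: the per-engine cost is a made-up number, not from i915.
PER_ENGINE_CAPTURE_BYTES = 4096  # hypothetical cost of one engine's register dump

# A reset of any CCS engine also resets RCS and the other CCS instances,
# so the worst-case dependent group is RCS + CCS0..CCS3 (five engines).
DEPENDENT_GROUP = ["rcs0", "ccs0", "ccs1", "ccs2", "ccs3"]

def min_capture_size(group):
    """Smallest buffer that can hold one capture of a full dependent group."""
    return len(group) * PER_ENGINE_CAPTURE_BYTES

print(min_capture_size(DEPENDENT_GROUP))  # 5 engines * 4 KiB = 20480 bytes
```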

Alan: min_size is a meaningful size for the worst-case scenario of collecting the guc-report. However, due to gpu-core-
dump, it's not meaningful in terms of reporting that information out to the user. That's not a limitation of this
subsystem.
I'm not following what you mean about gpu-core-dump. The point of the
warning is to let the user know that the interface they want to use
(error capture via sysfs) might not be complete in the majority of
cases. If there is some catastrophic worst case scenario that overflows
but 99% of instances are fine then that's what the drm_notice about
'spare_size' is for. But if the 99% of instances of a capture are going
to overflow then that is what the drm_warn about 'min_size' should be for.

It seems pointless to go through a lot of effort to calculate the
minimum and recommended sizes only to basically ignore them by just
whispering very, very quietly that there might be a problem.

Alan: I already responded that I will re-rev as per your recommendation and switch back to drm_notice.

It also
seems pointless to complain about a minimum size that actually isn't the
minimum size. That's sort of worse - now you are telling the user there
is a problem when really there isn't.

Alan: the min size is accurate - but whether we report to the user about them possibly facing a problem is where it gets
a bit unclear because of the limitation in the gpu-core-dump framework. We could drop the message completely if you like
- but then we'd have to remember to re-add it if we fix gpu-core-dump in future. For now, as I mentioned in the last
What limitation in the 'gpu-core-dump framework'? What is the fix required?

Unless it has changed since I last looked at it, the gpu_coredump framework will store the first error-state captured and
wait until the user extracts it via sysfs. Any other error-state capture that gets created before that prior one was
cleared by the user gets dropped. I'm not sure if that limitation needs to be "fixed". I am not sure of its history -
maybe it was designed that way for a reason. ATM I don't have the bandwidth to chase that history down.
That is by design. Generally speaking, you want to see the first error that occurred. If there were a bunch back to back then likely the first one is the real problem and the rest are either symptoms or repeat offences of the first.

Also, in the case where a user doesn't know what an error capture is but they are running (or developing) some app that keeps dying and produces core dumps, you can't just swallow all system memory in a linked list of error captures that won't ever be read out and freed up.

Hence the design is to save the first and hold on to it until read out. And if any further hangs happen before the current one is cleared, they are discarded as uninteresting if not plain old spam.
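The save-first / hold-until-read / discard-the-rest policy described above can be sketched like this (a toy model only, not the actual gpu_coredump code -- class and method names are invented):

```python
# Toy model of the behaviour described above: keep the first capture,
# drop everything else until userspace reads and clears it via sysfs.
class CoredumpStore:
    def __init__(self):
        self.current = None   # the one capture we hold on to
        self.dropped = 0      # later captures discarded as uninteresting/spam

    def report(self, capture):
        if self.current is None:
            self.current = capture   # first error wins
        else:
            self.dropped += 1        # prior capture not yet cleared: discard

    def read_and_clear(self):
        cap, self.current = self.current, None
        return cap

store = CoredumpStore()
store.report("hang on ccs2")
store.report("hang on rcs0")               # discarded: store already occupied
assert store.read_and_clear() == "hang on ccs2"
store.report("hang on rcs0")               # accepted now that the store is empty
assert store.dropped == 1
```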


reply, I have already changed it back to "notice" as per your last comment. Perhaps you should have looked at rev2, which
was posted before your responses above. As mentioned in the last reply, I disagreed but am committing to your request,
which was addressed in rev2.
The point of a code review is not "I say X so you must do X, immediately
post a new revision now". I asked some questions. I stated my opinion
about what the end user should see. If you think your implementation
already matches that or you disagree because you think I am basing my
comments on incorrect information, or even just that you disagree with
my reasoning, then you should not blindly post a new revision saying
"here are your changes, I don't like it because Y but just r-b it and
I'll merge". You should reply to the comments with your thoughts and
suggestions. Maybe the reviewer is wrong!

I think this comes down to preference as there is no clear rule about what is right vs what is wrong in this case:

	- we have a possible issue but it's a corner case when all engine instances suffer a reset all at once.
	- in the event of that corner case occurring, the end-user could go back and increase the guc-log-capture-
region size and thus the warning will go away and GuC will have space to report the error-state dump for all engine
instances if they got reset all at once (within the same G2H IRQ event).
	- assuming the issue is reproducible, the user, now happy that the warning is resolved, reruns the workload.
The user dumps the first error-state-capture via sysfs, which only contains info for the first engine that was reset (as
it appeared in the G2H-CTB). The user inspects the same sysfs and none of the other dmesg-reported engine-hang error-
state-dumps seem to be made available in the sysfs after clearing that first one.

So the preference here becomes "should we warn the user about this possible event of running out of guc-error-state
capture region when resolving it still doesn't allow the user to get it anyway"... OR "should we just make a quiet debug
message about it since an i915 developer knows that the gpu_coredump framework design can only ever store 1 error-state-
capture-dump and throws the rest away until it's cleared by the user via sysfs".

My preference is the latter. However, just to be clear, with the fixed calculation, we'll probably never see the
warning (or debug) message with the HW that we have today or in the foreseeable future.

The warning isn't meant to be for i915 developers who understand bizarre internal workings of the driver. The warning is meant to be for end users who might not be able to debug their applications.

Given that the error capture interface is fundamentally only really concerned with obtaining a single valid capture at a time, I would argue that supporting a dozen back-to-back captures is no more than a nice-to-have convenience for a very rare situation (such as the user having explicitly modified their kernel to save all the captures in a list). And that back-to-back scenario is the 'spare_size', not 'min_size', isn't it? So there is no need for a warning about that. A notice is fine but not a warning.

The thing that really, really, really matters to an end user is that the first capture of a session is complete and useful. So my view is that any warning message should be focused on that. Can we get one complete error capture for one complete set of engines? If not, that is something we should warn about. And that is what 'min_size' should be.

Note that 'one complete set of engines' really only means five on current hardware - RCS + CCS0-3 is the most that can reset at one time (due to dependent engine resets), given that GuC managed resets are context based and a context can only run on one engine. I think it is technically possible for more engines to be involved, but that would also count as a hard-to-hit corner case. And given the small size involved, it might be simplest just to base min_size on all engines anyway.
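The warn-vs-notice split being argued for above can be expressed as a small decision sketch (hypothetical numbers; only the tiered logic mirrors the discussion, none of this is the real i915 check):

```python
# Hypothetical sizes, chosen only to illustrate the tiered check.
MIN_SIZE = 5 * 4096        # one complete dependent engine set (RCS + CCS0-3)
SPARE_SIZE = 3 * MIN_SIZE  # extra headroom for rare back-to-back resets

def check_allocation(alloc_bytes):
    """Mirror the proposed policy: warn only when even one full capture
    cannot fit; a shortfall of back-to-back headroom is just a notice."""
    if alloc_bytes < MIN_SIZE:
        return "warn"    # first capture of a session may be incomplete
    if alloc_bytes < SPARE_SIZE:
        return "notice"  # common case fine; back-to-back resets may drop data
    return "ok"

print(check_allocation(10000))           # warn
print(check_allocation(30000))           # notice
print(check_allocation(4 * 1024 * 1024)) # ok
```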



IMHO, the min_size check should be meaningful and should be visible to
the user if it fails.

Also, are we still hitting the minimum size failure message? Now that
the calculation has been fixed, what sizes does it come up with for min
and spare? Are they within the allocation now or not?

Yes - we are within the allocation with this patch (the fix of removing the redundant register-struct-size
multiplication brought the number down significantly).
Bringing in comment from other thread:
  > Some context: with the calculation fix we are allocating 4MB but we
only need 78k as min-est.

Wow! That is quite a bit smaller. And much more plausible! So if
min_size is 78KB, what is spare_size now?

Around 230K. (The min_est was exactly 78504 bytes or 76.664KB - that was based on a DG2 - I think it was a 384 - can't
recall for sure.)
Might be worth checking on PVC. That could be getting close to 1MB with all the extra blitter engines.


Sounds like we can drop it down to a 1MB allocation.

Okay - will do a 3rd rev with a 1MB change.
Don't start writing anything yet. Let's get to a final decision on the plan first.


And even if
min_size is not the absolute minimum but quite a bit more worst case
kind of size, it still seems reasonable to keep it as a warning rather
than a notice. Given that it is so much smaller than the allocation
size, if it does somehow overflow on some platform/configuration then
that sounds like something we should know about because it likely means
something is broken.
Okay - this works fine too since we will likely never trigger that message until we have significantly more
engine instances than today's hw.
See above about PVC.

John.




