Re: [PATCH 1/1] drm/i915/guc: Fix GuC error capture sizing estimation and reporting

"Teres Alexis, Alan Previn" <alan.previn.teres.alexis@xxxxxxxxx> · Mon, 3 Oct 2022 21:10:38 +0000

On Mon, 2022-10-03 at 12:47 -0700, Harrison, John C wrote:
> On 10/3/2022 11:28, Teres Alexis, Alan Previn wrote:
> > On Fri, 2022-09-30 at 15:35 -0700, Harrison, John C wrote:
> > > On 9/30/2022 14:08, Teres Alexis, Alan Previn wrote:
> > > > I disagree because its unlikely that all engines can reset all at once (we probably have bigger problems at the at
> > > > point) and if they all occurred within the same G2H handler scheduled worker run, our current gpu_coredump framework
> > > > would just discard the ones after the first one and so it wouldnt even matter if we did catch it.
> > > So min_size is not actually the minimal size for a meaningful capture?
> > > So what is? And remember that for compute class engines, there is
> > > dependent engine reset. So a reset of CCS2 also means a reset of RCS,
> > > CCS0, CCS1 and CCS3. So having at least four engines per capture is not
> > > unreasonable.
> > > 
> > Alan: min_size is a meaningful size for the worst case scenario of collecting the guc-report. However due to gpu-core-
> > dump, its not meaningful in terms of reporting that information out to the user. Thats not a limitation of this
> > subsystem.
> I'm not following what you mean about gpu-core-dump. The point of the 
> warning is to let the user know that the interface they want to use 
> (error capture via sysfs) might not be complete in the majority of 
> cases. If there is some catastrophic worst case scenario that overflows 
> but 99% of instances are fine then that's what the drm_notice about 
> 'spare_size' is for. But if the 99% of instances of a capture are going 
> to overflow then that is what the drm_warn about 'min_size' should be for.
> 
> > > It seems pointless to go through a lot of effort to calculate the
> > > minimum and recommend sizes only to basically ignore them by just
> > > whispering very, very quietly that there might be a problem.
> > > 
> > Alan: I already responded that i will re-rev as per your recommendation and switch back to drm_notice.
> > 
> > > It also
> > > seems pointless to complain about a minimum size that actually isn't the
> > > minimum size. That's sort of worse - now you are telling the user there
> > > is a problem when really there isn't.
> > > 
> > Alan: the min size is accurate - but whether we report to the use about them possibly facing a problem is where it gets
> > a bit unclear because of the limitation in the gpu-core-dump framework. We could drop the message completely if you like
> > - but then we'd have to remember to re-add it if we fix gpu-core-dump in future. For now, as i mentioned in the last
> What limitation in the 'gpu-core-dump framework'? What is the fix required?
> 
Unless it has changed since i last looked at it, gpu_coredump framework will store the first error-state captured and
wait until the user extracts it via sysfs. Any other error-state capture that gets created before that prior one was
cleared by the user gets dropped. I'm not sure if that limitation needs to be "fixed". I am not sure of its history -
maybe it was designed that way for a reason. ATM I dont have the bandwidth to chase that history down.

> > reply, i have already changed it back to "notice" as per your last comment. Perhaps you should have looked at rev2 which
> > was posted before your responses above. As mentioned in last reply, i disagreed but i am committing to your request
> > which was fixed in rev2 as per your request.
> The point of a code review is not "I say X so you must do X, immediately 
> post a new revision now". I asked some questions. I stated my opinion 
> about what the end user should see. If you think your implementation 
> already matches that or you disagree because you think I am basing my 
> comments on incorrect information, or even just that you disagree with 
> my reasoning, then you should not blindly post a new revision saying 
> "here are your changes, I don't like it because Y but just r-b it and 
> I'll merge". You should reply to the comments with your thoughts and 
> suggestions. Maybe the reviewer is wrong!
> 
I think this comes down to preference as there is no clear rule about what is right vs what is wrong in this case:

	- we have a possible issue but its a corner case when all engine instances suffer a reset all at once.
	- in the event of that corner case occurring, the end-user could go back and increase the guc-log-capture-
region size and thus the warning will go away and GuC will have space to report the error-state dump for all engine
instances if they got reset all at once (within the same G2H IRQ event).
	- assuming the issue is reproducible, the user, now happy that the warning is resolved, reruns the workload.
The
user dumps the first error-state-capture via sysfs which only contains info for the first engine that was reset (as it
appeared in the G2H-CTB). the user inspects the same sysfs and none of the other dmesg-reported engine-hang error-state-
dumps seem to be made available in the sysfs after clearing that first one.

so the preference here becomes "should we warn the user about this possible event of running out of guc-error-state
capture region when resolving it still doesn't allow the user to get it anyway"... OR "should we just make a quiet debug
message about it since an i915 developer knows that the gpu_coredump framework design can only ever store 1 error-state-
capture-dump and throws the rest away until its cleared by the user via sysfs".

My preference is the latter. However, just to be clear, with the fixed calculation, we'll probably never see the
warning(or-debug) message with the HW that we have today and the foreseeable future.

> > > IMHO, the min_size check should be meaningful and should be visible to
> > > the user if it fails.
> > > 
> > > Also, are we still hitting the minimum size failure message? Now that
> > > the calculation has been fixed, what sizes does it come up with for min
> > > and spare? Are they within the allocation now or not?
> > > 
> > Yes - we are within the allocation with this patch (the fix of removing the redundant register-struct-size
> > multiplication brought the number down significantly).
> Bringing in comment from other thread:
>  > Some context: with the calculation fix we are allocating 4MB but we 
> only need 78k as min-est.
> 
> Wow! That is quite a big smaller. And much more plausible! So if 
> min_size is 78KB, what is spare_size now?
> 
Around 230K. (the min_est was exactly 78504 bytes or 76.664Kb - that was based on a DG2 - i think it was a 384 - cant
recall for sure)

> Sounds like we can drop it down to a 1MB allocation.
> 
Okay - will do a 3rd rev with a 1MB change.

> And even if 
> min_size is not the absolute minimum but quite a bit more worst case 
> kind of size, it still seems reasonable to keep it as a warning rather 
> than a notice. Given that it is so much smaller than the allocation 
> size, if it does somehow overflow on some platform/configuration then 
> that sounds like something we should know about because it likely means 
> something is broken.

Okay - this works fine too since we probably will likely never trigger that message until we have significantly more
engine instances than todays hw.
>