Re: [PATCH 1/1] drm/i915/guc: Fix GuC error capture sizing estimation and reporting

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10/3/2022 11:28, Teres Alexis, Alan Previn wrote:
On Fri, 2022-09-30 at 15:35 -0700, Harrison, John C wrote:
On 9/30/2022 14:08, Teres Alexis, Alan Previn wrote:
I disagree because its unlikely that all engines can reset all at once (we probably have bigger problems at the at
point) and if they all occurred within the same G2H handler scheduled worker run, our current gpu_coredump framework
would just discard the ones after the first one and so it wouldnt even matter if we did catch it.
So min_size is not actually the minimal size for a meaningful capture?
So what is? And remember that for compute class engines, there is
dependent engine reset. So a reset of CCS2 also means a reset of RCS,
CCS0, CCS1 and CCS3. So having at least four engines per capture is not
unreasonable.

Alan: min_size is a meaningful size for the worst case scenario of collecting the guc-report. However due to gpu-core-
dump, its not meaningful in terms of reporting that information out to the user. Thats not a limitation of this
subsystem.
I'm not following what you mean about gpu-core-dump. The point of the warning is to let the user know that the interface they want to use (error capture via sysfs) might not be complete in the majority of cases. If there is some catastrophic worst case scenario that overflows but 99% of instances are fine then that's what the drm_notice about 'spare_size' is for. But if the 99% of instances of a capture are going to overflow then that is what the drm_warn about 'min_size' should be for.

It seems pointless to go through a lot of effort to calculate the
minimum and recommend sizes only to basically ignore them by just
whispering very, very quietly that there might be a problem.

Alan: I already responded that i will re-rev as per your recommendation and switch back to drm_notice.

It also
seems pointless to complain about a minimum size that actually isn't the
minimum size. That's sort of worse - now you are telling the user there
is a problem when really there isn't.

Alan: the min size is accurate - but whether we report to the use about them possibly facing a problem is where it gets
a bit unclear because of the limitation in the gpu-core-dump framework. We could drop the message completely if you like
- but then we'd have to remember to re-add it if we fix gpu-core-dump in future. For now, as i mentioned in the last
What limitation in the 'gpu-core-dump framework'? What is the fix required?

reply, i have already changed it back to "notice" as per your last comment. Perhaps you should have looked at rev2 which
was posted before your responses above. As mentioned in last reply, i disagreed but i am committing to your request
which was fixed in rev2 as per your request.
The point of a code review is not "I say X so you must do X, immediately post a new revision now". I asked some questions. I stated my opinion about what the end user should see. If you think your implementation already matches that or you disagree because you think I am basing my comments on incorrect information, or even just that you disagree with my reasoning, then you should not blindly post a new revision saying "here are your changes, I don't like it because Y but just r-b it and I'll merge". You should reply to the comments with your thoughts and suggestions. Maybe the reviewer is wrong!

IMHO, the min_size check should be meaningful and should be visible to
the user if it fails.

Also, are we still hitting the minimum size failure message? Now that
the calculation has been fixed, what sizes does it come up with for min
and spare? Are they within the allocation now or not?

Yes - we are within the allocation with this patch (the fix of removing the redundant register-struct-size
multiplication brought the number down significantly).
Bringing in comment from other thread:
> Some context: with the calculation fix we are allocating 4MB but we only need 78k as min-est.

Wow! That is quite a big smaller. And much more plausible! So if min_size is 78KB, what is spare_size now?

Sounds like we can drop it down to a 1MB allocation. And even if min_size is not the absolute minimum but quite a bit more worst case kind of size, it still seems reasonable to keep it as a warning rather than a notice. Given that it is so much smaller than the allocation size, if it does somehow overflow on some platform/configuration then that sounds like something we should know about because it likely means something is broken.


John.
So how would you like to proceed? Could we reply on rev2 btw?
I would like to answer the questions/concerns before jumping in to writing/re-writing code.

Why split the email thread? The discussion is already happening here. There is no point splitting a single discussion across multiple patch sets just because a new patch has been posted if that patch does not actually change the discussion.

John.




[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux