On 9/30/2022 14:08, Teres Alexis, Alan Previn wrote:
> I disagree because it's unlikely that all engines can reset all at once (we probably have bigger problems at that
> point), and if they all occurred within the same G2H handler scheduled worker run, our current gpu_coredump framework
> would just discard the ones after the first one, and so it wouldn't even matter if we did catch it.
So min_size is not actually the minimal size for a meaningful capture?
So what is? And remember that for compute class engines, there is
dependent engine reset. So a reset of CCS2 also means a reset of RCS,
CCS0, CCS1 and CCS3. So having at least four engines per capture is not
unreasonable.
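Just to put a rough shape on that (purely an illustrative sketch, not the driver's actual code - every name and number below is made up for the example):

#include <stddef.h>

/*
 * Illustrative only: a meaningful minimum has to cover the largest
 * dependent-reset group, because a single CCS reset drags RCS and the
 * other CCS engines into the same capture. The engine count, per-engine
 * size and header size here are placeholder values, not driver values.
 */
#define WORST_CASE_ENGINES_PER_RESET	4		/* at least, per the dependent-reset point above */
#define PER_ENGINE_CAPTURE_BYTES	(16 * 1024)	/* assumed size of one engine's register dump */
#define CAPTURE_HEADER_BYTES		(4 * 1024)	/* assumed per-capture header/list overhead */

static inline size_t capture_min_size_example(void)
{
	return CAPTURE_HEADER_BYTES +
	       WORST_CASE_ENGINES_PER_RESET * PER_ENGINE_CAPTURE_BYTES;
}

Anything smaller than that can't hold even one worst-case capture, which is what I would call the real minimum.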
It seems pointless to go through a lot of effort to calculate the
minimum and recommended sizes only to basically ignore them by just
whispering very, very quietly that there might be a problem. It also
seems pointless to complain about a minimum size that actually isn't the
minimum size. That's sort of worse - now you are telling the user there
is a problem when really there isn't.
IMHO, the min_size check should be meaningful and should be visible to
the user if it fails.
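Something along these lines is what I have in mind (again just a sketch - drm_warn()/drm_dbg() are the real DRM logging helpers, but the function, parameters and messages are made up for illustration):

#include <drm/drm_print.h>
#include "i915_drv.h"

/*
 * Sketch of the policy: failing the true minimum is user-visible,
 * while only falling short of the larger recommended/spare size stays
 * at debug level. Function and parameter names are illustrative.
 */
static void check_capture_buf_size_example(struct drm_i915_private *i915,
					   size_t alloc, size_t min_size,
					   size_t spare_size)
{
	if (alloc < min_size)
		drm_warn(&i915->drm,
			 "GuC capture buffer (%zu bytes) below minimum for a meaningful capture (%zu bytes)\n",
			 alloc, min_size);
	else if (alloc < spare_size)
		drm_dbg(&i915->drm,
			"GuC capture buffer (%zu bytes) below recommended size (%zu bytes)\n",
			alloc, spare_size);
}

That way CI catches a real failure, and an end user only ever sees the warning if something is genuinely wrong.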
Also, are we still hitting the minimum size failure message? Now that
the calculation has been fixed, what sizes does it come up with for min
and spare? Are they within the allocation now or not?
John.
> But I'll go ahead and re-rev this.
> ...alan
> On Fri, 2022-09-30 at 10:48 -0700, Harrison, John C wrote:
> > Isn't min_size the bare minimum to get a valid capture? Surely this
> > still needs to be a warning, not a debug. If we can't manage a basic
> > working error capture then there is a problem. This needs to be caught
> > by CI and logged as a bug if it is ever hit. And that means an end user
> > should never see it fire because we won't let a driver out the door
> > unless the default buffer size is sufficient.