Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit

Daniel Vetter <daniel@xxxxxxxx> · Wed, 4 May 2016 15:51:48 +0200

On Wed, May 04, 2016 at 02:24:14PM +0100, Robert Bragg wrote:
> On Wed, May 4, 2016 at 1:24 PM, Daniel Vetter <daniel@xxxxxxxx> wrote:
> 
> > On Wed, May 04, 2016 at 10:49:53AM +0100, Robert Bragg wrote:
> > > On Wed, May 4, 2016 at 10:09 AM, Martin Peres <
> > martin.peres@xxxxxxxxxxxxxxx>
> > > wrote:
> > >
> > > > On 03/05/16 23:03, Robert Bragg wrote:
> > > >
> > > >>
> > > >>
> > > >> On Tue, May 3, 2016 at 8:34 PM, Robert Bragg <robert@xxxxxxxxxxxxx
> > > >> <mailto:robert@xxxxxxxxxxxxx>> wrote:
> > > >>
> > > >>     Sorry for the delay replying to this, I missed it.
> > > >>
> > > >>     On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <
> > martin.peres@xxxxxxx
> > > >>     <mailto:martin.peres@xxxxxxx>> wrote:
> > > >>
> > > >>         On 20/04/16 17:23, Robert Bragg wrote:
> > > >>
> > > >>             Gen graphics hardware can be set up to periodically write
> > > >>             snapshots of
> > > >>             performance counters into a circular buffer via its
> > > >> Observation
> > > >>             Architecture and this patch exposes that capability to
> > > >>             userspace via the
> > > >>             i915 perf interface.
> > > >>
> > > >>             Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx
> > > >>             <mailto:chris@xxxxxxxxxxxxxxxxxx>>
> > > >>             Signed-off-by: Robert Bragg <robert@xxxxxxxxxxxxx
> > > >>             <mailto:robert@xxxxxxxxxxxxx>>
> > > >>             Signed-off-by: Zhenyu Wang <zhenyuw@xxxxxxxxxxxxxxx
> > > >>             <mailto:zhenyuw@xxxxxxxxxxxxxxx>>
> > > >>
> > > >>             ---
> > > >>               drivers/gpu/drm/i915/i915_drv.h         |  56 +-
> > > >>               drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
> > > >>               drivers/gpu/drm/i915/i915_perf.c        | 940
> > > >>             +++++++++++++++++++++++++++++++-
> > > >>               drivers/gpu/drm/i915/i915_reg.h         | 338
> > ++++++++++++
> > > >>               include/uapi/drm/i915_drm.h             |  70 ++-
> > > >>               5 files changed, 1408 insertions(+), 20 deletions(-)
> > > >>
> > > >>             +
> > > >>             +
> > > >>             +       /* It takes a fairly long time for a new MUX
> > > >>             configuration to
> > > >>             +        * be be applied after these register writes. This
> > > >> delay
> > > >>             +        * duration was derived empirically based on the
> > > >>             render_basic
> > > >>             +        * config but hopefully it covers the maximum
> > > >>             configuration
> > > >>             +        * latency...
> > > >>             +        */
> > > >>             +       mdelay(100);
> > > >>
> > > >>
> > > >>         With such a HW and SW design, how can we ever expose hope to
> > get
> > > >> any
> > > >>         kind of performance when we are trying to monitor different
> > > >>         metrics on each
> > > >>         draw call? This may be acceptable for system monitoring, but
> > it
> > > >>         is problematic
> > > >>         for the GL extensions :s
> > > >>
> > > >>
> > > >>         Since it seems like we are going for a perf API, it means that
> > > >>         for every change
> > > >>         of metrics, we need to flush the commands, wait for the GPU to
> > > >>         be done, then
> > > >>         program the new set of metrics via an IOCTL, wait 100 ms, and
> > > >>         then we may
> > > >>         resume rendering ... until the next change. We are talking
> > about
> > > >>         a latency of
> > > >>         6-7 frames at 60 Hz here... this is non-negligeable...
> > > >>
> > > >>
> > > >>         I understand that we have a ton of counters and we may hide
> > > >>         latency by not
> > > >>         allowing using more than half of the counters for every draw
> > > >>         call or frame, but
> > > >>         even then, this 100ms delay is killing this approach
> > altogether.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> So revisiting this to double check how things fail with my latest
> > > >> driver/tests without the delay, I apparently can't reproduce test
> > > >> failures without the delay any more...
> > > >>
> > > >> I think the explanation is that since first adding the delay to the
> > > >> driver I also made the the driver a bit more careful to not forward
> > > >> spurious reports that look invalid due to a zeroed report id field,
> > and
> > > >> that mechanism keeps the unit tests happy, even though there are still
> > > >> some number of invalid reports generated if we don't wait.
> > > >>
> > > >> One problem with simply having no delay is that the driver prints an
> > > >> error if it sees an invalid reports so I get a lot of 'Skipping
> > > >> spurious, invalid OA report' dmesg spam. Also this was intended more
> > as
> > > >> a last resort mechanism, and I wouldn't feel too happy about squashing
> > > >> the error message and potentially sweeping other error cases under the
> > > >> carpet.
> > > >>
> > > >> Experimenting to see if the delay can at least be reduced, I brought
> > the
> > > >> delay up in millisecond increments and found that although I still
> > see a
> > > >> lot of spurious reports only waiting 1 or 5 milliseconds, at 10
> > > >> milliseconds its reduced quite a bit and at 15 milliseconds I don't
> > seem
> > > >> to have any errors.
> > > >>
> > > >> 15 milliseconds is still a long time, but at least not as long as 100.
> > > >>
> > > >
> > > > OK, so the issue does not come from the HW after all, great!
> > > >
> > >
> > > Erm, I'm not sure that's a conclusion we can make here...
> > >
> > > The upshot here was really just reducing the delay from 100ms to 15ms.
> > > Previously I arrived at a workable delay by jumping the delay in orders
> > of
> > > magnitude with 10ms failing, 100ms passing and I didn't try and refine it
> > > further. Here I've looked at delays between 10 and 100ms.
> > >
> > > The other thing is observing that because the kernel is checking for
> > > invalid reports (generated by the hardware) before forwarding to
> > userspace
> > > the lack of a delay no longer triggers i-g-t failures because the invalid
> > > data won't reach i-g-t any more - though the invalid reports are still a
> > > thing to avoid.
> > >
> > >
> > > > Now, my main question is, why are spurious events generated when
> > changing
> > > > the MUX's value? I can understand that we would need to ignore the
> > reading
> > > > that came right after the change, but other than this,  I am a bit at a
> > > > loss.
> > > >
> > >
> > > The MUX selects 16 signals that the OA unit can turn into 16 separate
> > > counters by basically counting the signal changes. (there's some fancy
> > > fixed function logic that can affect this but that's the general idea).
> > >
> > > If the MUX is in the middle of being re-programmed then some subset of
> > > those 16 signals are for who knows what.
> > >
> > > After programming the MUX we will go on to configure the OA unit and the
> > > tests will enable periodic sampling which (if we have no delay) will
> > sample
> > > the OA counters that are currently being fed by undefined signals.
> > >
> > > So as far as that goes it makes sense to me to expect bad data if we
> > don't
> > > wait for the MUX config to land properly. Something I don't really know
> > is
> > > how come we're seemingly lucky to have the reports be cleanly invalid
> > with
> > > a zero report-id, instead of just having junk data that would be harder
> > to
> > > recognise.
> >
> > Yeah this mdelay story sounds realy scary. Few random comments:
> > - msleep instead of mdelay please
> >
> 
> yup this was a mistake I'd forgotten to fix in this series, but is fixed in
> the last series I sent after chris noticed too.
> 
> actually in my latest (yesterday after experimenting further with the delay
> requirements) I'm using usleep_range for a delay between 15 and 20
> milliseconds which seems to be enough.
> 
> 
> > - no dmesg noise above debug level for stuff that we know can happen -
> >   dmesg noise counts as igt failures
> >
> 
> okey. I don't think I have anything above debug level, unless things are
> going badly wrong.
> 
> Just double checking though has made me think twice about a WARN_ON in
> gen7_oa_read for !oa_buffer_addr, which would be a bad situation but should
> either be removed (never expected), be a BUG_ON (since the code would deref
> NULL anyway) or more gracefully return an error if seen.

WARN_ON + bail out, or BUG_ON. Silently fixing up without failing loud in
dmesg is imo the wrong approach for something that should never happen.

> I currently have some DRM_DRIVER_DEBUG errors for cases where userspace
> messes up what properties it gives to open a stream - hopefully that sound
> ok? I've found it quite helpful to have a readable error for otherwise
> vague EINVAL type errors.

Yeah, as long as it's debug-only it's perfectly fine. We actually try to
have such a line for each EINVAL, since it's so useful (but then userspace
always hits the one case we've missed to document with debug output!).

> I have a debug message I print if we see an invalid HW report, which
> *shouldn't* happen but can happen (e.g. if the MUX delay or tail margin
> aren't well tuned) and it's helpful to have the feedback, in case we end up
> in a situation where we see this kind of message too frequently which might
> indicate an issue to investigate.

That's fine too.

> > - reading 0 sounds more like bad synchronization.
> 
> 
> I suppose I haven't thoroughly considered if we should return zero in any
> case  - normally that would imply EOF so we get to choose what that implies
> here. I don't think the driver should ever return 0 from read() currently.
> 
> A few concious choices re: read() return values have been:
> 
> - never ever return partial records (or rather even if a partial record
> were literally copied into the userspace buffer, but an error were hit in
> the middle of copying a full sample then that record wouldn't be accounted
> for in the byte count returned.)
> 
> - Don't throw away records successfully copied, due to a later error. This
> simplifies error handling paths internally and reporting
> EAGAIN/ENOSPC/EFAULT errors and means data isn't lost. The precedence for
> what we return is 1) did we successfully copy some reports? report bytes
> copied for complete records. 2) did we encounter an error? report that if
> so. 3) return -EAGAIN. (though for a blocking fd this will be handled
> internally).
> 
> 
> 
> > Have you tried quiescing
> 
> the entire gpu (to make sure nothing is happen) and disabling OA, then
> >   updating, and then restarting? At least on a very quick look I didn't
> >   spot that. Random delays freak me out a bit, but wouldn't be surprised
> >   if really needed.
> >
> 
> Experimenting yesterday, it seems I can at least reduce the delay to around
> 15ms (granted that's still pretty huge), and it's also workable to have
> userspace sleep for this time (despite the mdelay I originally went with)
> 
> Haven't tried this, but yeah could be interesting to experiment if the MUX
> config lands faster in different situation such as when the HW is idle.

In either case I think it'd be good to excessively document what you've
discovered. Maybe even split out the msleep into a separate patch, so that
the commit message with all the details is easy to find again in the
future. Because 2 months down the road someone will read this and go wtf
;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx