[RFC 0/8] Introduce framework to forward multi context OA snapshots

sourab.gupta@xxxxxxxxx · Wed, 5 Aug 2015 11:22:49 +0530

From: Sourab Gupta <sourab.gupta@xxxxxxxxx>

This is the updated patch series (v3; changes are listed at the end), which
adds support for capturing OA counter snapshots for multiple contexts by
inserting MI_REPORT_PERF_COUNT commands into the CS and forwarding these
snapshots to userspace through the perf interface.

This work is based on Robert Bragg's perf event framework, the patches for
which were floated earlier:

http://lists.freedesktop.org/archives/intel-gfx/2015-May/066102.html

Robert's perf event framework enabled capture of periodic OA counter snapshots
by configuring the OA unit during perf event init. The raw OA reports generated
by HW are then forwarded to userspace using the perf APIs.

However, there are usecases which need more than the periodic OA capture
functionality currently supported by perf_event. A few such usecases are:
    - Ability to capture system wide metrics. It should be possible to map
      the captured reports back to individual contexts.
    - Ability to inject tags for work into the reports. This provides
      visibility into the multiple stages of work within a single context.

The framework proposed here may also be seen as a way to overcome a limitation
of Haswell, which doesn't write out a context ID with OA reports. Handling
this in the kernel makes sense when we plan for compatibility with Broadwell,
which does include the context ID in reports.

This can be achieved by inserting commands into the ring, before and after
the batchbuffer, to dump the OA counter snapshots. The reports generated can
have an additional footer appended for capturing metadata such as ctx id,
pid, tags, etc. The specific issue of counter wraparound due to large
batchbuffers can be worked around by using these reports in conjunction with
periodic OA snapshots. Such per-BB data can give useful information to
userspace tools to analyze performance and timing at batchbuffer level.
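
To illustrate why combining per-BB reports with periodic snapshots helps with
wraparound, here is a minimal userspace sketch. It assumes 32-bit raw counter
values and at most one wrap between consecutive snapshots; the function names
are hypothetical and not part of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Delta between two raw 32-bit counter reads (assumed width).
 * Unsigned arithmetic is modulo 2^32, so the result is correct as long
 * as the counter wrapped at most once between the two reads.
 */
static uint32_t oa_counter_delta(uint32_t start, uint32_t end)
{
	return (uint32_t)(end - start);
}

/*
 * Accumulate a 64-bit total over a time-ordered series of snapshots
 * (begin-of-BB report, periodic reports in between, end-of-BB report).
 * The periodic reports keep each interval short enough that the
 * single-wrap assumption holds even for a very long batchbuffer.
 */
static uint64_t oa_counter_total(const uint32_t *snapshots, size_t n)
{
	uint64_t total = 0;
	size_t i;

	assert(snapshots != NULL || n == 0);
	for (i = 1; i < n; i++)
		total += oa_counter_delta(snapshots[i - 1], snapshots[i]);
	return total;
}
```

With only the begin and end reports, a counter that wraps more than once
during the batchbuffer would yield a delta that is too small; summing over the
intermediate periodic snapshots avoids that.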

An application intending to profile its own contexts can do so by submitting
the MI_REPORT_PERF_COUNT commands into the CS from userspace itself.
But consider the usecase of a system wide GPU profiler tool which needs the
data for all the workloads being scheduled on the GPU globally. Doing this in
the kernel is significantly less complex than supporting such a usecase
through userspace.

This framework is intended to serve the requirements of such system wide
GPU profilers, which may further utilize this data for usecases such as
performance analysis (at a global level), identifying optimization scenarios
for improving GPU utilization, CPU vs GPU timing analysis, etc. Again, this is
made possible by the presence of metadata with individual reports, which is
enabled by this framework.
One such system wide GPU profiler is the MVP (Modular Video Profiler) tool,
used by the media team for profiling media workloads.

The current implementation approach is to forward these samples through the
same PERF_SAMPLE_RAW sample type as is done for periodic samples, with an
additional footer appended for metadata. Userspace can then distinguish these
samples by filtering on the basis of sample size.
One of the other approaches being contemplated right now is creating separate
sample types to handle these different kinds of samples. There would be
different fds associated with these different sample types, though they can be
part of one event group. Userspace can listen to either or both of these
sample types by specifying event attributes during event init.
But right now, I see this work as a future refinement, contingent on acceptance
of the general framework. For now, I'm looking for feedback on these initial
patches w.r.t. the usage of perf APIs and the interaction with i915.
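
As a rough illustration of the size-based filtering mentioned above, a
userspace consumer might classify raw samples along these lines. This is only
a sketch: the enum, the function name and the idea of passing the report and
footer sizes as parameters are assumptions for illustration, and a real tool
would also have to account for any padding present in the raw sample size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a PERF_SAMPLE_RAW payload. */
enum oa_sample_kind {
	OA_SAMPLE_PERIODIC,	/* bare OA report */
	OA_SAMPLE_CS,		/* OA report followed by a metadata footer */
	OA_SAMPLE_UNKNOWN,
};

/*
 * report_size: size of one raw OA report for the configured format.
 * footer_size: size of the packed metadata requested at event init
 *              (ctx id, pid, tag, ...); both are supplied by the caller.
 */
static enum oa_sample_kind
oa_classify_sample(size_t raw_size, size_t report_size, size_t footer_size)
{
	if (raw_size == report_size)
		return OA_SAMPLE_PERIODIC;
	if (footer_size > 0 && raw_size == report_size + footer_size)
		return OA_SAMPLE_CS;
	return OA_SAMPLE_UNKNOWN;
}
```

Separate sample types, as contemplated above, would remove the need for this
kind of size comparison in userspace.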

Another feature introduced in these patches is execbuffer tagging. It is a
mechanism whereby the reports collected are marked with a tag passed by
userspace during the execbuffer call. This way the userspace tool can associate
the reports collected with the corresponding execbuffers. This satisfies the
requirement to have visibility into multiple stages (i.e. execbuffers) lying
within a single context.
For example, in the media pipeline, the CodecHAL encoding stage has a single
context and involves multiple stages such as Scaling, ME, MBEnc and PAK, for
which there are separate execbuffer calls. The reports generated need to have
the granularity of these multiple stages of a context. The presence of a
tag in report metadata fulfills this requirement.

One of the prerequisites for this work is the presence of a globally unique id
associated with each context. The present context id is specific to a drm fd.
As such, it can't be used on its own to associate the reports generated with
the corresponding context scheduled from userspace in a global way.
In the absence of a globally unique context id, other metadata such as pid/tags
in conjunction with the ctx id may be used to associate reports with the
corresponding contexts.

The first patch in the series introduces a global context id, and the
subsequent patches introduce the multi-context OA capture mode and the
mechanism to forward these snapshots using perf.

This patch set currently supports Haswell. Gen8+ support can be added when
the basic framework is agreed upon.

v2: This patch series has the following changes wrt the one floated earlier:
    - Removing synchronous waits during event stop/destroy
    - segregating the book-keeping data for the samples from destination buffer
      and collecting it into a separate list
    - managing the lifetime of destination buffer with the help of gem active
      reference tracking
    - having the scope of i915 device mutex limited to places of gem interaction
      and having the pmu data structures protected with a per pmu lock
    - userspace can now control the metadata it wants by requesting the same
      during event init. The sample is sent with the requested metadata in a
      packed format.
    - Some patches merged together and a few more introduced

v3: Changes made:
    - Global id for ctx allocated from a separate cyclic idr, and not trying to
      overload existing drm fd specific ctx id for this purpose.
    - Meeting semantics for flush (ensuring samples are flushed before
      returning)
    - spin_lock used in place of spin_lock_irqsave
    - execbuffer tag now uses upper 32 bits of rsvd1 field.
    - Some code restructuring/optimization, better nomenclature and error
      handling.
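
For reference, the rsvd1 layout implied by the v3 change above could be
sketched as follows from the userspace side. The existing uapi keeps the
context id in the lower 32 bits of rsvd1; the macro and helper names below are
hypothetical and only illustrate the packing, they are not the names used in
the patches.

```c
#include <stdint.h>

/* Lower 32 bits of rsvd1: per-fd context id (as in the existing uapi). */
#define EXAMPLE_CTX_ID_MASK	0xffffffffull
/* Upper 32 bits of rsvd1: execbuffer tag (hypothetical macro name). */
#define EXAMPLE_TAG_SHIFT	32

/* Build an rsvd1 value carrying both the context id and a tag. */
static uint64_t rsvd1_pack(uint32_t ctx_id, uint32_t tag)
{
	return ((uint64_t)tag << EXAMPLE_TAG_SHIFT) | (uint64_t)ctx_id;
}

/* Extract the context id from the lower 32 bits. */
static uint32_t rsvd1_ctx_id(uint64_t rsvd1)
{
	return (uint32_t)(rsvd1 & EXAMPLE_CTX_ID_MASK);
}

/* Extract the execbuffer tag from the upper 32 bits. */
static uint32_t rsvd1_tag(uint64_t rsvd1)
{
	return (uint32_t)(rsvd1 >> EXAMPLE_TAG_SHIFT);
}
```

A media stack could, for instance, assign one tag value per stage (Scaling,
ME, MBEnc, PAK) and pack it with the shared context id before each execbuffer
call.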

Sourab Gupta (8):
  drm/i915: Introduce global id for contexts
  drm/i915: Introduce mode for capture of multi ctx OA reports
    synchronized with RCS
  drm/i915: Add mechanism for forwarding CS based OA counter snapshots
    through perf
  drm/i915: Forward periodic and CS based OA reports sorted acc to
    timestamps
  drm/i915: Handle event stop and destroy for commands in flight
  drm/i915: Insert commands for capture of OA counters in the ring
  drm/i915: Add support for having pid output with OA report
  drm/i915: Add support to add execbuffer tags to OA counter reports

 drivers/gpu/drm/i915/i915_drv.h            |  53 ++-
 drivers/gpu/drm/i915/i915_gem_context.c    |  19 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   6 +
 drivers/gpu/drm/i915/i915_oa_perf.c        | 604 +++++++++++++++++++++++++----
 include/uapi/drm/i915_drm.h                |  25 +-
 5 files changed, 635 insertions(+), 72 deletions(-)

-- 
1.8.5.1
