Re: [RFC 0/6] Non perf based Gen Graphics OA unit driver

Peter Zijlstra <peterz@xxxxxxxxxxxxx> · Fri, 16 Oct 2015 11:43:06 +0200

On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:
> - We're bridging two complex architectures
> 
>     To review this work I think it will be relevant to have a good
>     general familiarity with Gen graphics (e.g. thinking about the OA
>     unit's interaction with the command streamer and execlist
>     scheduling) as well as our userspace architecture and how we're
>     consuming OA data within Mesa to implement the
>     INTEL_performance_query extension.
>
>     On the flip side here, it's necessary to understand the perf
>     userspace interface (for most this is hidden by tools so the details
>     aren't common knowledge) as well as the internal design, considering
>     that the PMU we're looking at seems to break several current design
>     assumptions. I can only claim a limited familiarity with perf's
>     design, just as a result of this work.

Right; but a little effort and patience on both sides should get us
there I think. At worst we'll both learn something new ;-)

> - The current OA PMU driver breaks some significant design assumptions.
> 
>     Existing perf pmus are used for profiling work on a cpu and we're
>     introducing the idea of _IS_DEVICE pmus with different security
>     implications, the need to fake cpu-related data (such as user/kernel
>     registers) to fit with perf's current design, and adding _DEVICE
>     records as a way to forward device-specific status records.

GPUs are not the only devices that carry counters, so I think it might
make sense to look at extending perf to deal with such devices better.

>     The OA unit writes reports of counters into a circular buffer,
>     without involvement from the CPU, making our PMU driver the first of
>     a kind.

Agreed, this is somewhat 'odd' from where we are today.

>     Perf supports groups of counters and allows those to be read via
>     transactions internally but transactions currently seem designed to
>     be explicitly initiated from the cpu (say in response to a userspace
>     read()) and while we could pull a report out of the OA buffer we
>     can't trigger a report from the cpu on demand.
>
>     Related to being report based; the OA counters are configured in HW
>     as a set while perf generally expects counter configurations to be
>     orthogonal. Although counters can be associated with a group leader
>     as they are opened, there's no clear precedent for being able to
>     provide group-wide configuration attributes and no obvious solution
>     as yet that's expected to be acceptable to upstream and meets our
>     userspace needs.

I'm not entirely sure what you mean by group-wide configuration
attributes; could you elaborate?

>     We currently avoid using perf's grouping feature
>     and forward OA reports to userspace via perf's 'raw' sample field.
>     This suits our userspace well considering how coupled the counters
>     are when dealing with normalizing. It would be inconvenient to split
>     counters up into separate events, only to require userspace to
>     recombine them. 

So IF you were using a group, a single read from the leader can return
a vector of all values (PERF_FORMAT_GROUP); this avoids having to
recombine them in userspace.

Another option would be to view the arrival of an OA vector in the
datastream as an 'event' and generate a PERF_RECORD_READ in the perf
buffer (which again can use the GROUP vector format).
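
To make the vector format concrete, here is a minimal sketch in C of
decoding what read() on a group leader returns when PERF_FORMAT_GROUP is
set. The names group_read, group_value and parse_group_read are made up
for illustration; only the PERF_FORMAT_* flags and the word layout come
from the perf uapi header. It assumes no read_format bits beyond the
four tested below are in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* One counter of the group (hypothetical helper type). */
struct group_value {
    uint64_t value;
    uint64_t id;            /* 0 unless PERF_FORMAT_ID was requested */
};

/* Header words preceding the values (hypothetical helper type). */
struct group_read {
    uint64_t nr;            /* number of counters in the group */
    uint64_t time_enabled;  /* 0 unless PERF_FORMAT_TOTAL_TIME_ENABLED */
    uint64_t time_running;  /* 0 unless PERF_FORMAT_TOTAL_TIME_RUNNING */
};

/*
 * Decode 'nwords' u64 words obtained from read() on a group leader.
 * Layout: nr, [time_enabled], [time_running], then nr x { value, [id] }.
 * Returns the number of counters decoded, or -1 if the buffer is too
 * short, 'out' is too small, or PERF_FORMAT_GROUP is not set.
 */
static int parse_group_read(const uint64_t *buf, size_t nwords,
                            uint64_t read_format,
                            struct group_read *hdr,
                            struct group_value *out, size_t max_out)
{
    size_t pos = 0;
    size_t per_value = (read_format & PERF_FORMAT_ID) ? 2 : 1;

    assert(buf && hdr && out);
    if (!(read_format & PERF_FORMAT_GROUP) || nwords < 1)
        return -1;

    hdr->nr = buf[pos++];
    hdr->time_enabled = 0;
    hdr->time_running = 0;
    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
        if (pos >= nwords)
            return -1;
        hdr->time_enabled = buf[pos++];
    }
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) {
        if (pos >= nwords)
            return -1;
        hdr->time_running = buf[pos++];
    }

    /* Reject counts that do not fit the input or the output array. */
    if (hdr->nr > max_out || hdr->nr > (nwords - pos) / per_value)
        return -1;

    for (uint64_t i = 0; i < hdr->nr; i++) {
        out[i].value = buf[pos++];
        out[i].id = (read_format & PERF_FORMAT_ID) ? buf[pos++] : 0;
    }
    return (int)hdr->nr;
}
```

Userspace would call this on the bytes returned by a single read() of
the leader fd, so all counters of one OA-style report arrive together
and nothing has to be recombined by hand.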

>     Related to counter orthogonality; we can't time share the OA unit,
>     while event scheduling is a central design idea within perf for
>     allowing userspace to open + enable more events than can be
>     configured in HW at any one time.

So we have other PMUs that cannot do this; Gen OA would not be unique in
this. Intel PT for example only allows a single active event.

That said; earlier today I saw:

  https://www.youtube.com/watch?v=9J3BQcAeHpI&list=PLe6I3NKr-I4J2oLGXhGOeBMEjh8h10jT3&index=7

where exactly this feature was mentioned as not fitting well into the
existing GPU performance interfaces (GL_AMD_performance_monitor /
GL_INTEL_performance_query).

So there is hardware (Nvidia) out there that does support this. Also
mentioned was that this hardware has global and local counters, where
the local ones are specific to a rendering context. That is not unlike
the per-cpu / per-task stuff perf does.

>     The OA unit is not designed to
>     allow re-configuration while in use. We can't reconfigure the OA
>     unit without losing internal OA unit state which we can't access
>     explicitly to save and restore. Reconfiguring the OA unit is also
>     relatively slow, involving ~100 register writes. From userspace Mesa
>     also depends on a stable OA configuration when emitting
>     MI_REPORT_PERF_COUNT commands and importantly the OA unit can't be
>     disabled while there are outstanding MI_RPC commands lest we hang
>     the command streamer.

Right; see the PERF_PMU_CAP_EXCLUSIVE stuff.

> - We may be making some technical compromises at the moment for the
>   sake of using perf.
> 
>     perf_event_open() requires events to either relate to a pid or a
>     specific cpu core, while our device pmu relates to neither.  Events
>     opened with a pid will be automatically enabled/disabled according
>     to the scheduling of that process - so not appropriate for us.

Right; the traditional cpu/pid mapping doesn't work well for devices;
but maybe, with some work, we can create something like that
global/local render context from it; although I've no clue what form
that would need at this time.

>     When
>     an event is related to a cpu id, perf ensures pmu methods will be
>     invoked via an inter-processor interrupt on that core. To avoid
>     invasive changes our userspace opens OA perf events for a specific
>     cpu. 

Some of that might still make sense, given that GPUs are subject to the
NUMA topology of machines. I would think you would want most such
things to be done on the node the device is attached to.

Granted, this might not be a concern for Intel graphics, but it might be
relevant for some of the discrete GPUs.
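
For reference, the pid/cpu pairing discussed above can be summarised
as a small C sketch. The enum and function names are hypothetical; the
four cases follow the documented perf_event_open() semantics for
flags == 0 (with PERF_FLAG_PID_CGROUP the pid argument means something
else and is outside this sketch).

```c
#include <assert.h>
#include <sys/types.h>

/* How perf interprets the (pid, cpu) pair; names are illustrative. */
enum perf_target {
    PERF_TARGET_INVALID,     /* pid == -1 && cpu == -1: rejected */
    PERF_TARGET_TASK,        /* follows one task on any cpu */
    PERF_TARGET_TASK_ON_CPU, /* one task, only while on one cpu */
    PERF_TARGET_CPU,         /* everything running on one cpu */
};

/*
 * Classify a (pid, cpu) pair as perf_event_open() would with flags == 0.
 * pid == 0 means the calling task; pid == -1 means "no task";
 * cpu == -1 means "any cpu". Smaller values are a caller bug here.
 */
static enum perf_target classify_perf_target(pid_t pid, int cpu)
{
    assert(pid >= -1 && cpu >= -1);

    if (pid == -1)
        return cpu == -1 ? PERF_TARGET_INVALID : PERF_TARGET_CPU;
    return cpu == -1 ? PERF_TARGET_TASK : PERF_TARGET_TASK_ON_CPU;
}
```

A device PMU fits none of the task-bound cases, which is why the OA
events end up as PERF_TARGET_CPU in this sketch: pid == -1 with some
cpu chosen by userspace, ideally one on the node the device hangs off.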

> - I'm not confident our use case benefits much from building on perf:
> 
>     We aren't using existing perf based tooling with our PMU. Existing
>     tools typically assume you're profiling work running on a cpu, e.g.
>     expecting samples to be associated with instruction pointers and
>     user/kernel registers and aiming to represent metrics in relation
>     to application source code. We're forwarding fake register values
>     and userspace needs to know how to decode the raw OA reports
>     before anything can be reported to a user.
>     
>     With the buffering done by the OA unit I don't think we currently
>     benefit from perf's mmapped circular buffer interface. We already
>     have a decoupled producer and consumer and since we have to copy out
>     of the OA buffer, it would work well for us to hide that copy in
>     a simpler read() based interface.
> 
> 
> - Logistically it might be more practical to contain this to the
>   graphics stack.
> 
>     It seems fair to consider that if we can't see a very compelling
>     benefit to building on perf, then containing this work to
>     drivers/gpu/drm/i915 may simplify the review process as well as
>     future maintenance and development.

> Peter; I wonder if you would tend to agree too that it could make sense
> for us to go with our own interface here?

Sorry this took so long; this wanted a well considered response and
those tend to get delayed in light of 'urgent' stuff.

While I can certainly see the pain points and why you would rather not
deal with them, I think it would make Linux a better place if we could
manage to come up with a generic interface that would work for 'all'
GPUs (and possibly more devices).