On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:

> - We're bridging two complex architectures
>
>   To review this work I think it will be relevant to have a good
>   general familiarity with Gen graphics (e.g. thinking about the OA
>   unit's interaction with the command streamer and execlist
>   scheduling) as well as our userspace architecture and how we're
>   consuming OA data within Mesa to implement the
>   INTEL_performance_query extension.
>
>   On the flip side here, it's necessary to understand the perf
>   userspace interface (for most this is hidden by tools so the
>   details aren't common knowledge) as well as the internal design,
>   considering that the PMU we're looking at seems to break several
>   current design assumptions. I can only claim a limited familiarity
>   with perf's design, just as a result of this work.

Right; but a little effort and patience on both sides should get us
there I think. At worst we'll both learn something new ;-)

> - The current OA PMU driver breaks some significant design
>   assumptions.
>
>   Existing perf pmus are used for profiling work on a cpu and we're
>   introducing the idea of _IS_DEVICE pmus with different security
>   implications, the need to fake cpu-related data (such as
>   user/kernel registers) to fit with perf's current design, and
>   adding _DEVICE records as a way to forward device-specific status
>   records.

There are more devices with counters on them than GPUs, so I think it
might make sense to look at extending perf to better deal with this.

>   The OA unit writes reports of counters into a circular buffer,
>   without involvement from the CPU, making our PMU driver the first
>   of its kind.

Agreed, this is somewhat 'odd' from where we are today.

>   Perf supports groups of counters and allows those to be read via
>   transactions internally, but transactions currently seem designed
>   to be explicitly initiated from the cpu (say in response to a
>   userspace read()) and while we could pull a report out of the OA
>   buffer we can't trigger a report from the cpu on demand.
>
>   Related to being report based; the OA counters are configured in
>   HW as a set while perf generally expects counter configurations to
>   be orthogonal. Although counters can be associated with a group
>   leader as they are opened, there's no clear precedent for being
>   able to provide group-wide configuration attributes and no obvious
>   solution as yet that's expected to be acceptable to upstream and
>   meets our userspace needs.

I'm not entirely sure what you mean by group-wide configuration
attributes; could you elaborate?

>   We currently avoid using perf's grouping feature and forward OA
>   reports to userspace via perf's 'raw' sample field. This suits our
>   userspace well considering how coupled the counters are when
>   normalizing. It would be inconvenient to split the counters up
>   into separate events, only to require userspace to recombine them.

So IF you were using a group, a single read from the leader can return
a vector of all values (PERF_FORMAT_GROUP), which avoids having to do
that recombine (a minimal sketch of such a group read follows below).

Another option would be to view the arrival of an OA vector in the
data stream as an 'event' and generate a PERF_RECORD_READ in the perf
buffer (which again can use the GROUP vector format).

>   Related to counter orthogonality; we can't time share the OA unit,
>   while event scheduling is a central design idea within perf for
>   allowing userspace to open + enable more events than can be
>   configured in HW at any one time.
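To make the PERF_FORMAT_GROUP point concrete, here is a minimal
userspace sketch (mine, for illustration only; it uses two ordinary
software events as placeholders rather than Robert's OA PMU): the
events are opened as a group and a single read() on the leader
returns the whole vector of values.

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
            return syscall(__NR_perf_event_open, attr, pid, cpu,
                           group_fd, flags);
    }

    int main(void)
    {
            struct perf_event_attr attr;
            int leader, member;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_SOFTWARE;
            attr.config = PERF_COUNT_SW_TASK_CLOCK;
            /* one read() on the leader returns all group members' values */
            attr.read_format = PERF_FORMAT_GROUP;

            leader = perf_event_open(&attr, 0, -1, -1, 0);

            attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
            member = perf_event_open(&attr, 0, -1, leader, 0);

            /* PERF_FORMAT_GROUP layout: u64 nr; followed by nr u64 values */
            struct { uint64_t nr; uint64_t values[2]; } data;

            if (leader >= 0 && member >= 0 &&
                read(leader, &data, sizeof(data)) > 0)
                    printf("%llu counters: %llu %llu\n",
                           (unsigned long long)data.nr,
                           (unsigned long long)data.values[0],
                           (unsigned long long)data.values[1]);

            if (member >= 0)
                    close(member);
            if (leader >= 0)
                    close(leader);
            return 0;
    }

A PERF_RECORD_READ record generated into the mmap buffer carries the
same read_format payload, so the GROUP vector layout above would apply
to that second option as well.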
So we have other PMUs that cannot do this; Gen OA would not be unique
in this. Intel PT for example only allows a single active event.

That said, earlier today I saw:

  https://www.youtube.com/watch?v=9J3BQcAeHpI&list=PLe6I3NKr-I4J2oLGXhGOeBMEjh8h10jT3&index=7

where exactly this feature was mentioned as not fitting well into the
existing GPU performance interfaces (GL_AMD_performance_monitor /
GL_INTEL_performance_query). So there is hardware (Nvidia) out there
that does support this.

Also mentioned was that this hardware has global and local counters,
where the local ones are specific to a rendering context. That is not
unlike the per-cpu / per-task stuff perf does.

>   The OA unit is not designed to allow re-configuration while in
>   use. We can't reconfigure the OA unit without losing internal OA
>   unit state which we can't access explicitly to save and restore.
>   Reconfiguring the OA unit is also relatively slow, involving ~100
>   register writes. From userspace Mesa also depends on a stable OA
>   configuration when emitting MI_REPORT_PERF_COUNT commands and
>   importantly the OA unit can't be disabled while there are
>   outstanding MI_RPC commands lest we hang the command streamer.

Right; see the PERF_PMU_CAP_EXCLUSIVE stuff (roughly sketched further
below).

> - We may be making some technical compromises a.t.m for the sake of
>   using perf.
>
>   perf_event_open() requires events to either relate to a pid or a
>   specific cpu core, while our device pmu relates to neither. Events
>   opened with a pid will be automatically enabled/disabled according
>   to the scheduling of that process - so not appropriate for us.

Right; the traditional cpu/pid mapping doesn't work well for devices;
but maybe, with some work, we can create something like that
global/local render context from it, although I've no clue what form
that would need to take at this time.

>   When an event is related to a cpu id, perf ensures pmu methods
>   will be invoked via an inter-processor interrupt on that core. To
>   avoid invasive changes our userspace opens OA perf events for a
>   specific cpu.

Some of that might still make sense given that GPUs are subject to the
NUMA topology of machines: I would think you would want most such
things to be done on the node the device is attached to. Granted, this
might not be a concern for Intel graphics, but it might be relevant
for some of the discrete GPUs. (The shape of such a cpu-bound open is
sketched further below.)

> - I'm not confident our use case benefits much from building on
>   perf:
>
>   We aren't using existing perf based tooling with our PMU. Existing
>   tools typically assume you're profiling work running on a cpu,
>   e.g. expecting samples to be associated with instruction pointers
>   and user/kernel registers and aiming to represent metrics in
>   relation to application source code. We're forwarding fake
>   register values and userspace needs to know how to decode the raw
>   OA reports before anything can be reported to a user.
>
>   With the buffering done by the OA unit I don't think we currently
>   benefit from perf's mmapped circular buffer interface. We already
>   have a decoupled producer and consumer and since we have to copy
>   out of the OA buffer, it would work well for us to hide that copy
>   in a simpler read() based interface.
>
> - Logistically it might be more practical to contain this to the
>   graphics stack.
>
>   It seems fair to consider that if we can't see a very compelling
>   benefit to building on perf, then containing this work to
>   drivers/gpu/drm/i915 may simplify the review process as well as
>   future maintenance and development.
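On the PERF_PMU_CAP_EXCLUSIVE remark above, a rough kernel-side sketch
(the names and stub callbacks are made up for illustration; this is
not the i915 code): a PMU that cannot be time-shared sets that
capability so the core rejects conflicting events up front instead of
trying to multiplex them, as Intel PT does.

    #include <linux/errno.h>
    #include <linux/perf_event.h>

    static int  dummy_oa_event_init(struct perf_event *event) { return -ENOENT; }
    static int  dummy_oa_add(struct perf_event *event, int flags) { return 0; }
    static void dummy_oa_del(struct perf_event *event, int flags) { }
    static void dummy_oa_start(struct perf_event *event, int flags) { }
    static void dummy_oa_stop(struct perf_event *event, int flags) { }
    static void dummy_oa_read(struct perf_event *event) { }

    static struct pmu dummy_oa_pmu = {
            /* tell the core this PMU cannot be time-shared/multiplexed */
            .capabilities   = PERF_PMU_CAP_EXCLUSIVE,
            .event_init     = dummy_oa_event_init,
            .add            = dummy_oa_add,
            .del            = dummy_oa_del,
            .start          = dummy_oa_start,
            .stop           = dummy_oa_stop,
            .read           = dummy_oa_read,
    };

    /* registration would be:
     *   perf_pmu_register(&dummy_oa_pmu, "dummy_oa", -1);
     */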
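And on the pid/cpu pairing quoted above, another minimal userspace
sketch (again just an illustration with a placeholder software event,
not the OA PMU): perf_event_open() with pid = -1 and an explicit cpu
yields an event that isn't tied to any task's scheduling but is bound
to that core, which is the shape of call the current userspace makes.

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            struct perf_event_attr attr;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_SOFTWARE;        /* placeholder type/config */
            attr.config = PERF_COUNT_SW_CPU_CLOCK;
            attr.disabled = 1;                     /* enable later via ioctl() */

            /* pid = -1, cpu = 0: not tied to a task, pinned to CPU 0 */
            fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
            if (fd < 0)
                    perror("perf_event_open");
            else
                    close(fd);
            return 0;
    }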
> Peter; I wonder if you would tend to agree too that it could make
> sense for us to go with our own interface here?

Sorry this took so long; this wanted a well considered response and
those tend to get delayed in light of 'urgent' stuff.

While I can certainly see the pain points and why you would rather not
deal with them, I think it would make Linux a better place if we could
manage to come up with a generic interface that would work for 'all'
GPUs (and possibly more devices).