On Fri, Oct 16, 2015 at 10:43 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:
>> - We're bridging two complex architectures
>>
>> To review this work I think it will be relevant to have a good
>> general familiarity with Gen graphics (e.g. thinking about the OA
>> unit's interaction with the command streamer and execlist
>> scheduling) as well as our userspace architecture and how we're
>> consuming OA data within Mesa to implement the
>> INTEL_performance_query extension.
>>
>> On the flip side here, its necessary to understand the perf
>> userspace interface (for most this is hidden by tools so the details
>> aren't common knowledge) as well as the internal design, considering
>> that the PMU we're looking at seems to break several current design
>> assumptions. I can only claim a limited familiarity with perf's
>> design, just as a result of this work.
>
> Right; but a little effort and patience on both sides should get us
> there I think. At worst we'll both learn something new ;-)

I suppose I'm also concerned that time is an important factor. When it
comes to the OA metrics, we already have userspace tools that could be
more widely used by developers once we have an upstream interface.

Today perf isn't very well suited to our OA unit use case, and although
we may be able to change that - and I can try to help with that - at
this point I think I'd prefer not to block moving forward in the
meantime with the alternative i915 interface.

Although code-wise it didn't require any big changes to events/core to
get an initial perf based driver working for our use case, we have
raised a number of quite significant design questions and arguably cut
some corners, which could take a long time to resolve properly.
I also tend to think it's an open question at this stage whether it
would really be in everyone's interest to take perf in this direction
without a clear sense of the benefits it brings in comparison to the
complexity it may add.

It's also a bit awkward that I had already started to move ahead with
this idea of upstreaming a non-perf based driver for the OA unit after
asking Daniel Vetter about this on IRC. There are some knock-on effects
here too; Sourab Gupta is looking at building on this OA driver and has
now started adapting his work for this non-perf approach.

>
>> - The current OA PMU driver breaks some significant design assumptions.
>>
>> Existing perf pmus are used for profiling work on a cpu and we're
>> introducing the idea of _IS_DEVICE pmus with different security
>> implications, the need to fake cpu-related data (such as user/kernel
>> registers) to fit with perf's current design, and adding _DEVICE
>> records as a way to forward device-specific status records.
>
> There are more devices with counters on than GPUs, so I think it might
> make sense to look at extending perf to better deal with this.

I wonder if it could be good to look at exposing some of the mmio
accessible Gen graphics counters before tackling a more complex case
like the OA unit. We have a number of counters that could be
interesting to sample periodically via a hrtimer, that require no
configuration, are global (so no need to specify a gpu context) but as
they relate to the GPU an _IS_DEVICE pmu would still be appropriate.
Some of these seem like they could be better suited to being exposed
via perf than OA unit counters so they might be a helpful stepping
stone.

>
>> The OA unit writes reports of counters into a circular buffer,
>> without involvement from the CPU, making our PMU driver the first of
>> a kind.
>
> Agreed, this is somewhat 'odd' from where we are today.
>
>> Perf supports groups of counters and allows those to be read via
>> transactions internally but transactions currently seem designed to
>> be explicitly initiated from the cpu (say in response to a userspace
>> read()) and while we could pull a report out of the OA buffer we
>> can't trigger a report from the cpu on demand.
>>
>> Related to being report based; the OA counters are configured in HW
>> as a set while perf generally expects counter configurations to be
>> orthogonal. Although counters can be associated with a group leader
>> as they are opened, there's no clear precedent for being able to
>> provide group-wide configuration attributes and no obvious solution
>> as yet that's expected to be acceptable to upstream and meets our
>> userspace needs.
>
> I'm not entirely sure what you mean with group-wide configuration
> attributes; could you elaborate?

Here I'm thinking of configuration details that conceptually relate to
a set of OA unit counters, not individual events/counters:

- The choice of 'metric set' which represents a MUX configuration +
  boolean logic configuration for a set of counters that will be
  included in the reports written by the OA unit.
- The OA unit exponent for periodic sampling applies to the whole
  group.
- The choice of report layout which the OA unit writes all the
  counters in.
- The choice to profile a single context or system-wide applies to the
  group, as well as the specification of a file descriptor + context
  ID in the single-context case.

>
>> We currently avoid using perf's grouping feature
>> and forward OA reports to userspace via perf's 'raw' sample field.
>> This suits our userspace well considering how coupled the counters
>> are when dealing with normalizing. It would be inconvenient to split
>> counters up into separate events, only to require userspace to
>> recombine them.
>
> So IF you were using a group, a single read from the leader can return
> you a vector of all values (PERF_FORMAT_GROUP), this avoids having to
> do that recombine.

Although recombining isn't necessary with _FORMAT_GROUP, this vector
layout could be similarly inconvenient for userspace... afaict we
couldn't avoid also requesting the _ID to be included in the vector, so
instead of the 32/40bits per counter we would now have 16bytes per
counter which would seem to lose some benefit from a compact HW layout
to minimize our use of memory bandwidth.

Userspace has to be aware that we're placing 32bit or 40bit values
within a 64bit field and the values will overflow as such.

As userspace receives reports it calculates deltas between sequential
reports to accumulate 64bit counters. For each event in the vector it
needs to look up the index of the accumulated value, and a lookup based
on a u64 event id for each counter isn't as direct as with a rigid
report layout. (I don't know what assumptions can be made about the
vector ordering from one sample to the next, but maybe a fixed mapping
could be made after the first sample.)

When calculating the delta userspace needs to be careful not to treat
the vector values as 64bit but as 32/40bit to account for overflow.

Mesa will still have to be able to handle the raw OA report layouts to
process reports collected via the command stream using
MI_REPORT_PERF_COUNT commands, so any alternative at least represents
some amount of extra layout handling code and needing to handle
different combinations of raw or vector layouts when calculating a
single delta.

>
> Another option would be to view the arrival of an OA vector in the
> datastream as an 'event' and generate a PERF_RECORD_READ in the perf
> buffer (which again can use the GROUP vector format).

Tbh I couldn't really figure out what PERF_RECORD_READ is intended for -
the userspace code I found currently ignores these.
The extensible sample design seems more appropriate, but it could be
good if samples were extensible by device drivers as an alternative to
packing data into the raw field.

Sourab who's been building on my base OA driver is exposing more data
as part of samples, including a context ID, and we were extending what
we included in the raw field (we wouldn't get this extra data via
_RECORD_READ). Conceptually the aim was to expose event specific sample
flags, to extend samples (mirroring the existing design for pre-defined
sample flags, except driver extensible). We just used the raw field as
the most convenient extension point to start with.

Extending the raw field has some difficulties though because
events/core currently only lets a pmu give a single raw data pointer
plus len which will be copied into the ring buffer, so to include more
than the OA report we'd have to copy the report into an intermediate
larger buffer. I'd been considering allowing a vector of data+len
values to be specified for copying the raw data.

>
>> Related to counter orthogonality; we can't time share the OA unit,
>> while event scheduling is a central design idea within perf for
>> allowing userspace to open + enable more events than can be
>> configured in HW at any one time.
>
> So we have other PMUs that cannot do this; Gen OA would not be unique in
> this. Intel PT for example only allows a single active event.
>
> That said; earlier today I saw:
>
> https://www.youtube.com/watch?v=9J3BQcAeHpI&list=PLe6I3NKr-I4J2oLGXhGOeBMEjh8h10jT3&index=7
>
> where exactly this feature was mentioned as not fitting well into the
> existing GPU performance interfaces (GL_AMD_performance_monitor /
> GL_INTEL_performance_query).
I think Samuel was generally commenting that these Intel/AMD GL
extensions aren't a good fit for Nvidia hw since it's awkward to
abstract the need to update the MUX configuration in-flight so they can
derive higher level counters from more inputs than the HW can expose at
the same time.

Unlike Intel and AMD, it looks like instead of defining a GL extension
for accessing performance counters Nvidia has a perfkit api which can
be used in conjunction with OpenGL, CUDA or Direct3D which Samuel has
also been implementing. I'm not sure what exactly about the AMD/Intel
GL extensions makes it tricky to abstract round-robin updates of the
MUXs internally vs what perfkit offers but since we wouldn't expect to
round-robin the OA counter configuration the INTEL_performance_query
spec authors wouldn't have given that any consideration.

>
> So there is hardware (Nvidia) out there that does support this. Also
> mentioned was that this hardware has global and local counters, where
> the local ones are specific to a rendering context. That is not unlike
> the per-cpu / per-task stuff perf does.

For reference, OA counters can be viewed as global or local counters.
When opening an event an application can request a system-wide view vs
single-context view.

>
>> The OA unit is not designed to
>> allow re-configuration while in use. We can't reconfigure the OA
>> unit without loosing internal OA unit state which we can't access
>> explicitly to save and restore. Reconfiguring the OA unit is also
>> relatively slow, involving ~100 register writes. From userspace Mesa
>> also depends on a stable OA configuration when emitting
>> MI_REPORT_PERF_COUNT commands and importantly the OA unit can't be
>> disabled while there are outstanding MI_RPC commands lest we hang
>> the command streamer.
>
> Right; see the PERF_PMU_CAP_EXCLUSIVE stuff.

This looks like it's incompatible with the group mechanism re: the
other question of exposing OA reports via the grouping mechanism vs raw
sample data.
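
On that raw report vs group vector question, here is a minimal sketch
(Python, purely for illustration) of the delta handling described
earlier: each raw value is treated as a 32bit or 40bit unsigned
quantity, and the difference between sequential reports is taken modulo
the counter width before being added to a wider running total. The
names counter_delta and ReportAccumulator are made up for this example
and are not taken from Mesa, and it assumes a counter wraps at most
once between two reports.

```python
def counter_delta(prev_raw, curr_raw, width_bits):
    """Increment between two raw snapshots of a width_bits-wide counter.

    Masking the difference to the counter width handles a single wrap
    between the two snapshots (assumption: no more than one wrap).
    """
    mask = (1 << width_bits) - 1
    return (curr_raw - prev_raw) & mask


class ReportAccumulator:
    """Accumulates wide totals from fixed-layout reports (hypothetical helper)."""

    def __init__(self, widths):
        self.widths = list(widths)            # per-counter width in bits, e.g. 32 or 40
        self.totals = [0] * len(self.widths)  # running totals, one per counter
        self.last = None                      # previous raw report, if any

    def add_report(self, raw_values):
        # The first report only establishes a baseline; later reports add deltas.
        if self.last is not None:
            for i, width in enumerate(self.widths):
                self.totals[i] += counter_delta(self.last[i], raw_values[i], width)
        self.last = list(raw_values)
```

With a rigid report layout the accumulated value for a counter is found
directly by its position in the report, as in the loop above; with a
_FORMAT_GROUP vector an extra event id to index mapping would be needed
before the same delta step.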
>
>> - We may be making some technical compromises a.t.m for the sake of
>> using perf.
>>
>> perf_event_open() requires events to either relate to a pid or a
>> specific cpu core, while our device pmu relates to neither. Events
>> opened with a pid will be automatically enabled/disabled according
>> to the scheduling of that process - so not appropriate for us.
>
> Right; the traditional cpu/pid mapping doesn't work well for devices;
> but maybe, with some work, we can create something like that
> global/local render context from it; although I've no clue what form
> that would need at this time.

Currently the way we identify a context is with the combination of a
file descriptor and a u32 context handle (only unique to that fd). The
use of a drm file descriptor here also relates to the security model
for accessing single context metrics, in that we don't require root
privileges to profile a context associated with a file descriptor that
the process has open.

Recently we've also been working with a globally unique context ID (not
upstream yet) and it could be interesting to also allow single context
profiling given a global ID, without any fd (comparable to passing a
pid to perf_event_open). This form of selection would probably require
root privileges by default.

>
>> When
>> an event is related to a cpu id, perf ensures pmu methods will be
>> invoked via an inter process interrupt on that core. To avoid
>> invasive changes our userspace opens OA perf events for a specific
>> cpu.
>
> Some of that might still make sense in the sense that GPUs are subject
> to the NUMA topology of machines. I would think you would want most
> such things to be done on the node the device is attached to.
>
> Granted, this might not be a concern for Intel graphics, but it might be
> relevant for some of the discrete GPUs.

I'm not sure whether Nouveau takes this kind of idea into consideration
or not.
>
>> - I'm not confident our use case benefits much from building on perf:
>>
>> We aren't using existing perf based tooling with our PMU. Existing
>> tools typically assume you're profiling work running on a cpu, e.g.
>> expecting samples to be associated with instruction pointers and
>> user/kernel registers and aiming to represent metrics in relation
>> to application source code. We're forwarding fake register values
>> and userspace needs needs to know how to decode the raw OA reports
>> before anything can be reported to a user.
>>
>> With the buffering done by the OA unit I don't think we currently
>> benefit from perf's mmapped circular buffer interface. We already
>> have a decoupled producer and consumer and since we have to copy out
>> of the OA buffer, it would work well for us to hide that copy in
>> a simpler read() based interface.
>>
>>
>> - Logistically it might be more practical to contain this to the
>> graphics stack.
>>
>> It seems fair to consider that if we can't see a very compelling
>> benefit to building on perf, then containing this work to
>> drivers/gpu/drm/i915 may simplify the review process as well as
>> future maintenance and development.
>
>> Peter; I wonder if you would tend to agree too that it could make sense
>> for us to go with our own interface here?
>
> Sorry this took so long; this wanted a well considered response and
> those tend to get delayed in light of 'urgent' stuff.
>
> While I can certainly see the pain points and why you would rather not
> deal with them. I think it would make Linux a better place if we could
> manage to come up with a generic interface that would work for 'all'
> GPUs (and possibly more devices).

I'm not sure a generic multi-vendor interface for gpu metrics is what's
at stake or really needed here. The perf driver I worked on didn't
attempt to provide a generic interface: E.g. counter normalizing is
relatively complex and HW specific but makes sense to leave to
userspace.
Normalizing may combine periodic OA reports (from perf) with
MI_REPORT_PERF_COUNT reports obtained via a command stream (not
involving perf). Normalizing might also be done offline or remotely for
a reduced runtime overhead.

Although it could be good to work more closely with Samuel on this, I
did raise the idea of using perf with him last year but I think his
priority is still with accessing metrics synchronized with the command
stream. We were looking at a perf-like interface to help support the
periodic sampling feature of the OA unit, but I'm not aware that Nvidia
HW has the same kind of feature.

I think it's more the norm that GPU drivers don't share a lot of kernel
interfaces given the significant architectural differences between
vendors. A large proportion of a GPU driver is typically in userspace
and interface standardisation is done at the OpenGL/CL level more so
than with kernel interfaces.

While perf may be adaptable to help access some gpu/device metrics it's
not in a good position to be involved with capturing metrics via
command streams in sync with specific commands submitted in userspace
so I wouldn't expect to aim for perf to be the sole interface for gpu
metrics.

Okay, sorry to still be inclined to prioritize the non-perf interface
at this stage, at least for OA unit metrics.

That said; I think it may still be practical to look at exposing other
mmio accessible counters of Gen graphics via perf and maybe these could
serve as a stepping stone before attempting to support the OA unit via
perf. I'm still open to trying to help with adapting perf in this
direction if you feel it could be worthwhile, but would like to
decouple the effort for now.

Regards,
- Robert