From: Sourab Gupta <sourab.gupta@xxxxxxxxx>

This is the updated patch series (v3, changes listed at the end), which adds
support for capturing OA counter snapshots for multiple contexts by inserting
MI_REPORT_PERF_COUNT commands into the CS and forwarding these snapshots to
userspace through the perf interface.

This work is based on Robert Bragg's perf event framework, the patches for
which were floated earlier:
http://lists.freedesktop.org/archives/intel-gfx/2015-May/066102.html

Robert's perf event framework enabled capture of periodic OA counter snapshots
by configuring the OA unit during perf event init. The raw OA reports generated
by the HW are then forwarded to userspace using the perf APIs. But there may be
use cases that need more than the periodic OA capture functionality currently
supported by perf_event. A few such use cases are:
  - Ability to capture system-wide metrics. It should be possible to map the
    captured reports back to individual contexts.
  - Ability to inject tags for work into the reports. This provides
    visibility into the multiple stages of work within a single context.

The framework proposed here may also be seen as a way to overcome a limitation
of Haswell, which doesn't write out a context ID with OA reports. Handling this
in the kernel makes sense when we plan for compatibility with Broadwell, which
does include the context ID in reports.

This can be achieved by inserting commands into the ring, before and after the
batchbuffer, to dump the OA counter snapshots. The reports generated can have
an additional footer appended to capture metadata such as ctx id, pid, tags,
etc. The specific issue of counter wraparound due to large batchbuffers can be
worked around by using these reports in conjunction with periodic OA snapshots.
Such per-BB data gives userspace tools useful information for analyzing
performance and timing at the batchbuffer level.
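
As a rough illustration of the before/after snapshot idea and the wraparound
concern (not code from the patches), the sketch below computes a counter total
for one batchbuffer. It assumes a 32-bit free-running counter and that the
begin, periodic and end snapshots are already in timestamp order; the function
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Delta between two raw snapshots of one counter. Assumes a 32-bit
 * free-running counter (an assumption of this sketch). Unsigned
 * subtraction gives the right answer as long as the counter wrapped
 * at most once between the two snapshots.
 */
static uint32_t oa_counter_delta(uint32_t begin, uint32_t end)
{
	return end - begin;
}

/*
 * Total for one batchbuffer: snaps[0] is the snapshot taken before the
 * batch, snaps[n - 1] the one taken after it, and anything in between
 * is a periodic snapshot that landed while the batch was running.
 * Summing per-interval deltas into 64 bits stays correct even if the
 * counter wraps several times overall, provided it wraps at most once
 * per interval.
 */
static uint64_t oa_counter_total(const uint32_t *snaps, size_t n)
{
	uint64_t total = 0;
	size_t i;

	assert(snaps != NULL && n >= 2);

	for (i = 1; i < n; i++)
		total += oa_counter_delta(snaps[i - 1], snaps[i]);

	return total;
}
```

With only the begin and end snapshots, a counter that wraps twice during a long
batch is indistinguishable from one that wraps once; the intermediate periodic
snapshots are what remove that ambiguity.
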
An application intending to profile its own contexts can do so by submitting
MI_REPORT_PERF_COUNT commands into the CS from userspace itself. But consider
the use case of a system-wide GPU profiler tool, which needs the data for all
the workloads being scheduled on the GPU globally. Doing this in the kernel is
significantly less complex than supporting such a use case through userspace.
This framework is intended to serve such system-wide GPU profilers, which may
use this data for performance analysis (at a global level), identifying
optimization scenarios for improving GPU utilization, CPU vs GPU timing
analysis, etc. Again, this is made possible by the presence of metadata with
individual reports, which this framework enables.

One such system-wide GPU profiler is the MVP (Modular Video Profiler) tool,
used by the media team for profiling media workloads.

The current implementation approach is to forward these samples through the
same PERF_SAMPLE_RAW sample type used for periodic samples, with an additional
footer appended for metadata. Userspace can then distinguish these samples by
filtering on sample size.

One of the other approaches being contemplated is creating separate sample
types to handle these different kinds of samples. There would be different fds
associated with these sample types, though they can be part of one event group.
Userspace can listen to either or both of these sample types by specifying
event attributes during event init. But right now I see this as a future
refinement, contingent on acceptance of the general framework. For now, I'm
looking for feedback on these initial patches w.r.t. the usage of the perf APIs
and the interaction with i915.

Another feature introduced in these patches is execbuffer tagging.
It is a mechanism whereby the reports collected are marked with a tag passed by
userspace during the execbuffer call. This way the userspace tool can associate
the reports collected with the corresponding execbuffers. This satisfies the
requirement to have visibility into multiple stages (i.e. execbuffers) lying
within a single context. For example, in the media pipeline the CodecHAL
encoding stage has a single context and involves multiple stages such as
Scaling, ME, MBEnc and PAK, for which there are separate execbuffer calls. The
reports generated need the granularity of these multiple stages of a context.
The presence of a tag in the report metadata fulfills this requirement.

One of the prerequisites for this work is the presence of a globally unique id
associated with each context. The present context id is specific to a drm fd.
As such, it can't be used to uniquely associate the reports generated with the
corresponding context scheduled from userspace in a global way. In the absence
of a globally unique context id, other metadata such as pid/tags in conjunction
with the ctx id may be used to associate reports with the corresponding
contexts.

The first patch in the series introduces a global context id, and the
subsequent patches introduce the multi-context OA capture mode and the
mechanism to forward these snapshots using perf.

This patch set currently supports Haswell. Gen8+ support can be added once the
basic framework is agreed upon.
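
To illustrate the execbuffer tagging scheme: per the v3 notes, the tag travels
in the upper 32 bits of the execbuffer rsvd1 field. The sketch below shows one
way the packing could look, assuming the lower 32 bits keep carrying the
context id. The macro and helper names are made up for this illustration and
are not the identifiers used in the patches.

```c
#include <stdint.h>

/* Hypothetical names for this sketch; not taken from the patches. */
#define EXEC_CTX_ID_MASK	0xffffffffull
#define EXEC_TAG_SHIFT		32

/* Combine a context id (lower 32 bits) and a tag (upper 32 bits). */
static uint64_t exec_pack_rsvd1(uint32_t ctx_id, uint32_t tag)
{
	return ((uint64_t)tag << EXEC_TAG_SHIFT) | ctx_id;
}

/* Recover the context id from a packed rsvd1 value. */
static uint32_t exec_ctx_id(uint64_t rsvd1)
{
	return (uint32_t)(rsvd1 & EXEC_CTX_ID_MASK);
}

/* Recover the tag from a packed rsvd1 value. */
static uint32_t exec_tag(uint64_t rsvd1)
{
	return (uint32_t)(rsvd1 >> EXEC_TAG_SHIFT);
}
```

A profiler could, for example, use a distinct tag value for each of the
Scaling, ME, MBEnc and PAK submissions and then group the returned reports by
tag.
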

v2: This patch series has the following changes wrt the one floated earlier:
  - Removing synchronous waits during event stop/destroy
  - segregating the book-keeping data for the samples from destination buffer
    and collecting it into a separate list
  - managing the lifetime of destination buffer with the help of gem active
    reference tracking
  - having the scope of i915 device mutex limited to places of gem interaction
    and having the pmu data structures protected with a per pmu lock
  - userspace can now control the metadata it wants by requesting the same
    during event init. The sample is sent with the requested metadata in a
    packed format.
  - Some patches merged together and a few more introduced

v3: Changes made:
  - Global id for ctx allocated from a separate cyclic idr, and not trying to
    overload existing drm fd specific ctx id for this purpose.
  - Meeting semantics for flush (ensuring to flush samples before returning)
  - spin_locks used in place of spin_lock_irqsave
  - execbuffer tag now uses upper 32 bits of rsvd1 field.
  - Some code restructuring/optimization, better nomenclature and error
    handling.
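
For readers wondering how userspace might consume the packed metadata and
filter on sample size, here is a minimal sketch. The flag bits, the field
order (ctx id, then pid, then tag) and the one-u32-per-field layout are all
assumptions made for illustration; the actual uapi in the patches may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits; the real uapi may use different ones. */
#define META_CTX_ID	(1u << 0)
#define META_PID	(1u << 1)
#define META_TAG	(1u << 2)

struct oa_metadata {
	uint32_t ctx_id;
	uint32_t pid;
	uint32_t tag;
};

/* Footer size implied by the requested flags: one u32 per field. */
static size_t oa_footer_size(uint32_t flags)
{
	size_t n = 0;

	if (flags & META_CTX_ID)
		n += sizeof(uint32_t);
	if (flags & META_PID)
		n += sizeof(uint32_t);
	if (flags & META_TAG)
		n += sizeof(uint32_t);
	return n;
}

/*
 * Classify a raw sample by its size.
 * Returns 0 for a plain periodic report (no footer), 1 for a CS based
 * report with the footer decoded into *md, and -1 for an unexpected
 * size. Fields that were not requested are left as 0.
 */
static int oa_parse_sample(const uint8_t *raw, size_t raw_size,
			   size_t report_size, uint32_t flags,
			   struct oa_metadata *md)
{
	const uint8_t *p;

	assert(raw != NULL && md != NULL);
	memset(md, 0, sizeof(*md));

	if (raw_size == report_size)
		return 0;
	if (raw_size != report_size + oa_footer_size(flags))
		return -1;

	p = raw + report_size;
	if (flags & META_CTX_ID) {
		memcpy(&md->ctx_id, p, sizeof(uint32_t));
		p += sizeof(uint32_t);
	}
	if (flags & META_PID) {
		memcpy(&md->pid, p, sizeof(uint32_t));
		p += sizeof(uint32_t);
	}
	if (flags & META_TAG)
		memcpy(&md->tag, p, sizeof(uint32_t));
	return 1;
}
```

Here report_size is whatever raw OA report size the chosen report format
produces, and flags must match what was requested at event init, since the
packed footer carries no self-describing header in this sketch.
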

Sourab Gupta (8):
  drm/i915: Introduce global id for contexts
  drm/i915: Introduce mode for capture of multi ctx OA reports synchronized with RCS
  drm/i915: Add mechanism for forwarding CS based OA counter snapshots through perf
  drm/i915: Forward periodic and CS based OA reports sorted acc to timestamps
  drm/i915: Handle event stop and destroy for commands in flight
  drm/i915: Insert commands for capture of OA counters in the ring
  drm/i915: Add support for having pid output with OA report
  drm/i915: Add support to add execbuffer tags to OA counter reports

 drivers/gpu/drm/i915/i915_drv.h            |  53 ++-
 drivers/gpu/drm/i915/i915_gem_context.c    |  19 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   6 +
 drivers/gpu/drm/i915/i915_oa_perf.c        | 604 +++++++++++++++++++++++++----
 include/uapi/drm/i915_drm.h                |  25 +-
 5 files changed, 635 insertions(+), 72 deletions(-)

-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx