From: Sourab Gupta <sourab.gupta@xxxxxxxxx>
Cc: Robert Bragg <robert@xxxxxxxxxxxxx>, Zhenyu Wang <zhenyuw@xxxxxxxxxxxxxxx>, Jon Bloomfield <jon.bloomfield@xxxxxxxxx>, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>, Jabin Wu <jabin.wu@xxxxxxxxx>, Insoo Woo <insoo.woo@xxxxxxxxx>

This patch series adds support for capturing OA counter snapshots at asynchronous points by inserting MI_REPORT_PERF_COUNT commands into the CS, and for forwarding these snapshots to userspace through the perf interface. The commands can be inserted at asynchronous points during workload execution, e.g. at batch buffer boundaries.

This work is based on Robert Bragg's perf event framework, the patches for which were posted earlier. Please see the link below:
http://lists.freedesktop.org/archives/intel-gfx/2015-May/066102.html

The perf event framework enables capture of periodic OA counter snapshots by configuring the OA unit during perf event init. The raw OA reports generated by the HW are then forwarded to userspace using the perf APIs.

There are use cases that need more than the periodic OA capture functionality currently supported by perf_event. A few such use cases are:
- Ability to capture system-wide metrics, where the captured reports can be mapped back to individual contexts.
- Ability to inject tags for work into the reports. This provides visibility into the multiple stages of work within a single context.

This framework may also be seen as a way to overcome a limitation of Haswell, which doesn't write out a context ID with OA reports. Handling this in the kernel makes sense when we plan for compatibility with Broadwell, which does include a context ID in reports.

This can be achieved by inserting commands into the ring to dump OA counter snapshots at some asynchronous points during workload execution. The reports generated can have an additional footer appended to capture metadata such as ctx id, pid, tags, etc.
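To make the footer idea concrete, here is a minimal sketch of how a raw report and its metadata footer might be laid out together. The struct names, the field order, the 32-bit field widths and the 256-byte report size are all assumptions made for illustration; the actual layout is whatever the patches define.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: one raw OA report is 256 bytes; the real size depends on the
 * OA report format selected when the event is opened. */
#define OA_REPORT_SIZE 256u

/* Hypothetical footer layout; the series may order or size these differently. */
struct oa_async_footer {
	uint32_t ctx_id;  /* context that submitted the work */
	uint32_t pid;     /* process owning that context */
	uint32_t tag;     /* userspace-supplied tag */
};

/* One forwarded sample: the HW report followed by the metadata footer. */
struct oa_async_sample {
	uint8_t report[OA_REPORT_SIZE];
	struct oa_async_footer footer;
};

/*
 * Copy a raw report into dst and append the footer.
 * Returns the number of bytes written, or 0 if dst is too small.
 */
size_t oa_pack_async_sample(void *dst, size_t dst_len, const uint8_t *report,
			    uint32_t ctx_id, uint32_t pid, uint32_t tag)
{
	struct oa_async_sample sample;

	if (dst_len < sizeof(sample))
		return 0;

	memcpy(sample.report, report, OA_REPORT_SIZE);
	sample.footer.ctx_id = ctx_id;
	sample.footer.pid = pid;
	sample.footer.tag = tag;

	memcpy(dst, &sample, sizeof(sample));
	return sizeof(sample);
}
```

Keeping the HW report untouched at the front means a consumer that only understands plain OA reports could still read the first OA_REPORT_SIZE bytes of each sample.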
The specific issue of counter wraparound due to large batchbuffers can be avoided by using these snapshots in conjunction with periodic OA snapshots. Such per-BB data gives userspace tools useful information for analyzing performance and timing at batchbuffer level.

An application intending to profile its own contexts can do so by submitting MI_REPORT_PERF_COUNT commands into the CS from userspace itself. But consider the use case of a system-wide GPU profiler tool, which needs data for all the workloads being scheduled on the GPU globally. Doing this in the kernel is significantly less complex than supporting such a use case through userspace.

This framework is intended to serve the requirements of such system-wide GPU profilers, which may use this data for purposes such as performance analysis (at a global level), identifying optimization opportunities for improving GPU utilization, CPU vs GPU timing analysis, etc. Again, this is made possible by the presence of metadata with individual reports, which this framework enables.

One such system-wide GPU profiler is the MVP (Modular Video Profiler) tool, used by the media team for profiling media workloads. (Talks are in progress for open sourcing this tool.)

The current implementation approach is to forward these samples through the same PERF_SAMPLE_RAW sample type used for periodic samples, with an additional footer appended for metadata. Userspace can then distinguish these samples by filtering on the basis of sample size.

One of the other approaches being contemplated is creating separate sample types to handle these different kinds of samples. There would be different fds associated with these different sample types, though they can be part of one event group. Userspace can listen to either or both of these sample types by specifying event attributes during event init.
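The size-based filtering mentioned above could look roughly like the following on the userspace side. This is only a sketch: the report and footer sizes are assumed values, the enum and function names are made up for illustration, and the comparison uses thresholds rather than equality on the assumption that the raw payload may carry alignment padding.

```c
#include <stdint.h>

/* Assumed sizes, for illustration only. */
#define OA_REPORT_SIZE 256u  /* one raw OA report */
#define OA_FOOTER_SIZE 12u   /* hypothetical metadata footer */

enum oa_sample_kind {
	OA_SAMPLE_UNKNOWN,   /* too small to hold a full report */
	OA_SAMPLE_PERIODIC,  /* report only */
	OA_SAMPLE_ASYNC,     /* report plus metadata footer */
};

/*
 * Classify a PERF_SAMPLE_RAW payload by its size field alone.
 * Thresholds are used instead of exact matches so that a few bytes of
 * trailing padding do not change the result.
 */
enum oa_sample_kind oa_classify_raw(uint32_t raw_size)
{
	if (raw_size >= OA_REPORT_SIZE + OA_FOOTER_SIZE)
		return OA_SAMPLE_ASYNC;
	if (raw_size >= OA_REPORT_SIZE)
		return OA_SAMPLE_PERIODIC;
	return OA_SAMPLE_UNKNOWN;
}
```

The weakness of this scheme is that it ties the sample kind to a size that depends on the chosen report format, which is part of the motivation for considering separate sample types.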
But right now, I see separate sample types as a future refinement, contingent on acceptance of the general framework. For now, I'm looking for feedback on these initial patches with respect to the usage of the perf APIs and the interaction with i915.

Another feature introduced in these patches is perftag. Perftag is a mechanism whereby the collected reports are marked with a perftag passed by userspace during the execbuffer call. This way the userspace tool can associate the collected reports with the corresponding execbuffers. This satisfies the requirement to have visibility into multiple stages (i.e. execbuffers) within a single context. For example, in the media pipeline, the CodecHAL encoding stage has a single context and involves multiple stages such as Scaling, ME, MBEnc and PAK, for which there are separate execbuffer calls. The reports generated need the granularity of these multiple stages of a context, and the presence of a perftag in the report metadata fulfills this requirement. This is currently done by using the rsvd2 field of the execbuffer ioctl structure and introducing an additional bitfield in flags to inform the KMD of the same.

One of the prerequisites for this work is the presence of a globally unique context id. The context id is currently specific to a drm file instance. As such, it can't be used to uniquely associate the generated reports with the corresponding context scheduled from userspace in a global way. In the absence of a globally unique context id, other metadata such as pid/tags in conjunction with ctx id may be used to associate reports with their corresponding contexts.

The first patch in the series proposes a way of implementing a globally unique context id. I'm looking for comments on the pros & cons of having a global ctx id. The implementation can be refined if this approach is acceptable. The subsequent patches introduce the asynchronous OA capture mode and the mechanism to forward these snapshots using perf.
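As an illustration of the perftag plumbing described above (a tag carried in rsvd2, gated by a bit in flags), here is a minimal sketch. The struct is a stand-in for the relevant fields of the execbuffer ioctl structure, and the flag name, its bit position and the 32-bit tag width are hypothetical; the real definitions are in the uapi header changes of this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the relevant fields of the execbuffer ioctl structure. */
struct execbuf_sketch {
	uint64_t flags;
	uint64_t rsvd1;
	uint64_t rsvd2;  /* carries the perftag when the flag below is set */
};

/* Hypothetical flag bit; the actual bit is defined by the patch. */
#define SKETCH_EXEC_PERFTAG (1ull << 20)

/* Userspace side: attach a tag to one execbuffer call. */
void execbuf_set_perftag(struct execbuf_sketch *eb, uint32_t tag)
{
	eb->rsvd2 = tag;
	eb->flags |= SKETCH_EXEC_PERFTAG;
}

/*
 * Kernel side: read the tag back only if userspace flagged it as valid.
 * Returns false, leaving *tag untouched, when no tag was supplied.
 */
bool execbuf_get_perftag(const struct execbuf_sketch *eb, uint32_t *tag)
{
	if (!(eb->flags & SKETCH_EXEC_PERFTAG))
		return false;

	*tag = (uint32_t)eb->rsvd2;
	return true;
}
```

Gating on the flag lets existing userspace that leaves rsvd2 at zero keep working unchanged, since the field is only interpreted when the caller opts in.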
This patch set currently supports Haswell. Gen8+ support can be added when the basic framework is agreed upon.

Sourab Gupta (8):
  drm/i915: Have globally unique context ids, as opposed to drm file specific
  drm/i915: Introduce mode for asynchronous capture of OA counters
  drm/i915: Add the data structures for async OA capture mode
  drm/i915: Add mechanism for forwarding async OA counter snapshots through perf
  drm/i915: Wait for GPU to finish before event stop, in async OA counter mode
  drm/i915: Routines for inserting OA capture commands in the ringbuffer
  drm/i915: Add commands in ringbuf for OA snapshot capture across Batchbuffer boundaries
  drm/i915: Add perfTag support for OA counter reports

 drivers/gpu/drm/i915/i915_debugfs.c        |   2 +-
 drivers/gpu/drm/i915/i915_dma.c            |   1 +
 drivers/gpu/drm/i915/i915_drv.h            |  47 ++-
 drivers/gpu/drm/i915/i915_gem_context.c    |  53 +++-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   9 +
 drivers/gpu/drm/i915/i915_oa_perf.c        | 451 +++++++++++++++++++++++++++--
 include/uapi/drm/i915_drm.h                |  24 +-
 7 files changed, 538 insertions(+), 49 deletions(-)

-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx