On 25 April 2017 at 18:10, Lionel Landwerlin <lionel.g.landwerlin@xxxxxxxxx> wrote: > On 25/04/17 09:42, Matthew Auld wrote: >> >> On 24 April 2017 at 19:49, Lionel Landwerlin >> <lionel.g.landwerlin@xxxxxxxxx> wrote: >>> >>> From: Robert Bragg <robert@xxxxxxxxxxxxx> >>> >>> Enables access to OA unit metrics for BDW, CHV, SKL and BXT which all >>> share (more-or-less) the same OA unit design. >>> >>> Of particular note in comparison to Haswell: some OA unit HW config >>> state has become per-context state and as a consequence it is somewhat >>> more complicated to manage synchronous state changes from the cpu while >>> there's no guarantee of what context (if any) is currently actively >>> running on the gpu. >>> >>> The periodic sampling frequency which can be particularly useful for >>> system-wide analysis (as opposed to command stream synchronised >>> MI_REPORT_PERF_COUNT commands) is perhaps the most surprising state to >>> have become per-context saved and restored (while the OABUFFER >>> destination is still a shared, system-wide resource). >>> >>> This support for gen8+ takes care to consider a number of timing >>> challenges involved in synchronously updating per-context state >>> primarily by programming all config state from the cpu and updating all >>> current and saved contexts synchronously while the OA unit is still >>> disabled. >>> >>> The driver intentionally avoids depending on command streamer >>> programming to update OA state considering the lack of synchronization >>> between the automatic loading of OACTXCONTROL state (that includes the >>> periodic sampling state and enable state) on context restore and the >>> parsing of any general purpose BB the driver can control. I.e. this >>> implementation is careful to avoid the possibility of a context restore >>> temporarily enabling any out-of-date periodic sampling state. In >>> addition to the risk of transiently-out-of-date state being loaded >>> automatically, there are also internal HW latencies involved in the >>> loading of MUX configurations which would be difficult to account for >>> from the command streamer (and we only want to enable the unit once >>> the MUX configuration is complete). >>> >>> Since the Gen8+ OA unit design no longer supports clock gating the unit >>> off for a single given context (which effectively stopped any progress >>> of counters while any other context was running) and instead supports >>> tagging OA reports with a context ID for filtering on the CPU, it means >>> we can no longer hide the system-wide progress of counters from a >>> non-privileged application only interested in metrics for its own >>> context. Although we could theoretically try and subtract the progress >>> of other contexts before forwarding reports via read() we aren't in a >>> position to filter reports captured via MI_REPORT_PERF_COUNT commands. >>> As a result, for Gen8+, we always require the >>> dev.i915.perf_stream_paranoid to be unset for any access to OA metrics >>> if not root.
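
For reference, the sysctl mentioned above gates the whole uapi: a non-root client can only open a system-wide OA stream once dev.i915.perf_stream_paranoid is cleared. Below is a rough userspace sketch (not part of this patch) of such an open call through DRM_IOCTL_I915_PERF_OPEN; the metrics set ID and sampling exponent are placeholder values, the helper name is arbitrary, and the OA format used is the Gen8+ one added by this series.

/*
 * Rough userspace sketch (not part of the patch): opening a system-wide
 * Gen8+ OA stream.  Without CAP_SYS_ADMIN this only succeeds once
 * dev.i915.perf_stream_paranoid is cleared, as described above.  The
 * metrics set ID and exponent are placeholders; real set IDs come from
 * sysfs (/sys/class/drm/cardN/metrics/<uuid>/id).
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int open_oa_stream(int drm_fd)
{
	uint64_t properties[] = {
		/* include the raw OA reports in read() samples */
		DRM_I915_PERF_PROP_SAMPLE_OA, 1,
		/* which MUX/B-counter config to use (placeholder ID) */
		DRM_I915_PERF_PROP_OA_METRICS_SET, 1,
		/* 256 byte Gen8+ report format added by this series */
		DRM_I915_PERF_PROP_OA_FORMAT, I915_OA_FORMAT_A32u40_A4u32_B8_C8,
		/* periodic sampling frequency exponent (placeholder) */
		DRM_I915_PERF_PROP_OA_EXPONENT, 16,
	};
	struct drm_i915_perf_open_param param = {
		.flags = I915_PERF_FLAG_FD_CLOEXEC | I915_PERF_FLAG_FD_NONBLOCK,
		.num_properties = sizeof(properties) / (2 * sizeof(uint64_t)),
		.properties_ptr = (uintptr_t)properties,
	};

	/* returns a new stream fd on success, -1 with errno set on error */
	return ioctl(drm_fd, DRM_IOCTL_I915_PERF_OPEN, &param);
}

The returned fd is then poll()ed and read() for OA records in the usual way.
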
>>> >>> v5: Drain submitted requests when enabling metric set to ensure no >>> lite-restore erases the context image we just updated (Lionel) >>> >>> v6: In addition to drain, switch to kernel context & update all >>> context in place (Chris) >>> >>> Signed-off-by: Robert Bragg <robert@xxxxxxxxxxxxx> >>> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@xxxxxxxxx> >>> Reviewed-by: Matthew Auld <matthew.auld@xxxxxxxxx> \o/ >>> --- >>> drivers/gpu/drm/i915/i915_drv.h | 45 +- >>> drivers/gpu/drm/i915/i915_perf.c | 957 >>> ++++++++++++++++++++++++++++++++++++--- >>> drivers/gpu/drm/i915/i915_reg.h | 22 + >>> drivers/gpu/drm/i915/intel_lrc.c | 2 + >>> include/uapi/drm/i915_drm.h | 19 +- >>> 5 files changed, 952 insertions(+), 93 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/i915/i915_drv.h >>> b/drivers/gpu/drm/i915/i915_drv.h >>> index ffa1fc5eddfd..676b1227067c 100644 >>> --- a/drivers/gpu/drm/i915/i915_drv.h >>> +++ b/drivers/gpu/drm/i915/i915_drv.h >>> @@ -2064,9 +2064,17 @@ struct i915_oa_ops { >>> void (*init_oa_buffer)(struct drm_i915_private *dev_priv); >>> >>> /** >>> - * @enable_metric_set: Applies any MUX configuration to set up >>> the >>> - * Boolean and Custom (B/C) counters that are part of the counter >>> - * reports being sampled. May apply system constraints such as >>> + * @select_metric_set: The auto generated code that checks >>> whether a >>> + * requested OA config is applicable to the system and if so sets >>> up >>> + * the mux, oa and flex eu register config pointers according to >>> the >>> + * current dev_priv->perf.oa.metrics_set. >>> + */ >>> + int (*select_metric_set)(struct drm_i915_private *dev_priv); >>> + >>> + /** >>> + * @enable_metric_set: Selects and applies any MUX configuration >>> to set >>> + * up the Boolean and Custom (B/C) counters that are part of the >>> + * counter reports being sampled. May apply system constraints >>> such as >>> * disabling EU clock gating as required. >>> */ >>> int (*enable_metric_set)(struct drm_i915_private *dev_priv); >>> @@ -2097,20 +2105,13 @@ struct i915_oa_ops { >>> size_t *offset); >>> >>> /** >>> - * @oa_buffer_check: Check for OA buffer data + update tail >>> - * >>> - * This is either called via fops or the poll check hrtimer >>> (atomic >>> - * ctx) without any locks taken. >>> + * @oa_hw_tail_read: read the OA tail pointer register >>> * >>> - * It's safe to read OA config state here unlocked, assuming that >>> this >>> - * is only called while the stream is enabled, while the global >>> OA >>> - * configuration can't be modified. >>> - * >>> - * Efficiency is more important than avoiding some false >>> positives >>> - * here, which will be handled gracefully - likely resulting in >>> an >>> - * %EAGAIN error for userspace. >>> + * In particular this enables us to share all the fiddly code for >>> + * handling the OA unit tail pointer race that affects multiple >>> + * generations. 
>>> */ >>> - bool (*oa_buffer_check)(struct drm_i915_private *dev_priv); >>> + u32 (*oa_hw_tail_read)(struct drm_i915_private *dev_priv); >>> }; >>> >>> struct intel_cdclk_state { >>> @@ -2472,6 +2473,7 @@ struct drm_i915_private { >>> struct { >>> struct i915_vma *vma; >>> u8 *vaddr; >>> + u32 last_ctx_id; >>> int format; >>> int format_size; >>> >>> @@ -2541,6 +2543,14 @@ struct drm_i915_private { >>> } oa_buffer; >>> >>> u32 gen7_latched_oastatus1; >>> + u32 ctx_oactxctrl_offset; >>> + u32 ctx_flexeu0_offset; >>> + >>> + /* The RPT_ID/reason field for Gen8+ includes a >>> bit >>> + * to determine if the CTX ID in the report is >>> valid >>> + * but the specific bit differs between Gen 8 and >>> 9 >>> + */ >>> + u32 gen8_valid_ctx_bit; >>> >>> struct i915_oa_ops ops; >>> const struct i915_oa_format *oa_formats; >>> @@ -2851,6 +2861,8 @@ intel_info(const struct drm_i915_private *dev_priv) >>> #define IS_KBL_ULX(dev_priv) (INTEL_DEVID(dev_priv) == 0x590E || \ >>> INTEL_DEVID(dev_priv) == 0x5915 || \ >>> INTEL_DEVID(dev_priv) == 0x591E) >>> +#define IS_SKL_GT2(dev_priv) (IS_SKYLAKE(dev_priv) && \ >>> + (INTEL_DEVID(dev_priv) & 0x00F0) == >>> 0x0010) >>> #define IS_SKL_GT3(dev_priv) (IS_SKYLAKE(dev_priv) && \ >>> (INTEL_DEVID(dev_priv) & 0x00F0) == >>> 0x0020) >>> #define IS_SKL_GT4(dev_priv) (IS_SKYLAKE(dev_priv) && \ >>> @@ -3615,6 +3627,9 @@ i915_gem_context_lookup_timeline(struct >>> i915_gem_context *ctx, >>> >>> int i915_perf_open_ioctl(struct drm_device *dev, void *data, >>> struct drm_file *file); >>> +void i915_oa_init_reg_state(struct intel_engine_cs *engine, >>> + struct i915_gem_context *ctx, >>> + uint32_t *reg_state); >>> >>> /* i915_gem_evict.c */ >>> int __must_check i915_gem_evict_something(struct i915_address_space >>> *vm, >>> diff --git a/drivers/gpu/drm/i915/i915_perf.c >>> b/drivers/gpu/drm/i915/i915_perf.c >>> index 3277a52ce98e..f2688dee5a03 100644 >>> --- a/drivers/gpu/drm/i915/i915_perf.c >>> +++ b/drivers/gpu/drm/i915/i915_perf.c >>> @@ -196,6 +196,12 @@ >>> >>> #include "i915_drv.h" >>> #include "i915_oa_hsw.h" >>> +#include "i915_oa_bdw.h" >>> +#include "i915_oa_chv.h" >>> +#include "i915_oa_sklgt2.h" >>> +#include "i915_oa_sklgt3.h" >>> +#include "i915_oa_sklgt4.h" >>> +#include "i915_oa_bxt.h" >>> >>> /* HW requires this to be a power of two, between 128k and 16M, though >>> driver >>> * is currently generally designed assuming the largest 16M size is >>> used such >>> @@ -215,7 +221,7 @@ >>> * >>> * Although this can be observed explicitly while copying reports to >>> userspace >>> * by checking for a zeroed report-id field in tail reports, we want to >>> account >>> - * for this earlier, as part of the _oa_buffer_check to avoid lots of >>> redundant >>> + * for this earlier, as part of the oa_buffer_check to avoid lots of >>> redundant >>> * read() attempts. >>> * >>> * In effect we define a tail pointer for reading that lags the real >>> tail >>> @@ -237,7 +243,7 @@ >>> * indicates that an updated tail pointer is needed. 
>>> * >>> * Most of the implementation details for this workaround are in >>> - * gen7_oa_buffer_check_unlocked() and gen7_appand_oa_reports() >>> + * oa_buffer_check_unlocked() and _append_oa_reports() >>> * >>> * Note for posterity: previously the driver used to define an >>> effective tail >>> * pointer that lagged the real pointer by a 'tail margin' measured in >>> bytes >>> @@ -272,6 +278,13 @@ static u32 i915_perf_stream_paranoid = true; >>> >>> #define INVALID_CTX_ID 0xffffffff >>> >>> +/* On Gen8+ automatically triggered OA reports include a 'reason' >>> field... */ >>> +#define OAREPORT_REASON_MASK 0x3f >>> +#define OAREPORT_REASON_SHIFT 19 >>> +#define OAREPORT_REASON_TIMER (1<<0) >>> +#define OAREPORT_REASON_CTX_SWITCH (1<<3) >>> +#define OAREPORT_REASON_CLK_RATIO (1<<5) >>> + >>> >>> /* For sysctl proc_dointvec_minmax of i915_oa_max_sample_rate >>> * >>> @@ -303,6 +316,13 @@ static struct i915_oa_format >>> hsw_oa_formats[I915_OA_FORMAT_MAX] = { >>> [I915_OA_FORMAT_C4_B8] = { 7, 64 }, >>> }; >>> >>> +static struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = >>> { >>> + [I915_OA_FORMAT_A12] = { 0, 64 }, >>> + [I915_OA_FORMAT_A12_B8_C8] = { 2, 128 }, >>> + [I915_OA_FORMAT_A32u40_A4u32_B8_C8] = { 5, 256 }, >>> + [I915_OA_FORMAT_C4_B8] = { 7, 64 }, >>> +}; >>> + >>> #define SAMPLE_OA_REPORT (1<<0) >>> >>> /** >>> @@ -332,8 +352,20 @@ struct perf_open_properties { >>> int oa_period_exponent; >>> }; >>> >>> +static u32 gen8_oa_hw_tail_read(struct drm_i915_private *dev_priv) >>> +{ >>> + return I915_READ(GEN8_OATAILPTR) & GEN8_OATAILPTR_MASK; >>> +} >>> + >>> +static u32 gen7_oa_hw_tail_read(struct drm_i915_private *dev_priv) >>> +{ >>> + u32 oastatus1 = I915_READ(GEN7_OASTATUS1); >>> + >>> + return oastatus1 & GEN7_OASTATUS1_TAIL_MASK; >>> +} >>> + >>> /** >>> - * gen7_oa_buffer_check_unlocked - check for data and update tail ptr >>> state >>> + * oa_buffer_check_unlocked - check for data and update tail ptr state >>> * @dev_priv: i915 device instance >>> * >>> * This is either called via fops (for blocking reads in user ctx) or >>> the poll >>> @@ -356,12 +388,11 @@ struct perf_open_properties { >>> * >>> * Returns: %true if the OA buffer contains data, else %false >>> */ >>> -static bool gen7_oa_buffer_check_unlocked(struct drm_i915_private >>> *dev_priv) >>> +static bool oa_buffer_check_unlocked(struct drm_i915_private *dev_priv) >>> { >>> int report_size = dev_priv->perf.oa.oa_buffer.format_size; >>> unsigned long flags; >>> unsigned int aged_idx; >>> - u32 oastatus1; >>> u32 head, hw_tail, aged_tail, aging_tail; >>> u64 now; >>> >>> @@ -381,8 +412,7 @@ static bool gen7_oa_buffer_check_unlocked(struct >>> drm_i915_private *dev_priv) >>> aged_tail = dev_priv->perf.oa.oa_buffer.tails[aged_idx].offset; >>> aging_tail = >>> dev_priv->perf.oa.oa_buffer.tails[!aged_idx].offset; >>> >>> - oastatus1 = I915_READ(GEN7_OASTATUS1); >>> - hw_tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK; >>> + hw_tail = dev_priv->perf.oa.ops.oa_hw_tail_read(dev_priv); >>> >>> /* The tail pointer increases in 64 byte increments, >>> * not in report_size steps... 
>>> @@ -404,6 +434,7 @@ static bool gen7_oa_buffer_check_unlocked(struct >>> drm_i915_private *dev_priv) >>> if (aging_tail != INVALID_TAIL_PTR && >>> ((now - dev_priv->perf.oa.oa_buffer.aging_timestamp) > >>> OA_TAIL_MARGIN_NSEC)) { >>> + >>> aged_idx ^= 1; >>> dev_priv->perf.oa.oa_buffer.aged_tail_idx = aged_idx; >>> >>> @@ -553,6 +584,287 @@ static int append_oa_sample(struct i915_perf_stream >>> *stream, >>> * >>> * Returns: 0 on success, negative error code on failure. >>> */ >>> +static int gen8_append_oa_reports(struct i915_perf_stream *stream, >>> + char __user *buf, >>> + size_t count, >>> + size_t *offset) >>> +{ >>> + struct drm_i915_private *dev_priv = stream->dev_priv; >>> + int report_size = dev_priv->perf.oa.oa_buffer.format_size; >>> + u8 *oa_buf_base = dev_priv->perf.oa.oa_buffer.vaddr; >>> + u32 gtt_offset = >>> i915_ggtt_offset(dev_priv->perf.oa.oa_buffer.vma); >>> + u32 mask = (OA_BUFFER_SIZE - 1); >>> + size_t start_offset = *offset; >>> + unsigned long flags; >>> + unsigned int aged_tail_idx; >>> + u32 head, tail; >>> + u32 taken; >>> + int ret = 0; >>> + >>> + if (WARN_ON(!stream->enabled)) >>> + return -EIO; >>> + >>> + spin_lock_irqsave(&dev_priv->perf.oa.oa_buffer.ptr_lock, flags); >>> + >>> + head = dev_priv->perf.oa.oa_buffer.head; >>> + aged_tail_idx = dev_priv->perf.oa.oa_buffer.aged_tail_idx; >>> + tail = dev_priv->perf.oa.oa_buffer.tails[aged_tail_idx].offset; >>> + >>> + spin_unlock_irqrestore(&dev_priv->perf.oa.oa_buffer.ptr_lock, >>> flags); >>> + >>> + /* An invalid tail pointer here means we're still waiting for the >>> poll >>> + * hrtimer callback to give us a pointer >>> + */ >>> + if (tail == INVALID_TAIL_PTR) >>> + return -EAGAIN; >>> + >>> + /* NB: oa_buffer.head/tail include the gtt_offset which we don't >>> want >>> + * while indexing relative to oa_buf_base. >>> + */ >>> + head -= gtt_offset; >>> + tail -= gtt_offset; >>> + >>> + /* An out of bounds or misaligned head or tail pointer implies a >>> driver >>> + * bug since we validate + align the tail pointers we read from >>> the >>> + * hardware and we are in full control of the head pointer which >>> should >>> + * only be incremented by multiples of the report size (notably >>> also >>> + * all a power of two). >>> + */ >>> + if (WARN_ONCE(head > OA_BUFFER_SIZE || head % report_size || >>> + tail > OA_BUFFER_SIZE || tail % report_size, >>> + "Inconsistent OA buffer pointers: head = %u, tail = >>> %u\n", >>> + head, tail)) >>> + return -EIO; >>> + >>> + >>> + for (/* none */; >>> + (taken = OA_TAKEN(tail, head)); >>> + head = (head + report_size) & mask) { >>> + u8 *report = oa_buf_base + head; >>> + u32 *report32 = (void *)report; >>> + u32 ctx_id; >>> + u32 reason; >>> + >>> + /* All the report sizes factor neatly into the buffer >>> + * size so we never expect to see a report split >>> + * between the beginning and end of the buffer. >>> + * >>> + * Given the initial alignment check a misalignment >>> + * here would imply a driver bug that would result >>> + * in an overrun. >>> + */ >>> + if (WARN_ON((OA_BUFFER_SIZE - head) < report_size)) { >>> + DRM_ERROR("Spurious OA head ptr: non-integral >>> report offset\n"); >>> + break; >>> + } >>> + >>> + /* The reason field includes flags identifying what >>> + * triggered this specific report (mostly timer >>> + * triggered or e.g. due to a context switch). >>> + * >>> + * This field is never expected to be zero so we can >>> + * check that the report isn't invalid before copying >>> + * it to userspace... 
>>> + */ >>> + reason = ((report32[0] >> OAREPORT_REASON_SHIFT) & >>> + OAREPORT_REASON_MASK); >>> + if (reason == 0) { >>> + if >>> (__ratelimit(&dev_priv->perf.oa.spurious_report_rs)) >>> + DRM_NOTE("Skipping spurious, invalid OA >>> report\n"); >>> + continue; >>> + } >>> + >>> + /* XXX: Just keep the lower 21 bits for now since I'm not >>> + * entirely sure if the HW touches any of the higher bits >>> in >>> + * this field >>> + */ >>> + ctx_id = report32[2] & 0x1fffff; >>> + >>> + /* Squash whatever is in the CTX_ID field if it's marked >>> as >>> + * invalid to be sure we avoid false-positive, >>> single-context >>> + * filtering below... >>> + * >>> + * Note: that we don't clear the valid_ctx_bit so >>> userspace can >>> + * understand that the ID has been squashed by the >>> kernel. >>> + */ >>> + if (!(report32[0] & >>> dev_priv->perf.oa.gen8_valid_ctx_bit)) >>> + ctx_id = report32[2] = INVALID_CTX_ID; >>> + >>> + /* NB: For Gen 8 the OA unit no longer supports clock >>> gating >>> + * off for a specific context and the kernel can't >>> securely >>> + * stop the counters from updating as system-wide / >>> global >>> + * values. >>> + * >>> + * Automatic reports now include a context ID so reports >>> can be >>> + * filtered on the cpu but it's not worth trying to >>> + * automatically subtract/hide counter progress for other >>> + * contexts while filtering since we can't stop userspace >>> + * issuing MI_REPORT_PERF_COUNT commands which would >>> still >>> + * provide a side-band view of the real values. >>> + * >>> + * To allow userspace (such as >>> Mesa/GL_INTEL_performance_query) >>> + * to normalize counters for a single filtered context >>> then it >>> + * needs be forwarded bookend context-switch reports so >>> that it >>> + * can track switches in between MI_REPORT_PERF_COUNT >>> commands >>> + * and can itself subtract/ignore the progress of >>> counters >>> + * associated with other contexts. Note that the hardware >>> + * automatically triggers reports when switching to a new >>> + * context which are tagged with the ID of the newly >>> active >>> + * context. To avoid the complexity (and likely >>> fragility) of >>> + * reading ahead while parsing reports to try and >>> minimize >>> + * forwarding redundant context switch reports (i.e. >>> between >>> + * other, unrelated contexts) we simply elect to forward >>> them >>> + * all. >>> + * >>> + * We don't rely solely on the reason field to identify >>> context >>> + * switches since it's not-uncommon for periodic samples >>> to >>> + * identify a switch before any 'context switch' report. >>> + */ >>> + if (!dev_priv->perf.oa.exclusive_stream->ctx || >>> + dev_priv->perf.oa.specific_ctx_id == ctx_id || >>> + (dev_priv->perf.oa.oa_buffer.last_ctx_id == >>> + dev_priv->perf.oa.specific_ctx_id) || >>> + reason & OAREPORT_REASON_CTX_SWITCH) { >>> + >>> + /* While filtering for a single context we avoid >>> + * leaking the IDs of other contexts. 
>>> + */ >>> + if (dev_priv->perf.oa.exclusive_stream->ctx && >>> + dev_priv->perf.oa.specific_ctx_id != ctx_id) >>> { >>> + report32[2] = INVALID_CTX_ID; >>> + } >>> + >>> + ret = append_oa_sample(stream, buf, count, >>> offset, >>> + report); >>> + if (ret) >>> + break; >>> + >>> + dev_priv->perf.oa.oa_buffer.last_ctx_id = ctx_id; >>> + } >>> + >>> + /* The above reason field sanity check is based on >>> + * the assumption that the OA buffer is initially >>> + * zeroed and we reset the field after copying so the >>> + * check is still meaningful once old reports start >>> + * being overwritten. >>> + */ >>> + report32[0] = 0; >>> + } >>> + >>> + if (start_offset != *offset) { >>> + spin_lock_irqsave(&dev_priv->perf.oa.oa_buffer.ptr_lock, >>> flags); >>> + >>> + /* We removed the gtt_offset for the copy loop above, >>> indexing >>> + * relative to oa_buf_base so put back here... >>> + */ >>> + head += gtt_offset; >>> + >>> + I915_WRITE(GEN8_OAHEADPTR, head & GEN8_OAHEADPTR_MASK); >>> + dev_priv->perf.oa.oa_buffer.head = head; >>> + >>> + >>> spin_unlock_irqrestore(&dev_priv->perf.oa.oa_buffer.ptr_lock, flags); >>> + } >>> + >>> + return ret; >>> +} >>> + >>> +/** >>> + * gen8_oa_read - copy status records then buffered OA reports >>> + * @stream: An i915-perf stream opened for OA metrics >>> + * @buf: destination buffer given by userspace >>> + * @count: the number of bytes userspace wants to read >>> + * @offset: (inout): the current position for writing into @buf >>> + * >>> + * Checks OA unit status registers and if necessary appends >>> corresponding >>> + * status records for userspace (such as for a buffer full condition) >>> and then >>> + * initiate appending any buffered OA reports. >>> + * >>> + * Updates @offset according to the number of bytes successfully copied >>> into >>> + * the userspace buffer. >>> + * >>> + * NB: some data may be successfully copied to the userspace buffer >>> + * even if an error is returned, and this is reflected in the >>> + * updated @offset. >>> + * >>> + * Returns: zero on success or a negative error code >>> + */ >>> +static int gen8_oa_read(struct i915_perf_stream *stream, >>> + char __user *buf, >>> + size_t count, >>> + size_t *offset) >>> +{ >>> + struct drm_i915_private *dev_priv = stream->dev_priv; >>> + u32 oastatus; >>> + int ret; >>> + >>> + if (WARN_ON(!dev_priv->perf.oa.oa_buffer.vaddr)) >>> + return -EIO; >>> + >>> + oastatus = I915_READ(GEN8_OASTATUS); >>> + >>> + /* We treat OABUFFER_OVERFLOW as a significant error: >>> + * >>> + * Although theoretically we could handle this more gracefully >>> + * sometimes, some Gens don't correctly suppress certain >>> + * automatically triggered reports in this condition and so we >>> + * have to assume that old reports are now being trampled >>> + * over. >>> + * >>> + * Considering how we don't currently give userspace control >>> + * over the OA buffer size and always configure a large 16MB >>> + * buffer, then a buffer overflow does anyway likely indicate >>> + * that something has gone quite badly wrong. 
>>> + */ >>> + if (oastatus & GEN8_OASTATUS_OABUFFER_OVERFLOW) { >>> + ret = append_oa_status(stream, buf, count, offset, >>> + >>> DRM_I915_PERF_RECORD_OA_BUFFER_LOST); >>> + if (ret) >>> + return ret; >>> + >>> + DRM_DEBUG("OA buffer overflow (exponent = %d): force >>> restart\n", >>> + dev_priv->perf.oa.period_exponent); >>> + >>> + dev_priv->perf.oa.ops.oa_disable(dev_priv); >>> + dev_priv->perf.oa.ops.oa_enable(dev_priv); >>> + >>> + /* Note: .oa_enable() is expected to re-init the oabuffer >>> + * and reset GEN8_OASTATUS for us >>> + */ >>> + oastatus = I915_READ(GEN8_OASTATUS); >>> + } >>> + >>> + if (oastatus & GEN8_OASTATUS_REPORT_LOST) { >>> + ret = append_oa_status(stream, buf, count, offset, >>> + >>> DRM_I915_PERF_RECORD_OA_REPORT_LOST); >>> + if (ret) >>> + return ret; >>> + I915_WRITE(GEN8_OASTATUS, >>> + oastatus & ~GEN8_OASTATUS_REPORT_LOST); >>> + } >>> + >>> + return gen8_append_oa_reports(stream, buf, count, offset); >>> +} >>> + >>> +/** >>> + * Copies all buffered OA reports into userspace read() buffer. >>> + * @stream: An i915-perf stream opened for OA metrics >>> + * @buf: destination buffer given by userspace >>> + * @count: the number of bytes userspace wants to read >>> + * @offset: (inout): the current position for writing into @buf >>> + * >>> + * Notably any error condition resulting in a short read (-%ENOSPC or >>> + * -%EFAULT) will be returned even though one or more records may >>> + * have been successfully copied. In this case it's up to the caller >>> + * to decide if the error should be squashed before returning to >>> + * userspace. >>> + * >>> + * Note: reports are consumed from the head, and appended to the >>> + * tail, so the tail chases the head?... If you think that's mad >>> + * and back-to-front you're not alone, but this follows the >>> + * Gen PRM naming convention. >>> + * >>> + * Returns: 0 on success, negative error code on failure. >>> + */ >>> static int gen7_append_oa_reports(struct i915_perf_stream *stream, >>> char __user *buf, >>> size_t count, >>> @@ -733,7 +1045,8 @@ static int gen7_oa_read(struct i915_perf_stream >>> *stream, >>> if (ret) >>> return ret; >>> >>> - DRM_DEBUG("OA buffer overflow: force restart\n"); >>> + DRM_DEBUG("OA buffer overflow (exponent = %d): force >>> restart\n", >>> + dev_priv->perf.oa.period_exponent); >>> >>> dev_priv->perf.oa.ops.oa_disable(dev_priv); >>> dev_priv->perf.oa.ops.oa_enable(dev_priv); >>> @@ -776,7 +1089,7 @@ static int i915_oa_wait_unlocked(struct >>> i915_perf_stream *stream) >>> return -EIO; >>> >>> return wait_event_interruptible(dev_priv->perf.oa.poll_wq, >>> - >>> dev_priv->perf.oa.ops.oa_buffer_check(dev_priv)); >>> + >>> oa_buffer_check_unlocked(dev_priv)); >>> } >>> >>> /** >>> @@ -833,33 +1146,37 @@ static int i915_oa_read(struct i915_perf_stream >>> *stream, >>> static int oa_get_render_ctx_id(struct i915_perf_stream *stream) >>> { >>> struct drm_i915_private *dev_priv = stream->dev_priv; >>> - struct intel_engine_cs *engine = dev_priv->engine[RCS]; >>> - int ret; >>> >>> - ret = i915_mutex_lock_interruptible(&dev_priv->drm); >>> - if (ret) >>> - return ret; >>> + if (i915.enable_execlists) >>> + dev_priv->perf.oa.specific_ctx_id = stream->ctx->hw_id; >>> + else { >>> + struct intel_engine_cs *engine = dev_priv->engine[RCS]; >>> + int ret; >>> >>> - /* As the ID is the gtt offset of the context's vma we pin >>> - * the vma to ensure the ID remains fixed. >>> - * >>> - * NB: implied RCS engine... 
>>> - */ >>> - ret = engine->context_pin(engine, stream->ctx); >>> - if (ret) >>> - goto unlock; >>> + ret = i915_mutex_lock_interruptible(&dev_priv->drm); >>> + if (ret) >>> + return ret; >>> >>> - /* Explicitly track the ID (instead of calling i915_ggtt_offset() >>> - * on the fly) considering the difference with gen8+ and >>> - * execlists >>> - */ >>> - dev_priv->perf.oa.specific_ctx_id = >>> - i915_ggtt_offset(stream->ctx->engine[engine->id].state); >>> + /* As the ID is the gtt offset of the context's vma we >>> pin >>> + * the vma to ensure the ID remains fixed. >>> + */ >>> + ret = engine->context_pin(engine, stream->ctx); >>> + if (ret) { >>> + mutex_unlock(&dev_priv->drm.struct_mutex); >>> + return ret; >>> + } >>> >>> -unlock: >>> - mutex_unlock(&dev_priv->drm.struct_mutex); >>> + /* Explicitly track the ID (instead of calling >>> + * i915_ggtt_offset() on the fly) considering the >>> difference >>> + * with gen8+ and execlists >>> + */ >>> + dev_priv->perf.oa.specific_ctx_id = >>> + >>> i915_ggtt_offset(stream->ctx->engine[engine->id].state); >>> >>> - return ret; >>> + mutex_unlock(&dev_priv->drm.struct_mutex); >>> + } >>> + >>> + return 0; >>> } >>> >>> /** >>> @@ -874,12 +1191,16 @@ static void oa_put_render_ctx_id(struct >>> i915_perf_stream *stream) >>> struct drm_i915_private *dev_priv = stream->dev_priv; >>> struct intel_engine_cs *engine = dev_priv->engine[RCS]; >>> >>> - mutex_lock(&dev_priv->drm.struct_mutex); >>> + if (i915.enable_execlists) { >>> + dev_priv->perf.oa.specific_ctx_id = INVALID_CTX_ID; >>> + } else { >>> + mutex_lock(&dev_priv->drm.struct_mutex); >>> >>> - dev_priv->perf.oa.specific_ctx_id = INVALID_CTX_ID; >>> - engine->context_unpin(engine, stream->ctx); >>> + dev_priv->perf.oa.specific_ctx_id = INVALID_CTX_ID; >>> + engine->context_unpin(engine, stream->ctx); >>> >>> - mutex_unlock(&dev_priv->drm.struct_mutex); >>> + mutex_unlock(&dev_priv->drm.struct_mutex); >>> + } >>> } >>> >>> static void >>> @@ -969,6 +1290,61 @@ static void gen7_init_oa_buffer(struct >>> drm_i915_private *dev_priv) >>> dev_priv->perf.oa.pollin = false; >>> } >>> >>> +static void gen8_init_oa_buffer(struct drm_i915_private *dev_priv) >>> +{ >>> + u32 gtt_offset = >>> i915_ggtt_offset(dev_priv->perf.oa.oa_buffer.vma); >>> + unsigned long flags; >>> + >>> + spin_lock_irqsave(&dev_priv->perf.oa.oa_buffer.ptr_lock, flags); >>> + >>> + I915_WRITE(GEN8_OASTATUS, 0); >>> + I915_WRITE(GEN8_OAHEADPTR, gtt_offset); >>> + dev_priv->perf.oa.oa_buffer.head = gtt_offset; >>> + >>> + I915_WRITE(GEN8_OABUFFER_UDW, 0); >>> + >>> + /* PRM says: >>> + * >>> + * "This MMIO must be set before the OATAILPTR >>> + * register and after the OAHEADPTR register. This is >>> + * to enable proper functionality of the overflow >>> + * bit." >>> + */ >>> + I915_WRITE(GEN8_OABUFFER, gtt_offset | >>> + OABUFFER_SIZE_16M | OA_MEM_SELECT_GGTT); >>> + I915_WRITE(GEN8_OATAILPTR, gtt_offset & GEN8_OATAILPTR_MASK); >>> + >>> + /* Mark that we need updated tail pointers to read from... */ >>> + dev_priv->perf.oa.oa_buffer.tails[0].offset = INVALID_TAIL_PTR; >>> + dev_priv->perf.oa.oa_buffer.tails[1].offset = INVALID_TAIL_PTR; >>> + >>> + /* Reset state used to recognise context switches, affecting >>> which >>> + * reports we will forward to userspace while filtering for a >>> single >>> + * context. 
>>> + */ >>> + dev_priv->perf.oa.oa_buffer.last_ctx_id = INVALID_CTX_ID; >>> + >>> + spin_unlock_irqrestore(&dev_priv->perf.oa.oa_buffer.ptr_lock, >>> flags); >>> + >>> + /* NB: although the OA buffer will initially be allocated >>> + * zeroed via shmfs (and so this memset is redundant when >>> + * first allocating), we may re-init the OA buffer, either >>> + * when re-enabling a stream or in error/reset paths. >>> + * >>> + * The reason we clear the buffer for each re-init is for the >>> + * sanity check in gen8_append_oa_reports() that looks at the >>> + * reason field to make sure it's non-zero which relies on >>> + * the assumption that new reports are being written to zeroed >>> + * memory... >>> + */ >>> + memset(dev_priv->perf.oa.oa_buffer.vaddr, 0, OA_BUFFER_SIZE); >>> + >>> + /* Maybe make ->pollin per-stream state if we support multiple >>> + * concurrent streams in the future. >>> + */ >>> + dev_priv->perf.oa.pollin = false; >>> +} >>> + >>> static int alloc_oa_buffer(struct drm_i915_private *dev_priv) >>> { >>> struct drm_i915_gem_object *bo; >>> @@ -1113,6 +1489,280 @@ static void hsw_disable_metric_set(struct >>> drm_i915_private *dev_priv) >>> ~GT_NOA_ENABLE)); >>> } >>> >>> +/* >>> + * From Broadwell PRM, 3D-Media-GPGPU -> Register State Context >>> + * >>> + * MMIO reads or writes to any of the registers listed in the >>> + * “Register State Context image” subsections through HOST/IA >>> + * MMIO interface for debug purposes must follow the steps below: >>> + * >>> + * - SW should set the Force Wakeup bit to prevent GT from entering C6. >>> + * - Write 0x2050[31:0] = 0x00010001 (disable sequence). >>> + * - Disable IDLE messaging in CS (Write 0x2050[31:0] = 0x00010001). >>> + * - BDW: Poll/Wait for register bits of 0x22AC[6:0] turn to 0x30 >>> value. >>> + * - SKL+: Poll/Wait for register bits of 0x22A4[6:0] turn to 0x30 >>> value. >>> + * - Read/Write to desired MMIO registers. >>> + * - Enable IDLE messaging in CS (Write 0x2050[31:0] = 0x00010000). >>> + * - Force Wakeup bit should be reset to enable C6 entry. >>> + * >>> + * XXX: don't nest or overlap calls to this API, it has no ref >>> + * counting to track how many entities require the RCS to be >>> + * blocked from being idle. >>> + */ >>> +static int gen8_begin_ctx_mmio(struct drm_i915_private *dev_priv) >>> +{ >>> + i915_reg_t fsm_reg = INTEL_GEN(dev_priv) > 8 ? >>> + GEN9_RCS_FE_FSM2 : GEN6_RCS_PWR_FSM; >>> + int ret = 0; >>> + >>> + /* There's only no active context while idle in execlist mode >>> + * (though we shouldn't be using this in any other case) >>> + */ >>> + if (WARN_ON(!i915.enable_execlists)) >>> + return -EINVAL; >>> + >>> + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_RENDER); >>> + >>> + I915_WRITE(GEN6_RC_SLEEP_PSMI_CONTROL, >>> + _MASKED_BIT_ENABLE(GEN6_PSMI_SLEEP_MSG_DISABLE)); >>> + >>> + /* Note: we don't currently have a good handle on the maximum >>> + * latency for this wake up so while we only need to hold rcs >>> + * busy from process context we at least keep the waiting >>> + * interruptible... 
>>> + */ >>> + ret = intel_wait_for_register(dev_priv, fsm_reg, 0x3f, 0x30, 50); >>> + if (ret == -ETIMEDOUT) >>> + DRM_ERROR("Timeout waiting to be able to begin per-ctx >>> mmio\n"); >>> + >>> + if (ret) >>> + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_RENDER); >>> + >>> + return ret; >>> +} >>> + >>> +static void gen8_end_ctx_mmio(struct drm_i915_private *dev_priv) >>> +{ >>> + if (WARN_ON(!i915.enable_execlists)) >>> + return; >>> + >>> + I915_WRITE(GEN6_RC_SLEEP_PSMI_CONTROL, >>> + _MASKED_BIT_DISABLE(GEN6_PSMI_SLEEP_MSG_DISABLE)); >>> + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_RENDER); >>> +} >>> + >>> +/* NB: It must always remain pointer safe to run this even if the OA >>> unit >>> + * has been disabled. >>> + * >>> + * It's fine to put out-of-date values into these per-context registers >>> + * in the case that the OA unit has been disabled. >>> + */ >>> +static void gen8_update_reg_state_unlocked(struct i915_gem_context *ctx, >>> + u32 *reg_state) >>> +{ >>> + struct drm_i915_private *dev_priv = ctx->i915; >>> + const struct i915_oa_reg *flex_regs = >>> dev_priv->perf.oa.flex_regs; >>> + int n_flex_regs = dev_priv->perf.oa.flex_regs_len; >>> + u32 ctx_oactxctrl = dev_priv->perf.oa.ctx_oactxctrl_offset; >>> + u32 ctx_flexeu0 = dev_priv->perf.oa.ctx_flexeu0_offset; >>> + /* The MMIO offsets for Flex EU registers aren't contiguous */ >>> + u32 flex_mmio[] = { >>> + i915_mmio_reg_offset(EU_PERF_CNTL0), >>> + i915_mmio_reg_offset(EU_PERF_CNTL1), >>> + i915_mmio_reg_offset(EU_PERF_CNTL2), >>> + i915_mmio_reg_offset(EU_PERF_CNTL3), >>> + i915_mmio_reg_offset(EU_PERF_CNTL4), >>> + i915_mmio_reg_offset(EU_PERF_CNTL5), >>> + i915_mmio_reg_offset(EU_PERF_CNTL6), >>> + }; >>> + int i; >>> + >>> + reg_state[ctx_oactxctrl] = >>> i915_mmio_reg_offset(GEN8_OACTXCONTROL); >>> + reg_state[ctx_oactxctrl+1] = (dev_priv->perf.oa.period_exponent >>> << >>> + GEN8_OA_TIMER_PERIOD_SHIFT) | >>> + (dev_priv->perf.oa.periodic ? >>> + GEN8_OA_TIMER_ENABLE : 0) | >>> + GEN8_OA_COUNTER_RESUME; >>> + >>> + for (i = 0; i < ARRAY_SIZE(flex_mmio); i++) { >>> + u32 state_offset = ctx_flexeu0 + i * 2; >>> + u32 mmio = flex_mmio[i]; >>> + >>> + /* This arbitrary default will select the 'EU FPU0 >>> Pipeline >>> + * Active' event. In the future it's anticipated that >>> there >>> + * will be an explicit 'No Event' we can select, but not >>> yet... >>> + */ >>> + u32 value = 0; >>> + int j; >>> + >>> + for (j = 0; j < n_flex_regs; j++) { >>> + if (i915_mmio_reg_offset(flex_regs[j].addr) == >>> mmio) { >>> + value = flex_regs[j].value; >>> + break; >>> + } >>> + } >>> + >>> + reg_state[state_offset] = mmio; >>> + reg_state[state_offset+1] = value; >>> + } >>> +} >>> + >>> +/* Manages updating the per-context aspects of the OA stream >>> + * configuration across all contexts. >>> + * >>> + * The awkward consideration here is that OACTXCONTROL controls the >>> + * exponent for periodic sampling which is primarily used for system >>> + * wide profiling where we'd like a consistent sampling period even in >>> + * the face of context switches. >>> + * >>> + * Our approach of updating the register state context (as opposed to >>> + * say using a workaround batch buffer) ensures that the hardware >>> + * won't automatically reload an out-of-date timer exponent even >>> + * transiently before a WA BB could be parsed. 
>>> + * >>> + * This function needs to: >>> + * - Ensure the currently running context's per-context OA state is >>> + * updated >>> + * - Ensure that all existing contexts will have the correct per-context >>> + * OA state if they are scheduled for use. >>> + * - Ensure any new contexts will be initialized with the correct >>> + * per-context OA state. >>> + * >>> + * Note: it's only the RCS/Render context that has any OA state. >>> + */ >>> +static int gen8_configure_all_contexts(struct drm_i915_private >>> *dev_priv) >>> +{ >>> + struct i915_gem_context *ctx; >>> + int ret; >>> + >>> + ret = i915_mutex_lock_interruptible(&dev_priv->drm); >>> + if (ret) >>> + return ret; >>> + >>> + /* Switch away from any user context. */ >>> + ret = i915_gem_switch_to_kernel_context(dev_priv); >>> + if (ret) >> >> mutex_unlock(&dev_priv->drm.struct_mutex); > > > Duh! Thanks! > >> >>> + return ret; >>> + >>> + /* The OA register config is setup through the context image. >>> This image >>> + * might be written to by the GPU on context switch (in >>> particular on >>> + * lite-restore). This means we can't safely update a context's >>> image, >>> + * if this context is scheduled/submitted to run on the GPU. >>> + * >>> + * We could emit the OA register config through the batch buffer >>> but >>> + * this might leave small interval of time where the OA unit is >>> + * configured at an invalid sampling period. >>> + * >>> + * So far the best way to work around this issue seems to be >>> draining >>> + * the GPU from any submitted work. >>> + */ >>> + ret = i915_gem_wait_for_idle(dev_priv, >>> + I915_WAIT_INTERRUPTIBLE | >>> + I915_WAIT_LOCKED); >>> + if (ret) { >>> + mutex_unlock(&dev_priv->drm.struct_mutex); >>> + return ret; >>> + } >>> + >>> + /* Update all contexts now that we've stalled the submission. */ >>> + list_for_each_entry(ctx, &dev_priv->context_list, link) { >>> + if (!ctx->engine[RCS].initialised) >>> + continue; >>> + >>> + gen8_update_reg_state_unlocked(ctx, >>> + >>> ctx->engine[RCS].lrc_reg_state); >>> + } >>> + >>> + mutex_unlock(&dev_priv->drm.struct_mutex); >>> + >>> + /* Now update the current context. >>> + * >>> + * Note: Using MMIO to update per-context registers requires >>> + * some extra care... >>> + */ >>> + ret = gen8_begin_ctx_mmio(dev_priv); >>> + if (ret) { >>> + DRM_ERROR("Failed to bring RCS out of idle to update >>> current ctx OA state\n"); >>> + return ret; >>> + } >> >> So do we still need the mmio for the current context, given that we >> now idle everything and then update the reg state? > > > I would say yes, because we don't have any guarantee that another context is > going to be scheduled after that. > Yet we want the config to be applied before we return. Okay, feel free to keep my r-b then. _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx