Re: [PATCH v2] drm/i915: Handle RC6 counter wrap

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Thu, 8 Feb 2018 08:50:58 +0000

On 08/02/2018 08:00, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-02-08 07:04:37)
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>

We can implement limited RC6 counter wrap-around protection under the
assumption that clients will be reading this value more frequently than
the wrap period on a given platform.

With the typical wrap-around period being ~90 minutes, even with the
exception of Baytrail which wraps every 13 seconds, this sounds like a
reasonable assumption.

Implementation works by storing a 64-bit software copy of a hardware RC6
counter, along with the previous HW counter snapshot. This enables it to
detect wrap is polled frequently enough and keep the software copy
monotonically incrementing.

v2:
  * Missed GEN6_GT_GFX_RC6_LOCKED when considering slot sizing and
    indexing.
  * Fixed off-by-one in wrap-around handling. (Chris Wilson)

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94852
Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
---
  drivers/gpu/drm/i915/i915_drv.h |  2 ++
  drivers/gpu/drm/i915/intel_pm.c | 57 ++++++++++++++++++++++++++++++++++-------
  2 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ad1fc845cd1b..28a2671a26c7 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -946,6 +946,8 @@ struct intel_rps {
  
  struct intel_rc6 {
         bool enabled;
+       u64 prev_hw_residency[4];
+       u64 cur_residency[4];
  };
  
  struct intel_llc_pstate {
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 41f26ab46501..6b01700334a9 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -9417,15 +9417,16 @@ static u64 vlv_residency_raw(struct drm_i915_private *dev_priv,
                              const i915_reg_t reg)
  {
         u32 lower, upper, tmp;
-       unsigned long flags;
         int loop = 2;
  
-       /* The register accessed do not need forcewake. We borrow
+       /*
+        * The register accessed do not need forcewake. We borrow
          * uncore lock to prevent concurrent access to range reg.
          */
-       spin_lock_irqsave(&dev_priv->uncore.lock, flags);
+       lockdep_assert_held(&dev_priv->uncore.lock);
  
-       /* vlv and chv residency counters are 40 bits in width.
+       /*
+        * vlv and chv residency counters are 40 bits in width.
          * With a control bit, we can choose between upper or lower
          * 32bit window into this counter.
          *
@@ -9449,29 +9450,44 @@ static u64 vlv_residency_raw(struct drm_i915_private *dev_priv,
                 upper = I915_READ_FW(reg);
         } while (upper != tmp && --loop);
  
-       /* Everywhere else we always use VLV_COUNTER_CONTROL with the
+       /*
+        * Everywhere else we always use VLV_COUNTER_CONTROL with the
          * VLV_COUNT_RANGE_HIGH bit set - so it is safe to leave it set
          * now.
          */
  
-       spin_unlock_irqrestore(&dev_priv->uncore.lock, flags);
-
         return lower | (u64)upper << 8;
  }
  
  u64 intel_rc6_residency_ns(struct drm_i915_private *dev_priv,
                            const i915_reg_t reg)
  {
-       u64 time_hw;
+       u64 time_hw, prev_hw, overflow_hw;
+       unsigned int fw_domains;
+       unsigned long flags;
         u32 mul, div;
+       int i;
  
         if (!HAS_RC6(dev_priv))
                 return 0;
  
+       /* Counter wrap handling stores previous hw counter values. */
+       i = (i915_mmio_reg_offset(reg) -
+            i915_mmio_reg_offset(GEN6_GT_GFX_RC6_LOCKED)) / sizeof(u32);
+       if (WARN_ON_ONCE(i < 0 ||
+                        i >= ARRAY_SIZE(dev_priv->gt_pm.rc6.cur_residency)))

int i, but ARRAY_SIZE is unsigned, so just

if (WARN_ON_ONCE(i >= ARRAY_SIZE(dev_priv->gt_pm.rc6.cur_residency)))

would suffice.

I remain surprised that they are in a neat block across atom/big-core

#define VLV_GT_RENDER_RC6                       _MMIO(0x138108)
#define VLV_GT_MEDIA_RC6                        _MMIO(0x13810C)

#define GEN6_GT_GFX_RC6_LOCKED                  _MMIO(0x138104)
#define GEN6_GT_GFX_RC6                         _MMIO(0x138108)
#define GEN6_GT_GFX_RC6p                        _MMIO(0x13810C)
#define GEN6_GT_GFX_RC6pp                       _MMIO(0x138110)

are indeed just the 4 registers in use.

+       fw_domains = intel_uncore_forcewake_for_reg(dev_priv, reg, FW_REG_READ);
+
+       spin_lock_irqsave(&dev_priv->uncore.lock, flags);
+       intel_uncore_forcewake_get__locked(dev_priv, fw_domains);
+
         /* On VLV and CHV, residency time is in CZ units rather than 1.28us */
         if (IS_VALLEYVIEW(dev_priv) || IS_CHERRYVIEW(dev_priv)) {
                 mul = 1000000;
                 div = dev_priv->czclk_freq;
+               overflow_hw = BIT_ULL(40);
                 time_hw = vlv_residency_raw(dev_priv, reg);
         } else {
                 /* 833.33ns units on Gen9LP, 1.28us elsewhere. */
@@ -9483,9 +9499,32 @@ u64 intel_rc6_residency_ns(struct drm_i915_private *dev_priv,
                         div = 1;
                 }
  
-               time_hw = I915_READ(reg);
+               overflow_hw = BIT_ULL(32);
+               time_hw = I915_READ_FW(reg);
         }
  
+       /*
+        * Counter wrap handling.
+        *
+        * But relying on a sufficient frequency of queries otherwise counters
+        * can still wrap.
+        */
+       prev_hw = dev_priv->gt_pm.rc6.prev_hw_residency[i];
+       dev_priv->gt_pm.rc6.prev_hw_residency[i] = time_hw;
+
+       /* RC6 delta from last sample. */
+       if (time_hw >= prev_hw)
+               time_hw -= prev_hw;
+       else
+               time_hw += overflow_hw - prev_hw;
+
+       /* Add delta to RC6 extended raw driver copy. */
+       time_hw += dev_priv->gt_pm.rc6.cur_residency[i];
+       dev_priv->gt_pm.rc6.cur_residency[i] = time_hw;

Ok.

Reviewed-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

The other part of the puzzle was perhaps just using a timer to read the
registers every 1/2 wrap period, but there was no clear picture on when
to start and stop that timer for sysfs. However, with perf we can create
and destroy a background timer on PMU open/destroy.

Fancy hooking up a background timer?

Could do, wasn't sure that the complication is worth it.

Idea would be to schedule a timer for instance, as you say, 1/2 wrap 
period, and keep pushing it forward on manual reads.

But then question of runtime suspend arises again. If we decide to 
estimate the counter, as in the case of PMU, then I need to move that 
code out of PMU and make intel_pm.c the one and only authoritative 
source of RC6 data.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx