Re: [PATCH] drm/i915/guc: Disable PL1 power limit when loading GuC firmware

On 3/24/2023 4:31 PM, Dixit, Ashutosh wrote:
On Fri, 24 Mar 2023 11:15:02 -0700, Belgaumkar, Vinay wrote:
Hi Vinay,

Thanks for the review. Comments inline below.
Sorry about asking the same questions all over again :) Didn't look at previous versions.

On 3/15/2023 8:59 PM, Ashutosh Dixit wrote:
On dGfx, the PL1 power limit being enabled and set to a low value results
in a low GPU operating freq. It also negates the freq raise operation which
is done before GuC firmware load. As a result GuC firmware load can time
out. Such timeouts were seen in the GL #8062 bug below (where the PL1 power
limit was enabled and set to a low value). Therefore disable the PL1 power
limit when allowed by HW when loading GuC firmware.
v3 label missing in subject.
v2:
   - Take mutex (to disallow writes to power1_max) across GuC reset/fw load
   - Add hwm_power_max_restore to error return code path

v3 (Jani N):
   - Add/remove explanatory comments
   - Function renames
   - Type corrections
   - Locking annotation

Link: https://gitlab.freedesktop.org/drm/intel/-/issues/8062
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@xxxxxxxxx>
---
   drivers/gpu/drm/i915/gt/uc/intel_uc.c |  9 +++++++
   drivers/gpu/drm/i915/i915_hwmon.c     | 39 +++++++++++++++++++++++++++
   drivers/gpu/drm/i915/i915_hwmon.h     |  7 +++++
   3 files changed, 55 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
index 4ccb4be4c9cba..aa8e35a5636a0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
@@ -18,6 +18,7 @@
   #include "intel_uc.h"
     #include "i915_drv.h"
+#include "i915_hwmon.h"
     static const struct intel_uc_ops uc_ops_off;
   static const struct intel_uc_ops uc_ops_on;
@@ -461,6 +462,7 @@ static int __uc_init_hw(struct intel_uc *uc)
	struct intel_guc *guc = &uc->guc;
	struct intel_huc *huc = &uc->huc;
	int ret, attempts;
+	bool pl1en;
Init to 'false' here
See next comment.


		GEM_BUG_ON(!intel_uc_supports_guc(uc));
	GEM_BUG_ON(!intel_uc_wants_guc(uc));
@@ -491,6 +493,9 @@ static int __uc_init_hw(struct intel_uc *uc)
	else
		attempts = 1;
+	/* Disable a potentially low PL1 power limit to allow freq to be raised */
+	i915_hwmon_power_max_disable(gt->i915, &pl1en);
+
	intel_rps_raise_unslice(&uc_to_gt(uc)->rps);
		while (attempts--) {
@@ -547,6 +552,8 @@ static int __uc_init_hw(struct intel_uc *uc)
		intel_rps_lower_unslice(&uc_to_gt(uc)->rps);
	}
   +	i915_hwmon_power_max_restore(gt->i915, pl1en);
+
	guc_info(guc, "submission %s\n", str_enabled_disabled(intel_uc_uses_guc_submission(uc)));
	guc_info(guc, "SLPC %s\n", str_enabled_disabled(intel_uc_uses_guc_slpc(uc)));
   @@ -563,6 +570,8 @@ static int __uc_init_hw(struct intel_uc *uc)
	/* Return GT back to RPn */
	intel_rps_lower_unslice(&uc_to_gt(uc)->rps);
   +	i915_hwmon_power_max_restore(gt->i915, pl1en);
if (pl1en)

     i915_hwmon_power_max_enable().
IMO it's better not to have checks in the main __uc_init_hw() function (if
we do this we'll need to add 2 checks in __uc_init_hw()). If you really
want we could do something like this inside
i915_hwmon_power_max_disable/i915_hwmon_power_max_restore. But for now I
am not making any changes.
ok.

(I can send a patch with the changes if you want to take a look but IMO it
will add more logic/code but without real benefits (it will save a rmw if
the limit was already disabled, but IMO this code is called so infrequently
(only during GuC resets) as to not have any significant impact)).

+
	__uc_sanitize(uc);
		if (!ret) {
diff --git a/drivers/gpu/drm/i915/i915_hwmon.c b/drivers/gpu/drm/i915/i915_hwmon.c
index ee63a8fd88fc1..769b5bda4d53f 100644
--- a/drivers/gpu/drm/i915/i915_hwmon.c
+++ b/drivers/gpu/drm/i915/i915_hwmon.c
@@ -444,6 +444,45 @@ hwm_power_write(struct hwm_drvdata *ddat, u32 attr, int chan, long val)
	}
   }
+void i915_hwmon_power_max_disable(struct drm_i915_private *i915, bool *old)
Shouldn't we call this i915_hwmon_package_pl1_disable()?
I did think of using "pl1" in the function name but then decided to retain
"power_max" because other hwmon functions for PL1 limit also use
"power_max" (hwm_power_max_read/hwm_power_max_write) and currently
"hwmon_power_max" is mapped to the PL1 limit. So "power_max" is used to
show that all these functions deal with the PL1 power limit.

There is a comment in __uc_init_hw() explaining "power_max" means the PL1
power limit.
ok.

+	__acquires(i915->hwmon->hwmon_lock)
+{
+	struct i915_hwmon *hwmon = i915->hwmon;
+	intel_wakeref_t wakeref;
+	u32 r;
+
+	if (!hwmon || !i915_mmio_reg_valid(hwmon->rg.pkg_rapl_limit))
+		return;
+
+	/* Take mutex to prevent concurrent hwm_power_max_write */
+	mutex_lock(&hwmon->hwmon_lock);
+
+	with_intel_runtime_pm(hwmon->ddat.uncore->rpm, wakeref)
+		r = intel_uncore_rmw(hwmon->ddat.uncore,
+				     hwmon->rg.pkg_rapl_limit,
+				     PKG_PWR_LIM_1_EN, 0);
Most of this code (lock and rmw parts) is already inside static void
hwm_locked_with_pm_intel_uncore_rmw() , can we reuse that here?
This was the case in v1 of the patch:

https://patchwork.freedesktop.org/patch/526393/?series=115003&rev=1

But now this cannot be done because if you notice we acquire the mutex in
i915_hwmon_power_max_disable() and release the mutex in
i915_hwmon_power_max_restore().
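
To make the split lock/unlock concrete, here is a minimal user-space sketch of a disable/restore pair that takes a lock in one function and releases it in the other. It is not the i915 code: `struct demo_hwmon`, the `demo_*` functions, and the bit position of `DEMO_PWR_LIM_1_EN` are hypothetical, and a C11 `atomic_flag` stands in for the kernel mutex.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position, for illustration only. */
#define DEMO_PWR_LIM_1_EN (UINT32_C(1) << 15)

struct demo_hwmon {
	atomic_flag lock;     /* toy stand-in for hwmon_lock */
	uint32_t rapl_limit;  /* toy stand-in for the limit register */
};

/* Takes the lock and clears the enable bit; the lock stays held on return. */
static void demo_power_max_disable(struct demo_hwmon *h, bool *old)
{
	while (atomic_flag_test_and_set(&h->lock))
		; /* spin until the lock is free */

	*old = !!(h->rapl_limit & DEMO_PWR_LIM_1_EN);
	h->rapl_limit &= ~DEMO_PWR_LIM_1_EN;
}

/* Puts the enable bit back to its saved state, then releases the lock. */
static void demo_power_max_restore(struct demo_hwmon *h, bool old)
{
	h->rapl_limit &= ~DEMO_PWR_LIM_1_EN;
	if (old)
		h->rapl_limit |= DEMO_PWR_LIM_1_EN;

	atomic_flag_clear(&h->lock);
}

/* Models a concurrent writer; returns false if the lock is currently held. */
static bool demo_try_write(struct demo_hwmon *h, uint32_t val)
{
	if (atomic_flag_test_and_set(&h->lock))
		return false;

	h->rapl_limit = val;
	atomic_flag_clear(&h->lock);
	return true;
}
```

In this model, any `demo_try_write()` issued between `demo_power_max_disable()` and `demo_power_max_restore()` fails, which is the property the mutex is meant to provide against a concurrent hwm_power_max_write. The cost is also visible: the lock is held for the whole window between the two calls.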

I explained the reason why the mutex is handled this way in my reply
to Jani Nikula here:

https://patchwork.freedesktop.org/patch/526598/?series=115003&rev=2

Quoting below:

```
+	/* hwmon_lock mutex is unlocked in hwm_power_max_restore */
Not too happy about that... any better ideas?
Afais, taking the mutex is the only fully correct solution (when we disable
the power limit, userspace can go re-enable it). Examples of partly
incorrect solutions (which don't take the mutex) include:

a. Don't take the mutex, don't do anything, ignore any changes to the value
    if it has changed during GuC reset/fw load (just overwrite the changed
    value). Con: changed value is lost.

b. Detect if the value has changed (the limit has been re-enabled) after we
    have disabled the limit and in that case skip restoring the value. But
    then someone can say why do we allow enabling the PL1 limit since we
    want to disable it.

Both these are very unlikely scenarios so they might work. But I would
first like to explore if holding a mutex across GuC reset is problematic
since that is /the/ correct solution. But if anyone comes up with a reason
why that cannot be done we can look at these other not completely correct
options.

Well, one reason is that this is adding a lot of duplicate/non-reusable code needlessly. If it gets re-used elsewhere, that could lead to some weird situations where the lock could be held for an extended period of time and introduce dependencies. Also, how/why would the user modify this PL1 during guc load? The sysfs interfaces are not even ready at this point? Even if we consider this during a resume, the terminal will not be available to the user.

Thanks,

Vinay.

```
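
For comparison, option (b) from the quoted text (no mutex, skip the restore if the limit was re-enabled in the meantime) could look roughly like the sketch below. This is only an illustration of the idea, not code from any version of the patch; the `demo_*` names and the bit position are hypothetical, and a plain variable stands in for the register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position, for illustration only. */
#define DEMO_PWR_LIM_1_EN (UINT32_C(1) << 15)

/* Clears the enable bit without taking any lock; returns its old state. */
static bool demo_disable_unlocked(uint32_t *reg)
{
	bool was_enabled = (*reg & DEMO_PWR_LIM_1_EN) != 0;

	*reg &= ~DEMO_PWR_LIM_1_EN;
	return was_enabled;
}

/*
 * Restores the enable bit only if nobody re-enabled the limit in the
 * meantime. Returns false when the restore was skipped.
 */
static bool demo_restore_if_unchanged(uint32_t *reg, bool was_enabled)
{
	if (*reg & DEMO_PWR_LIM_1_EN)
		return false; /* re-enabled behind our back; leave it alone */

	if (was_enabled)
		*reg |= DEMO_PWR_LIM_1_EN;
	return true;
}
```

The sketch shows the weakness named in the quote: once the limit is re-enabled during the window, it stays enabled for the rest of the firmware load, which is exactly what the patch is trying to avoid.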

+
+	*old = !!(r & PKG_PWR_LIM_1_EN);
+}
+
+void i915_hwmon_power_max_restore(struct drm_i915_private *i915, bool old)
+	__releases(i915->hwmon->hwmon_lock)
We can just call this i915_hwmon_power_max_enable() and call whenever the
old value was actually enabled. That way, we have proper mirror functions.
As I explained that would mean adding two checks in the main __uc_init_hw()
function which I am trying to avoid. So we have disable/restore pair.

+{
+	struct i915_hwmon *hwmon = i915->hwmon;
+	intel_wakeref_t wakeref;
+
+	if (!hwmon || !i915_mmio_reg_valid(hwmon->rg.pkg_rapl_limit))
+		return;
+
+	with_intel_runtime_pm(hwmon->ddat.uncore->rpm, wakeref)
+		intel_uncore_rmw(hwmon->ddat.uncore,
+				 hwmon->rg.pkg_rapl_limit,
+				 PKG_PWR_LIM_1_EN,
+				 old ? PKG_PWR_LIM_1_EN : 0);
3rd param should be 0 here, else we will end up clearing other bits.
No, see intel_uncore_rmw(): it will only clear the PKG_PWR_LIM_1_EN bit, so
the code here is correct. intel_uncore_rmw() does:

         val = (old & ~clear) | set;
Ok, just confusing, since you are also setting it with the 4th param.
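
The clear/set semantics can be checked with a small stand-alone model. This is a sketch under assumptions: `demo_rmw()` only mirrors the formula quoted above and returns the value read before the update (as the patch's use of the return value in the disable function implies), the bit position of `DEMO_PWR_LIM_1_EN` is hypothetical, and a plain variable stands in for the MMIO register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position, for illustration only. */
#define DEMO_PWR_LIM_1_EN (UINT32_C(1) << 15)

/*
 * Model of a read-modify-write helper: bits in 'clear' are cleared,
 * then bits in 'set' are ORed in. Returns the value before the update.
 */
static uint32_t demo_rmw(uint32_t *reg, uint32_t clear, uint32_t set)
{
	uint32_t old = *reg;

	*reg = (old & ~clear) | set;
	return old;
}

/* Disable: clear only the enable bit and report whether it was set. */
static bool demo_disable(uint32_t *reg)
{
	uint32_t r = demo_rmw(reg, DEMO_PWR_LIM_1_EN, 0);

	return !!(r & DEMO_PWR_LIM_1_EN);
}

/* Restore: clear the enable bit, then set it again only if 'old' is true. */
static void demo_restore(uint32_t *reg, bool old)
{
	demo_rmw(reg, DEMO_PWR_LIM_1_EN, old ? DEMO_PWR_LIM_1_EN : 0);
}
```

Passing the same mask as both the clear and the set argument leaves every other bit untouched: the clear mask selects which bits may change, and the set value decides their new state.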

So for now I am not making any changes. If you feel strongly about
something one way or another, let me know. Anyway, these comments should help
you understand the patch better, so take a look and we can go from there.

Thanks.
--
Ashutosh

+
+	mutex_unlock(&hwmon->hwmon_lock);
+}
+
   static umode_t
   hwm_energy_is_visible(const struct hwm_drvdata *ddat, u32 attr)
   {
diff --git a/drivers/gpu/drm/i915/i915_hwmon.h b/drivers/gpu/drm/i915/i915_hwmon.h
index 7ca9cf2c34c96..0fcb7de844061 100644
--- a/drivers/gpu/drm/i915/i915_hwmon.h
+++ b/drivers/gpu/drm/i915/i915_hwmon.h
@@ -7,14 +7,21 @@
   #ifndef __I915_HWMON_H__
   #define __I915_HWMON_H__
   +#include <linux/types.h>
+
   struct drm_i915_private;
+struct intel_gt;
     #if IS_REACHABLE(CONFIG_HWMON)
   void i915_hwmon_register(struct drm_i915_private *i915);
   void i915_hwmon_unregister(struct drm_i915_private *i915);
+void i915_hwmon_power_max_disable(struct drm_i915_private *i915, bool *old);
+void i915_hwmon_power_max_restore(struct drm_i915_private *i915, bool old);
   #else
   static inline void i915_hwmon_register(struct drm_i915_private *i915) { };
   static inline void i915_hwmon_unregister(struct drm_i915_private *i915) { };
+static inline void i915_hwmon_power_max_disable(struct drm_i915_private *i915, bool *old) { };
+static inline void i915_hwmon_power_max_restore(struct drm_i915_private *i915, bool old) { };
   #endif
     #endif /* __I915_HWMON_H__ */


