On Mon, Mar 11, 2024 at 07:14:09PM +0100, Janusz Krzysztofik wrote:
> On Monday, 11 March 2024 18:35:43 CET Guenter Roeck wrote:
> > On 3/11/24 09:58, Rodrigo Vivi wrote:
> > > On Mon, Mar 11, 2024 at 09:06:46AM +0100, Janusz Krzysztofik wrote:
> > >> In the i915 hwmon sysfs getter path we now take hwmon_lock, then acquire
> > >> an rpm wakeref.  That results in lock inversion:
> > >>
> > >> <4> [197.079335] ======================================================
> > >> <4> [197.085473] WARNING: possible circular locking dependency detected
> > >> <4> [197.091611] 6.8.0-rc7-Patchwork_129026v7-gc4dc92fb1152+ #1 Not tainted
> > >> <4> [197.098096] ------------------------------------------------------
> > >> <4> [197.104231] prometheus-node/839 is trying to acquire lock:
> > >> <4> [197.109680] ffffffff82764d80 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc+0x9a/0x350
> > >> <4> [197.116939]
> > >> but task is already holding lock:
> > >> <4> [197.122730] ffff88811b772a40 (&hwmon->hwmon_lock){+.+.}-{3:3}, at: hwm_energy+0x4b/0x100 [i915]
> > >> <4> [197.131543]
> > >> which lock already depends on the new lock.
> > >> ...
> > >> <4> [197.507922] Chain exists of:
> > >>   fs_reclaim --> &gt->reset.mutex --> &hwmon->hwmon_lock
> > >> <4> [197.518528]  Possible unsafe locking scenario:
> > >> <4> [197.524411]        CPU0                    CPU1
> > >> <4> [197.528916]        ----                    ----
> > >> <4> [197.533418]   lock(&hwmon->hwmon_lock);
> > >> <4> [197.537237]                                lock(&gt->reset.mutex);
> > >> <4> [197.543376]                                lock(&hwmon->hwmon_lock);
> > >> <4> [197.549682]   lock(fs_reclaim);
> > >> ...
> > >> <4> [197.632548] Call Trace:
> > >> <4> [197.634990]  <TASK>
> > >> <4> [197.637088]  dump_stack_lvl+0x64/0xb0
> > >> <4> [197.640738]  check_noncircular+0x15e/0x180
> > >> <4> [197.652968]  check_prev_add+0xe9/0xce0
> > >> <4> [197.656705]  __lock_acquire+0x179f/0x2300
> > >> <4> [197.660694]  lock_acquire+0xd8/0x2d0
> > >> <4> [197.673009]  fs_reclaim_acquire+0xa1/0xd0
> > >> <4> [197.680478]  __kmalloc+0x9a/0x350
> > >> <4> [197.689063]  acpi_ns_internalize_name.part.0+0x4a/0xb0
> > >> <4> [197.694170]  acpi_ns_get_node_unlocked+0x60/0xf0
> > >> <4> [197.720608]  acpi_ns_get_node+0x3b/0x60
> > >> <4> [197.724428]  acpi_get_handle+0x57/0xb0
> > >> <4> [197.728164]  acpi_has_method+0x20/0x50
> > >> <4> [197.731896]  acpi_pci_set_power_state+0x43/0x120
> > >> <4> [197.736485]  pci_power_up+0x24/0x1c0
> > >> <4> [197.740047]  pci_pm_default_resume_early+0x9/0x30
> > >> <4> [197.744725]  pci_pm_runtime_resume+0x2d/0x90
> > >> <4> [197.753911]  __rpm_callback+0x3c/0x110
> > >> <4> [197.762586]  rpm_callback+0x58/0x70
> > >> <4> [197.766064]  rpm_resume+0x51e/0x730
> > >> <4> [197.769542]  rpm_resume+0x267/0x730
> > >> <4> [197.773020]  rpm_resume+0x267/0x730
> > >> <4> [197.776498]  rpm_resume+0x267/0x730
> > >> <4> [197.779974]  __pm_runtime_resume+0x49/0x90
> > >> <4> [197.784055]  __intel_runtime_pm_get+0x19/0xa0 [i915]
> > >> <4> [197.789070]  hwm_energy+0x55/0x100 [i915]
> > >> <4> [197.793183]  hwm_read+0x9a/0x310 [i915]
> > >> <4> [197.797124]  hwmon_attr_show+0x36/0x120
> > >> <4> [197.800946]  dev_attr_show+0x15/0x60
> > >> <4> [197.804509]  sysfs_kf_seq_show+0xb5/0x100
> > >>
> > >> However, the lock is only intended to protect either the hwmon overflow
> > >> counter or rmw hardware operations.  There is no need to hold the lock,
> > >> only the wakeref, while reading from hardware.
> > >>
> > >> Acquire the lock only after the hardware read, done under the rpm wakeref.
> > >>
> > >> Fixes: c41b8bdcc297 ("drm/i915/hwmon: Show device level energy usage")
> > >> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@xxxxxxxxxxxxxxx>
> > >> Cc: <stable@xxxxxxxxxxxxxxx> # v6.2+
> > >> ---
> > >>  drivers/gpu/drm/i915/i915_hwmon.c | 4 ++--
> > >>  1 file changed, 2 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/drivers/gpu/drm/i915/i915_hwmon.c b/drivers/gpu/drm/i915/i915_hwmon.c
> > >> index 8c3f443c8347e..faf7670de6e06 100644
> > >> --- a/drivers/gpu/drm/i915/i915_hwmon.c
> > >> +++ b/drivers/gpu/drm/i915/i915_hwmon.c
> > >> @@ -136,11 +136,11 @@ hwm_energy(struct hwm_drvdata *ddat, long *energy)
> > >>  	else
> > >>  		rgaddr = hwmon->rg.energy_status_all;
> > >>
> > >> -	mutex_lock(&hwmon->hwmon_lock);
> > >> -
> > >>  	with_intel_runtime_pm(uncore->rpm, wakeref)
> > >>  		reg_val = intel_uncore_read(uncore, rgaddr);
> > >>
> > >> +	mutex_lock(&hwmon->hwmon_lock);
> > >> +
> > >
> > > This is not enough.
> > > Check hwm_locked_with_pm_intel_uncore_rmw().
> > >
> > > It looks like we need to rethink this lock entirely here.
> >
> > I would have assumed that the lock was supposed to ensure that
> > reading the register value and the subsequent update of accum_energy
> > and reg_val_prev were synchronized.  This is no longer the case
> > after this patch has been applied.  Given that, I would agree that
> > it would make sense to define what the lock is supposed to protect
> > before changing its scope.
>
> Right.  In that case, I propose to take the wakeref before the lock and
> keep it while the lock is held around the calculations.  Would that be
> acceptable as a quick fix?  If yes, then I can take the same approach to
> also fix the other places in i915_hwmon.c where a similar lock inversion
> can happen, as Rodrigo pointed out.  Without that, we are stuck with
> another series that cleans up excessive use of rpm wakerefs by other
> users, since those wakerefs, even if unrelated, evidently hide the issue
> in hwmon, and dropping them will expose it.

That would work.  It is what we have in drivers/gpu/drm/xe/xe_hwmon.c
already.  Please convert every case and ensure that we are using
pm_runtime_get on every path, likely in the same places as already
handled in xe_hwmon.c.

> Thanks,
> Janusz
>
> > Guenter
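
For concreteness, here is a minimal sketch of the ordering Janusz proposes
for hwm_energy(): the wakeref is taken before the mutex and held while the
lock covers both the register read and the accumulation.  Field and helper
names (ddat->ei, accum_energy, reg_val_prev, SF_ENERGY, scl_shift_energy)
are assumed to match the current i915_hwmon.c; treat this as a sketch of
the proposed ordering, not the final patch:

static void
hwm_energy(struct hwm_drvdata *ddat, long *energy)
{
	struct intel_uncore *uncore = ddat->uncore;
	struct i915_hwmon *hwmon = ddat->hwmon;
	struct hwm_energy_info *ei = &ddat->ei;
	intel_wakeref_t wakeref;
	i915_reg_t rgaddr;
	u32 reg_val;

	if (ddat->gt_n >= 0)
		rgaddr = hwmon->rg.energy_status_tile;
	else
		rgaddr = hwmon->rg.energy_status_all;

	/*
	 * Wakeref first: runtime resume may allocate memory (the
	 * fs_reclaim edge in the lockdep chain), so it must never
	 * run under hwmon_lock.
	 */
	with_intel_runtime_pm(uncore->rpm, wakeref) {
		mutex_lock(&hwmon->hwmon_lock);

		/*
		 * The register read and the accumulator update stay
		 * under a single lock acquisition, preserving the
		 * accum_energy/reg_val_prev synchronization Guenter
		 * describes.
		 */
		reg_val = intel_uncore_read(uncore, rgaddr);

		if (reg_val >= ei->reg_val_prev)
			ei->accum_energy += reg_val - ei->reg_val_prev;
		else
			ei->accum_energy += UINT_MAX - ei->reg_val_prev + reg_val;
		ei->reg_val_prev = reg_val;

		*energy = mul_u64_u32_shr(ei->accum_energy, SF_ENERGY,
					  hwmon->scl_shift_energy);

		mutex_unlock(&hwmon->hwmon_lock);
	}
}

The point of the ordering is that runtime resume can end up in the memory
allocator, so the wakeref must never be acquired inside hwmon_lock, while
the read and the accumulator update remain atomic with respect to other
readers.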
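
Applied to the rmw helper Rodrigo mentions, the same reordering would look
roughly like this (again a sketch only, with the signature assumed from
i915_hwmon.c):

static void
hwm_locked_with_pm_intel_uncore_rmw(struct hwm_drvdata *ddat,
				    i915_reg_t reg, u32 clear, u32 set)
{
	struct i915_hwmon *hwmon = ddat->hwmon;
	struct intel_uncore *uncore = ddat->uncore;
	intel_wakeref_t wakeref;

	/*
	 * Same ordering as in hwm_energy(): wakeref (and any allocation
	 * done by rpm resume) outside, mutex inside, so no runtime
	 * resume ever runs under hwmon_lock.
	 */
	with_intel_runtime_pm(uncore->rpm, wakeref) {
		mutex_lock(&hwmon->hwmon_lock);

		intel_uncore_rmw(uncore, reg, clear, set);

		mutex_unlock(&hwmon->hwmon_lock);
	}
}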