On Fri, Oct 21, 2022 at 09:42:53AM +0100, Tvrtko Ursulin wrote:
On 27/10/2021 01:48, Umesh Nerlige Ramappa wrote:
[snip]
+static void guc_timestamp_ping(struct work_struct *wrk)
+{
+ struct intel_guc *guc = container_of(wrk, typeof(*guc),
+ timestamp.work.work);
+ struct intel_uc *uc = container_of(guc, typeof(*uc), guc);
+ struct intel_gt *gt = guc_to_gt(guc);
+ intel_wakeref_t wakeref;
+ unsigned long flags;
+ int srcu, ret;
+
+ /*
+ * Synchronize with gt reset to make sure the worker does not
+ * corrupt the engine/guc stats.
+ */
+ ret = intel_gt_reset_trylock(gt, &srcu);
+ if (ret)
+ return;
+
+ spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+ with_intel_runtime_pm(>->i915->runtime_pm, wakeref)
+ __update_guc_busyness_stats(guc);
Spotted one splat today: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12268/bat-adlp-4/igt@i915_pm_rpm@xxxxxxxxxxxxxxxxxxxxxxx
Could be that reset lock needs to be inside the rpm get. Haven't really though about it much, could you please check?
<4> [300.214744]
<4> [300.214753] ======================================================
<4> [300.214755] WARNING: possible circular locking dependency detected
<4> [300.214758] 6.1.0-rc1-CI_DRM_12268-g86e8558e3283+ #1 Not tainted
<4> [300.214761] ------------------------------------------------------
<4> [300.214762] kworker/10:1H/265 is trying to acquire lock:
<4> [300.214765] ffffffff8275e560 (fs_reclaim){+.+.}-{0:0}, at: __kmem_cache_alloc_node+0x27/0x170
<4> [300.214780]
but task is already holding lock:
<4> [300.214782] ffffc900013e7e78 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: process_one_work+0x1eb/0x5b0
<4> [300.214793]
which lock already depends on the new lock.
<4> [300.214794]
the existing dependency chain (in reverse order) is:
<4> [300.214796]
-> #2 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:
<4> [300.214801] lock_acquire+0xd3/0x310
<4> [300.214806] __flush_work+0x77/0x4e0
<4> [300.214811] __cancel_work_timer+0x14e/0x1f0
<4> [300.214815] intel_guc_submission_reset_prepare+0x7a/0x420 [i915]
<4> [300.215119] intel_uc_reset_prepare+0x44/0x50 [i915]
<4> [300.215360] reset_prepare+0x21/0x80 [i915]
<4> [300.215561] intel_gt_reset+0x143/0x340 [i915]
<4> [300.215757] intel_gt_reset_global+0xeb/0x160 [i915]
<4> [300.215946] intel_gt_handle_error+0x2c2/0x410 [i915]
<4> [300.216137] intel_gt_debugfs_reset_store+0x59/0xc0 [i915]
<4> [300.216333] i915_wedged_set+0xc/0x20 [i915]
<4> [300.216513] simple_attr_write+0xda/0x100
<4> [300.216520] full_proxy_write+0x4e/0x80
<4> [300.216525] vfs_write+0xe3/0x4e0
<4> [300.216531] ksys_write+0x57/0xd0
<4> [300.216535] do_syscall_64+0x37/0x90
<4> [300.216542] entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4> [300.216549]
-> #1 (>->reset.mutex){+.+.}-{3:3}:
<4> [300.216556] lock_acquire+0xd3/0x310
<4> [300.216559] i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]
i915_gem_shrinker_taints_mutex seems to have something to do with
fs_reclaim and so does the stack #0. Any idea what this early init is
doing? Can this code also result in a gt_wedged case because that might
explain the stack #2 which is a reset.
<4> [300.216799] intel_gt_init_reset+0x61/0x80 [i915]
<4> [300.217018] intel_gt_common_init_early+0x10c/0x190 [i915]
<4> [300.217227] intel_root_gt_init_early+0x44/0x60 [i915]
<4> [300.217434] i915_driver_probe+0x9ab/0xf30 [i915]
<4> [300.217615] i915_pci_probe+0xa5/0x240 [i915]
<4> [300.217796] pci_device_probe+0x95/0x110
<4> [300.217803] really_probe+0xd6/0x350
<4> [300.217811] __driver_probe_device+0x73/0x170
<4> [300.217816] driver_probe_device+0x1a/0x90
<4> [300.217821] __driver_attach+0xbc/0x190
<4> [300.217826] bus_for_each_dev+0x72/0xc0
<4> [300.217831] bus_add_driver+0x1bb/0x210
<4> [300.217835] driver_register+0x66/0xc0
<4> [300.217841] 0xffffffffa093001f
<4> [300.217844] do_one_initcall+0x53/0x2f0
<4> [300.217849] do_init_module+0x45/0x1c0
<4> [300.217855] load_module+0x1d5e/0x1e90
<4> [300.217859] __do_sys_finit_module+0xaf/0x120
<4> [300.217864] do_syscall_64+0x37/0x90
<4> [300.217869] entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4> [300.217875]
-> #0 (fs_reclaim){+.+.}-{0:0}:
<4> [300.217880] validate_chain+0xb3d/0x2000
<4> [300.217884] __lock_acquire+0x5a4/0xb70
<4> [300.217888] lock_acquire+0xd3/0x310
<4> [300.217891] fs_reclaim_acquire+0xa1/0xd0
fs_reclaim ^
<4> [300.217896] __kmem_cache_alloc_node+0x27/0x170
<4> [300.217899] __kmalloc+0x43/0x1a0
<4> [300.217903] acpi_ns_internalize_name+0x44/0x9f
<4> [300.217909] acpi_ns_get_node_unlocked+0x6b/0xd7
<4> [300.217914] acpi_ns_get_node+0x3b/0x54
<4> [300.217918] acpi_get_handle+0x89/0xb7
<4> [300.217922] acpi_has_method+0x1c/0x40
<4> [300.217928] acpi_pci_set_power_state+0x42/0xf0
<4> [300.217935] pci_power_up+0x20/0x1a0
<4> [300.217940] pci_pm_default_resume_early+0x9/0x30
<4> [300.217945] pci_pm_runtime_resume+0x29/0x90
<4> [300.217948] __rpm_callback+0x3d/0x110
<4> [300.217954] rpm_callback+0x58/0x60
<4> [300.217959] rpm_resume+0x548/0x760
<4> [300.217963] __pm_runtime_resume+0x42/0x80
<4> [300.217968] __intel_runtime_pm_get+0x19/0x80 [i915]
<4> [300.218170] guc_timestamp_ping+0x63/0xc0 [i915]
<4> [300.218467] process_one_work+0x272/0x5b0
<4> [300.218472] worker_thread+0x37/0x370
<4> [300.218477] kthread+0xed/0x120
<4> [300.218481] ret_from_fork+0x1f/0x30
<4> [300.218485]
If the suspend has completed, not sure why guc_timestamp_ping is getting
called and resulting in pci_power_up in this stack. The park should have
synchronously canceled the ping worker. Strange.
other info that might help us debug this:
<4> [300.218487] Chain exists of:
fs_reclaim --> >->reset.mutex --> (work_completion)(&(&guc->timestamp.work)->work)
<4> [300.218495] Possible unsafe locking scenario:
<4> [300.218497] CPU0 CPU1
<4> [300.218499] ---- ----
<4> [300.218501] lock((work_completion)(&(&guc->timestamp.work)->work));
<4> [300.218505] lock(>->reset.mutex);
<4> [300.218509] lock((work_completion)(&(&guc->timestamp.work)->work));
<4> [300.218512] lock(fs_reclaim);
<4> [300.218515]
*** DEADLOCK ***
Still looking into it, could use some help with the above questions.
Thanks,
Umesh
Regards,
Tvrtko