On Fri, Aug 13, 2021 at 07:02:55PM +0000, Matthew Brost wrote:
> On Fri, Aug 13, 2021 at 05:11:59PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 12, 2021 at 10:38:18PM +0000, Matthew Brost wrote:
> > > On Thu, Aug 12, 2021 at 09:47:23PM +0200, Daniel Vetter wrote:
> > > > On Thu, Aug 12, 2021 at 03:23:30PM +0000, Matthew Brost wrote:
> > > > > On Thu, Aug 12, 2021 at 04:11:28PM +0200, Daniel Vetter wrote:
> > > > > > On Wed, Aug 11, 2021 at 01:16:18AM +0000, Matthew Brost wrote:
> > > > > > > Flush the work queue for GuC generated G2H messages during a GT reset.
> > > > > > > This is accomplished by spinning until the list of outstanding G2H
> > > > > > > messages goes empty.
> > > > > > >
> > > > > > > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > > > > > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > > > > Cc: <stable@xxxxxxxxxxxxxxx>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
> > > > > > >  1 file changed, 5 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > index 3cd2da6f5c03..e5eb2df11b4a 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > @@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > > > > > >  		wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > > > > > >  	} while (!list_empty(&guc->ct.requests.incoming));
> > > > > > >  }
> > > > > > > +
> > > > > > > +	/* Flush any GuC generated G2H */
> > > > > > > +	while (!list_empty(&guc->ct.requests.incoming))
> > > > > > > +		msleep(20);
> > > > > >
> > > > > > flush_work or flush_workqueue, because that comes with lockdep
> > > > > > annotations. Don't hand-roll stuff like this if at all possible.
> > > > >
> > > > > Lockdep puked when I used that.
> > > >
> > > > Lockdep tends to be right ... So we definitely want that, but maybe a
> > > > different flavour, or there's something wrong with the workqueue setup.
> > > >
> > >
> > > Here is the dependency chain that lockdep doesn't like:
> > >
> > > fs_reclaim_acquire -> gt->reset.mutex (shrinker)
> > > workqueue -> fs_reclaim_acquire (error capture in workqueue)
> > > gt->reset.mutex -> workqueue (reset)
> > >
> > > In practice I don't think we could ever hit this, but lockdep does look
> > > right here. Trying to work out how to fix this. We really need all G2H
> > > messages to be fully processed before we proceed during a reset or we
> > > have races. Have a few ideas of how to handle this but can't convince
> > > myself any of them are fully safe.
> >
> > It might be false sharing due to a single workqueue, or a single-threaded
> > workqueue.
> >
> > Essentially the lockdep annotations for work_struct track two things:
> > - dependencies against the specific work item
> > - dependencies against anything queued on that work queue, if you flush
> >   the entire queue, or if you flush a work item that's on a
> >   single-threaded queue.
> >
> > Because if guc/host communication is inverted like this, you have a much
> > bigger problem.
> >
> > Note that if you pick a different workqueue for your guc work stuff then
> > you need to make sure that's all properly flushed on suspend and driver
> > unload.
> >
> > It might also be that the reset work is on the wrong workqueue.
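
To illustrate the dedicated-workqueue idea, a rough sketch only - untested,
and guc_g2h_wq plus the init/flush/fini hooks are made-up names for
illustration, not actual i915 code:

#include <linux/workqueue.h>

static struct workqueue_struct *guc_g2h_wq;

static int guc_g2h_wq_init(void)
{
	/*
	 * A queue that only ever sees G2H work, so flushing it cannot
	 * pull in lockdep dependencies from unrelated work items that
	 * share one of the system workqueues.
	 */
	guc_g2h_wq = alloc_ordered_workqueue("i915-guc-g2h", 0);
	return guc_g2h_wq ? 0 : -ENOMEM;
}

static void guc_g2h_flush(struct work_struct *worker)
{
	/*
	 * Would replace the msleep() polling loop from the patch;
	 * unlike the poll, flush_work() records the reset -> G2H-work
	 * dependency so lockdep can actually check it.
	 */
	flush_work(worker);
}

static void guc_g2h_wq_fini(void)
{
	/* Drain on suspend/driver unload, then tear down. */
	flush_workqueue(guc_g2h_wq);
	destroy_workqueue(guc_g2h_wq);
}

The whole point versus the msleep() loop is that a flush is visible to
lockdep, so inversions like the splat below get caught instead of being
hand-waved away.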
> >
> > Either way, this must be fixed, because I've seen too many of these "it
> > never happens in practice" cases blow up, plus if your locking scheme is
> > engineered with quicksand, forget about anyone ever understanding it.
>
> The solution is to allocate memory for the error capture in an atomic
> context if the error capture is being done from the G2H work queue. That
> means the allocation can possibly fail if the system is under memory
> pressure, but that is better than a lockdep splat.

Ah yeah, if this is for error capture then GFP_ATOMIC is the right option.
A rough sketch of what that could look like follows the quoted splat below.
-Daniel

>
> Matt
>
> > -Daniel
> >
> > >
> > > Splat below:
> > > [ 154.625989] ======================================================
> > > [ 154.632195] WARNING: possible circular locking dependency detected
> > > [ 154.638393] 5.14.0-rc5-guc+ #50 Tainted: G U
> > > [ 154.643991] ------------------------------------------------------
> > > [ 154.650196] i915_selftest/1673 is trying to acquire lock:
> > > [ 154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0
> > > [ 154.665826]
> > > but task is already holding lock:
> > > [ 154.671682] ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
> > > [ 154.680659]
> > > which lock already depends on the new lock.
> > >
> > > [ 154.688857]
> > > the existing dependency chain (in reverse order) is:
> > > [ 154.696365]
> > > -> #2 (&gt->reset.mutex){+.+.}-{3:3}:
> > > [ 154.702571] lock_acquire+0xd2/0x300
> > > [ 154.706695] i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]
> > > [ 154.712959] intel_gt_init_reset+0x61/0x80 [i915]
> > > [ 154.718258] intel_gt_init_early+0xe6/0x120 [i915]
> > > [ 154.723648] i915_driver_probe+0x592/0xdc0 [i915]
> > > [ 154.728942] i915_pci_probe+0x43/0x1c0 [i915]
> > > [ 154.733891] pci_device_probe+0x9b/0x110
> > > [ 154.738362] really_probe+0x1a6/0x3a0
> > > [ 154.742568] __driver_probe_device+0xf9/0x170
> > > [ 154.747468] driver_probe_device+0x19/0x90
> > > [ 154.752114] __driver_attach+0x99/0x170
> > > [ 154.756492] bus_for_each_dev+0x73/0xc0
> > > [ 154.760870] bus_add_driver+0x14b/0x1f0
> > > [ 154.765248] driver_register+0x67/0xb0
> > > [ 154.769542] i915_init+0x18/0x8c [i915]
> > > [ 154.773964] do_one_initcall+0x53/0x2e0
> > > [ 154.778343] do_init_module+0x56/0x210
> > > [ 154.782639] load_module+0x25fc/0x29f0
> > > [ 154.786934] __do_sys_finit_module+0xae/0x110
> > > [ 154.791835] do_syscall_64+0x38/0xc0
> > > [ 154.795958] entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [ 154.801558]
> > > -> #1 (fs_reclaim){+.+.}-{0:0}:
> > > [ 154.807241] lock_acquire+0xd2/0x300
> > > [ 154.811361] fs_reclaim_acquire+0x9e/0xd0
> > > [ 154.815914] kmem_cache_alloc_trace+0x30/0x790
> > > [ 154.820899] i915_gpu_coredump_alloc+0x53/0x1a0 [i915]
> > > [ 154.826649] i915_gpu_coredump+0x39/0x560 [i915]
> > > [ 154.831866] i915_capture_error_state+0xa/0x70 [i915]
> > > [ 154.837513] intel_guc_context_reset_process_msg+0x174/0x1f0 [i915]
> > > [ 154.844383] ct_incoming_request_worker_func+0x130/0x1b0 [i915]
> > > [ 154.850898] process_one_work+0x264/0x590
> > > [ 154.855451] worker_thread+0x4b/0x3a0
> > > [ 154.859655] kthread+0x147/0x170
> > > [ 154.863428] ret_from_fork+0x1f/0x30
> > > [ 154.867548]
> > > -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}:
> > > [ 154.875747] check_prev_add+0x90/0xc30
> > > [ 154.880042] __lock_acquire+0x1643/0x2110
> > > [ 154.884595] lock_acquire+0xd2/0x300
> > > [ 154.888715] __flush_work+0x373/0x4d0
> > > [ 154.892920] intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
> > > [ 154.899606] intel_uc_reset_prepare+0x40/0x50 [i915]
> > > [ 154.905166] reset_prepare+0x55/0x60 [i915]
> > > [ 154.909946] intel_gt_reset+0x11c/0x300 [i915]
> > > [ 154.914984] do_device_reset+0x13/0x20 [i915]
> > > [ 154.919936] check_whitelist_across_reset+0x166/0x250 [i915]
> > > [ 154.926212] live_reset_whitelist.cold+0x6a/0x7a [i915]
> > > [ 154.932037] __i915_subtests.cold+0x20/0x74 [i915]
> > > [ 154.937428] __run_selftests.cold+0x96/0xee [i915]
> > > [ 154.942816] i915_live_selftests+0x2c/0x60 [i915]
> > > [ 154.948125] i915_pci_probe+0x93/0x1c0 [i915]
> > > [ 154.953076] pci_device_probe+0x9b/0x110
> > > [ 154.957545] really_probe+0x1a6/0x3a0
> > > [ 154.961749] __driver_probe_device+0xf9/0x170
> > > [ 154.966653] driver_probe_device+0x19/0x90
> > > [ 154.971290] __driver_attach+0x99/0x170
> > > [ 154.975671] bus_for_each_dev+0x73/0xc0
> > > [ 154.980053] bus_add_driver+0x14b/0x1f0
> > > [ 154.984431] driver_register+0x67/0xb0
> > > [ 154.988725] i915_init+0x18/0x8c [i915]
> > > [ 154.993149] do_one_initcall+0x53/0x2e0
> > > [ 154.997527] do_init_module+0x56/0x210
> > > [ 155.001822] load_module+0x25fc/0x29f0
> > > [ 155.006118] __do_sys_finit_module+0xae/0x110
> > > [ 155.011019] do_syscall_64+0x38/0xc0
> > > [ 155.015139] entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [ 155.020729]
> > > other info that might help us debug this:
> > >
> > > [ 155.028752] Chain exists of:
> > >   (work_completion)(&ct->requests.worker) --> fs_reclaim --> &gt->reset.mutex
> > >
> > > [ 155.041294] Possible unsafe locking scenario:
> > >
> > > [ 155.047240]        CPU0                    CPU1
> > > [ 155.051791]        ----                    ----
> > > [ 155.056344]   lock(&gt->reset.mutex);
> > > [ 155.060026]                                lock(fs_reclaim);
> > > [ 155.065706]                                lock(&gt->reset.mutex);
> > > [ 155.071912]   lock((work_completion)(&ct->requests.worker));
> > > [ 155.077595]
> > > *** DEADLOCK ***
> > >
> > > [ 155.083534] 2 locks held by i915_selftest/1673:
> > > [ 155.088086] #0: ffff888103851240 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x8e/0x170
> > > [ 155.096460] #1: ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
> > > [ 155.105845]
> > > stack backtrace:
> > > [ 155.110223] CPU: 6 PID: 1673 Comm: i915_selftest Tainted: G U 5.14.0-rc5-guc+ #50
> > > [ 155.119035] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U LPDDR4/4x T4 RVP, BIOS TGLSFWI1.R00.3425.A00.2010162309 10/16/2020
> > > [ 155.132530] Call Trace:
> > > [ 155.134999] dump_stack_lvl+0x57/0x7d
> > > [ 155.138690] check_noncircular+0x12d/0x150
> > > [ 155.142814] check_prev_add+0x90/0xc30
> > > [ 155.146587] __lock_acquire+0x1643/0x2110
> > > [ 155.150618] lock_acquire+0xd2/0x300
> > > [ 155.154218] ? __flush_work+0x350/0x4d0
> > > [ 155.158073] ? __lock_acquire+0x5f3/0x2110
> > > [ 155.162194] __flush_work+0x373/0x4d0
> > > [ 155.165883] ? __flush_work+0x350/0x4d0
> > > [ 155.169739] ? mark_held_locks+0x49/0x70
> > > [ 155.173686] ? _raw_spin_unlock_irqrestore+0x50/0x70
> > > [ 155.178679] intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
> > > [ 155.184857] ? _raw_spin_unlock_irqrestore+0x50/0x70
> > > [ 155.189843] intel_uc_reset_prepare+0x40/0x50 [i915]
> > > [ 155.194891] reset_prepare+0x55/0x60 [i915]
> > > [ 155.199145] intel_gt_reset+0x11c/0x300 [i915]
> > > [ 155.203657] ? _raw_spin_unlock_irqrestore+0x3d/0x70
> > > [ 155.208641] ? do_engine_reset+0x10/0x10 [i915]
> > > [ 155.213243] do_device_reset+0x13/0x20 [i915]
> > > [ 155.217664] check_whitelist_across_reset+0x166/0x250 [i915]
> > > [ 155.223415] live_reset_whitelist.cold+0x6a/0x7a [i915]
> > > [ 155.228725] __i915_subtests.cold+0x20/0x74 [i915]
> > > [ 155.233593] ? __i915_live_teardown+0x50/0x50 [i915]
> > > [ 155.238644] ? __intel_gt_live_setup+0x30/0x30 [i915]
> > > [ 155.243773] __run_selftests.cold+0x96/0xee [i915]
> > > [ 155.248646] i915_live_selftests+0x2c/0x60 [i915]
> > > [ 155.253425] i915_pci_probe+0x93/0x1c0 [i915]
> > > [ 155.257854] ? _raw_spin_unlock_irqrestore+0x3d/0x70
> > > [ 155.262839] pci_device_probe+0x9b/0x110
> > > [ 155.266785] really_probe+0x1a6/0x3a0
> > > [ 155.270467] __driver_probe_device+0xf9/0x170
> > > [ 155.274846] driver_probe_device+0x19/0x90
> > > [ 155.278968] __driver_attach+0x99/0x170
> > > [ 155.282824] ? __device_attach_driver+0xd0/0xd0
> > > [ 155.287376] ? __device_attach_driver+0xd0/0xd0
> > > [ 155.291928] bus_for_each_dev+0x73/0xc0
> > > [ 155.295792] bus_add_driver+0x14b/0x1f0
> > > [ 155.299648] driver_register+0x67/0xb0
> > > [ 155.303419] i915_init+0x18/0x8c [i915]
> > > [ 155.307328] ? 0xffffffffa0188000
> > > [ 155.310669] do_one_initcall+0x53/0x2e0
> > > [ 155.314525] ? rcu_read_lock_sched_held+0x4d/0x80
> > > [ 155.319253] ? kmem_cache_alloc_trace+0x5ad/0x790
> > > [ 155.323982] do_init_module+0x56/0x210
> > > [ 155.327754] load_module+0x25fc/0x29f0
> > > [ 155.331528] ? __do_sys_finit_module+0xae/0x110
> > > [ 155.336081] __do_sys_finit_module+0xae/0x110
> > > [ 155.340462] do_syscall_64+0x38/0xc0
> > > [ 155.344060] entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [ 155.349137] RIP: 0033:0x7efc4496389d
> > > [ 155.352735] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 f5 0c 00 f7 d8 64 89 01 48
> > > [ 155.371530] RSP: 002b:00007ffeb3e23218 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > > [ 155.379123] RAX: ffffffffffffffda RBX: 0000557664b72240 RCX: 00007efc4496389d
> > > [ 155.386282] RDX: 0000000000000000 RSI: 0000557664b69610 RDI: 0000000000000004
> > > [ 155.393443] RBP: 0000000000000020 R08: 00007ffeb3e21ff0 R09: 0000557664b682f0
> > > [ 155.400603] R10: 00007ffeb3e23360 R11: 0000000000000246 R12: 0000557664b69610
> > > [ 155.407761] R13: 0000000000000000 R14: 0000557664b6ada0 R15: 0000557664b72240
> > > >
> > > > This is the major reason why inventing any wheels locally in the kernel
> > > > isn't a good idea - aside from the duplicated code, because likely there
> > > > is already a solution for whatever you need. There are all the validation
> > > > tools, careful thought about semantics (especially around races) and all
> > > > that stuff that you're always missing on your hand-rolled thing. Aside
> > > > from that, anything hand-rolled is much harder to read, since the intent
> > > > isn't clear.
> > > >
> > > > -Daniel
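
On the GFP_ATOMIC option above: very roughly the shape below is what I mean.
A sketch only, untested, and capture_stub/capture_alloc are made-up names,
not the actual i915 error-capture code:

#include <linux/slab.h>

struct capture_stub {		/* stand-in for the real capture state */
	void *snapshot;
	size_t size;
};

static struct capture_stub *capture_alloc(bool from_g2h_worker)
{
	/*
	 * GFP_ATOMIC never enters direct reclaim, so the
	 * fs_reclaim -> &gt->reset.mutex edge from the splat above can
	 * never be taken from the G2H worker. Trade-off: the allocation
	 * may fail under memory pressure, and the error capture is then
	 * simply skipped.
	 */
	gfp_t gfp = from_g2h_worker ? GFP_ATOMIC : GFP_KERNEL;

	return kzalloc(sizeof(struct capture_stub), gfp);
}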
> > > >
> > > > >
> > > > > Matt
> > > > >
> > > > > > -Daniel
> > > > > >
> > > > > > > +
> > > > > > >  	scrub_guc_desc_for_outstanding_g2h(guc);
> > > > > > >  }
> > > > > > >
> > > > > > > --
> > > > > > > 2.32.0
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch