Applied. Thanks!

Alex

On Fri, Aug 19, 2022 at 6:07 AM Christian König <christian.koenig@xxxxxxx> wrote:
>
> On 19.08.22 at 11:34, 李真能 wrote:
>
> On 2022/8/17 19:40, Christian König wrote:
>
> On 17.08.22 at 09:31, 李真能 wrote:
>
>
> On 2022/8/15 21:12, Christian König wrote:
>
> On 15.08.22 at 09:34, 李真能 wrote:
>
>
> On 2022/8/12 18:55, Christian König wrote:
>
> On 11.08.22 at 09:25, Zhenneng Li wrote:
>
> Although the radeon driver fences and waits for the GPU to finish processing
> the current batch of rings, there is still a corner case where the radeon
> lockup work queue may not be fully flushed while radeon_suspend_kms() has
> already called pci_set_power_state() to put the device into the D3hot state.
>
>
> If I'm not completely mistaken, the reset worker uses the suspend/resume functionality as well to get the hardware into a working state again.
>
> So if I'm not completely mistaken, this here would lead to a deadlock; please double check that.
>
>
> We have tested many times; there is no deadlock.
>
>
> Testing doesn't tell you anything, you need to audit the call paths.
>
> In which situation would this lead to a deadlock?
>
>
> GPU resets.
>
>
> Although flush_delayed_work(&rdev->fence_drv[i].lockup_work) waits for lockup_work to finish executing its last queued invocation, this kernel function doesn't take any lock, and lockup_work runs in another kernel thread, so I think flush_delayed_work cannot lead to a deadlock.
>
> Therefore, if radeon_gpu_reset is called in another thread while radeon_suspend_kms is blocked on flush_delayed_work, this cannot lead to a deadlock.
>
>
> Ok, sounds like you didn't get what I wanted to say.
>
> The key problem is that radeon_gpu_reset() calls radeon_suspend(), which in turn calls rdev->asic->suspend().
>
> And this function in turn could end up in radeon_suspend_kms() again, but I'm not 100% sure about that.
>
> Just double check the order of the functions called here (e.g. whether radeon_suspend_kms() calls radeon_suspend() or the other way around).
> Apart from that, your patch looks correct to me as well.
>
> radeon_gpu_reset will call radeon_suspend, but radeon_suspend will not call radeon_suspend_kms; for more detail of the call flow, see the attached file: radeon-suspend-reset.png.
>
> Sorry, I may have mistaken your meaning.
>
>
> Ok, in this case my memory of the call flow wasn't correct any more.
>
> Feel free to add an Acked-by: Christian König <christian.koenig@xxxxxxx> to the patch.
>
> Alex should then pick it up for upstreaming.
>
> Thanks,
> Christian.
>
>
> Regards,
> Christian.
>
>
>
> Regards,
> Christian.
>
>
>
> Regards,
> Christian.
>
> Per PCI spec rev 4.0, section 5.3.1.4.1, D3hot State:
>
> Configuration and Message requests are the only TLPs accepted by a Function in
> the D3hot state. All other received Requests must be handled as Unsupported Requests,
> and all received Completions may optionally be handled as Unexpected Completions.
>
> This issue shows up in the following log:
> Unable to handle kernel paging request at virtual address 00008800e0008010
> CPU 0 kworker/0:3(131): Oops 0
> pc = [<ffffffff811bea5c>] ra = [<ffffffff81240844>] ps = 0000 Tainted: G W
> pc is at si_gpu_check_soft_reset+0x3c/0x240
> ra is at si_dma_is_lockup+0x34/0xd0
> v0 = 0000000000000000 t0 = fff08800e0008010 t1 = 0000000000010000
> t2 = 0000000000008010 t3 = fff00007e3c00000 t4 = fff00007e3c00258
> t5 = 000000000000ffff t6 = 0000000000000001 t7 = fff00007ef078000
> s0 = fff00007e3c016e8 s1 = fff00007e3c00000 s2 = fff00007e3c00018
> s3 = fff00007e3c00000 s4 = fff00007fff59d80 s5 = 0000000000000000
> s6 = fff00007ef07bd98
> a0 = fff00007e3c00000 a1 = fff00007e3c016e8 a2 = 0000000000000008
> a3 = 0000000000000001 a4 = 8f5c28f5c28f5c29 a5 = ffffffff810f4338
> t8 = 0000000000000275 t9 = ffffffff809b66f8 t10 = ff6769c5d964b800
> t11= 000000000000b886 pv = ffffffff811bea20 at = 0000000000000000
> gp = ffffffff81d89690 sp = 00000000aa814126
> Disabling lock debugging due to kernel taint
> Trace:
> [<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
> [<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290
> [<ffffffff80977010>] process_one_work+0x280/0x550
> [<ffffffff80977350>] worker_thread+0x70/0x7c0
> [<ffffffff80977410>] worker_thread+0x130/0x7c0
> [<ffffffff80982040>] kthread+0x200/0x210
> [<ffffffff809772e0>] worker_thread+0x0/0x7c0
> [<ffffffff80981f8c>] kthread+0x14c/0x210
> [<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20
> [<ffffffff80981e40>] kthread+0x0/0x210
> Code: ad3e0008 43f0074a ad7e0018 ad9e0020 8c3001e8 40230101
> <88210000> 4821ed21
>
> So force a flush of the lockup work queue to fix this problem.
>
> Signed-off-by: Zhenneng Li <lizhenneng@xxxxxxxxxx>
> ---
>  drivers/gpu/drm/radeon/radeon_device.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
> index 15692cb241fc..e608ca26780a 100644
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend,
>  		if (r) {
>  			/* delay GPU reset to resume */
>  			radeon_fence_driver_force_completion(rdev, i);
> +		} else {
> +			/* finish executing delayed work */
> +			flush_delayed_work(&rdev->fence_drv[i].lockup_work);
>  		}
>  	}