Although the radeon driver fences and waits for the GPU to finish processing the current batch of rings, there is still a corner case where the radeon lockup work queue may not be fully flushed while radeon_suspend_kms() has already called pci_set_power_state() to put the device into the D3hot state.
Per PCI spec rev 4.0, section 5.3.1.4.1 "D3hot State":

> Configuration and Message requests are the only TLPs accepted by a Function in
> the D3hot state. All other received Requests must be handled as Unsupported Requests,
> and all received Completions may optionally be handled as Unexpected Completions.
Well, first of all, this is completely the wrong place for this. The flush belongs in the fence code, not here.
And I don't think this is a good idea anyway, since it might cause deadlocks.
Christian.
This issue shows up in the following oops log:
Unable to handle kernel paging request at virtual address 00008800e0008010
CPU 0 kworker/0:3(131): Oops 0
pc = [<ffffffff811bea5c>]  ra = [<ffffffff81240844>]  ps = 0000  Tainted: G W
pc is at si_gpu_check_soft_reset+0x3c/0x240
ra is at si_dma_is_lockup+0x34/0xd0
v0 = 0000000000000000  t0 = fff08800e0008010  t1 = 0000000000010000
t2 = 0000000000008010  t3 = fff00007e3c00000  t4 = fff00007e3c00258
t5 = 000000000000ffff  t6 = 0000000000000001  t7 = fff00007ef078000
s0 = fff00007e3c016e8  s1 = fff00007e3c00000  s2 = fff00007e3c00018
s3 = fff00007e3c00000  s4 = fff00007fff59d80  s5 = 0000000000000000
s6 = fff00007ef07bd98
a0 = fff00007e3c00000  a1 = fff00007e3c016e8  a2 = 0000000000000008
a3 = 0000000000000001  a4 = 8f5c28f5c28f5c29  a5 = ffffffff810f4338
t8 = 0000000000000275  t9 = ffffffff809b66f8  t10 = ff6769c5d964b800
t11= 000000000000b886  pv = ffffffff811bea20  at = 0000000000000000
gp = ffffffff81d89690  sp = 00000000aa814126
Disabling lock debugging due to kernel taint
Trace:
[<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
[<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290
[<ffffffff80977010>] process_one_work+0x280/0x550
[<ffffffff80977350>] worker_thread+0x70/0x7c0
[<ffffffff80977410>] worker_thread+0x130/0x7c0
[<ffffffff80982040>] kthread+0x200/0x210
[<ffffffff809772e0>] worker_thread+0x0/0x7c0
[<ffffffff80981f8c>] kthread+0x14c/0x210
[<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20
[<ffffffff80981e40>] kthread+0x0/0x210
Code: ad3e0008 43f0074a ad7e0018 ad9e0020 8c3001e8 40230101<88210000> 4821ed21
So force a flush of the lockup work queue before the device is powered down to fix this problem.
Reviewed-by: Su Weiqiang <suweiqiang@xxxxxxxxx>
Reviewed-by: Zhou Xuemei <zhouxuemei@xxxxxxxxx>
Signed-off-by: Xu Chenjiao <xuchenjiao@xxxxxxxxx>
---
 drivers/gpu/drm/radeon/radeon_device.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 59c8a6647ff2..cc1c07963116 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1625,6 +1625,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend,
 		if (r) {
 			/* delay GPU reset to resume */
 			radeon_fence_driver_force_completion(rdev, i);
+		} else {
+			/* finish executing delayed work */
+			flush_delayed_work(&rdev->fence_drv[i].lockup_work);
 		}
 	}
-- 
2.17.1