Re: [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute"

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Tue, 24 Oct 2023 10:09:14 +0200



    Am 24.10.23 um 01:41 schrieb Felix Kuehling:

    
      [sorry, I hit send too early]
      

      On 2023-10-23 11:15, Christian König wrote:

      Am 23.10.23 um 15:06 schrieb Daniel Tang:

        That commit causes the screen to freeze
          a few moments after running

          clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the
          computer

          including ssh also freezes. On v6.5-rc1, it only results in a
          NULL pointer

          deference message in dmesg and the process to become a zombie
          whose

          unkillableness prevents shutdown without REISUB. Although
          llama.cpp and

          hashcat were working in v6.2 and ROCm 5.6, broke, and are not
          fixed by

          this revert, pytorch-rocm is now working with stability and
          without

          whole-computer freezes caused by any accidental running of
          clinfo.

          
          This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639.

        
        That result doesn't make much sense. Felix please correct me,
        but AFAIK the ATS stuff was completely removed by now.

        
        Are you sure that this is pure v6.6-rc7 and not some other
        patches applied? If yes than we must have missed something.

      
      This revert doesn't really affect systems with ATS. It moves the
      sanity check back out of the ATS-specific code.

    
    Ah! I've read the code wrong, that makes much more sense now.

    
      The Null pointer dereference in the bug report comes from the CPU
      page table update code:

      [10089.267556] BUG: kernel NULL pointer dereference, address: 0000000000000000
[10089.267563] #PF: supervisor write access in kernel mode
[10089.267566] #PF: error_code(0x0002) - not-present page
[10089.267569] PGD 0 P4D 0 
[10089.267574] Oops: 0002 [#1] PREEMPT SMP NOPTI
[10089.267578] CPU: 23 PID: 18191 Comm: clinfo Tainted: G           OE      6.5.0-9-generic #9-Ubuntu
[10089.267582] Hardware name: Micro-Star International Co., Ltd. MS-7C37/X570-A PRO (MS-7C37), BIOS H.I0 08/10/2022
[10089.267585] RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu]
[10089.267820] Code: 90 90 90 90 90 90 90 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 00 55 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 48 89 e5 <48> 89 0e 31 c0 5d 31 d2 31 c9 31 f6 45 31 c0 e9 89 7e 27 fb 66 0f
[10089.267823] RSP: 0018:ffffb49805eeb8b0 EFLAGS: 00010246
[10089.267827] RAX: 0000000000000000 RBX: 0000000000200000 RCX: 0040000000000480
[10089.267830] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9890d4380000
[10089.267832] RBP: ffffb49805eeb8b0 R08: 0040000000000480 R09: 0000000000200000
[10089.267835] R10: 0000000800100200 R11: 0000000800100200 R12: ffffb49805eeba98
[10089.267837] R13: 0000000000000001 R14: 0000000000200000 R15: 0000000000000001
[10089.267840] FS:  00007f8ca9f09740(0000) GS:ffff9897befc0000(0000) knlGS:0000000000000000
[10089.267843] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10089.267846] CR2: 0000000000000000 CR3: 00000002e0746000 CR4: 0000000000750ee0
[10089.267849] PKRU: 55555554
[10089.267851] Call Trace:
[10089.267853]  <TASK>
[10089.267858]  ? show_regs+0x6d/0x80
[10089.267865]  ? __die+0x24/0x80
[10089.267870]  ? page_fault_oops+0x99/0x1b0
[10089.267876]  ? do_user_addr_fault+0x316/0x6b0
[10089.267879]  ? srso_alias_return_thunk+0x5/0x7f
[10089.267884]  ? scsi_dispatch_cmd+0x91/0x240
[10089.267891]  ? exc_page_fault+0x83/0x1b0
[10089.267896]  ? asm_exc_page_fault+0x27/0x30
[10089.267904]  ? amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu]
[10089.268140]  amdgpu_vm_cpu_update+0xa9/0x130 [amdgpu]
...

      This revert is just a roundabout way of disabling CPU page table
      updates for compute VMs. But I don't think it really addresses the
      root cause.

    
    Yeah, completely agree. Looks like some page tables isn't CPU
    accessible for some reason.

    
    Going to take a look when I have time.

    
    Regards,

    Christian.

    
      Regards,

        Felix
      

        Regards,

        Christian.

        
          Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596

          Signed-off-by: Daniel Tang <danielzgtg.opensource@xxxxxxxxx>

          ---

            drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++++++------

            1 file changed, 6 insertions(+), 6 deletions(-)

          
          diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
          b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

          index 82f25996ff5e..602f311ab766 100644

          --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

          +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

          @@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct
          amdgpu_device *adev, struct amdgpu_vm *vm)

                if (r)

                    return r;

            +    /* Sanity checks */

          +    if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {

          +        r = -EINVAL;

          +        goto unreserve_bo;

          +    }

          +

                /* Check if PD needs to be reinitialized and do it
          before

                 * changing any other state, in case it fails.

                 */

                if (pte_support_ats != vm->pte_support_ats) {

          -        /* Sanity checks */

          -        if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {

          -            r = -EINVAL;

          -            goto unreserve_bo;

          -        }

          -

                    vm->pte_support_ats = pte_support_ats;

                    r = amdgpu_vm_pt_clear(adev, vm,
          to_amdgpu_bo_vm(vm->root.bo),

                                   false);

          -- 2.40.1