Am 09.02.20 um 03:52 schrieb Pan,
Xinhui:
> hit panic when we update the page tables.
>
> <1>[ 122.103290] BUG: kernel NULL pointer
dereference, address: 0000000000000008
> <1>[ 122.103348] #PF: supervisor read access
in kernel mode
> <1>[ 122.103376] #PF: error_code(0x0000) -
not-present page
> <6>[ 122.103403] PGD 0 P4D 0
> <4>[ 122.103421] Oops: 0000 [#1] SMP PTI
> <4>[ 122.103442] CPU: 13 PID: 2133 Comm:
kfdtest Tainted: G OE 5.4.0-rc7+ #7
> <4>[ 122.103480] Hardware name: Supermicro
SYS-7048GR-TR/X10DRG-Q, BIOS 3.0b 03/09/2018
> <4>[ 122.103657] RIP:
0010:amdgpu_vm_update_pdes+0x140/0x330 [amdgpu]
> <4>[ 122.103689] Code: 03 4c 89 73 08 49 89
9d c8 00 00 00 48 8b 7b f0 c6 43 10 00 45 31 c0 48 8b 87
28 04 00 00 48 85 c0 74 07 4c 8b 80 20 04 00 00
<4d> 8b 70 08 31 f6 49 8b 86 28 04 00 00 48 85 c0
74 0f 48 8b 80 28
> <4>[ 122.103769] RSP: 0018:ffffb49a0a6a3a98
EFLAGS: 00010246
> <4>[ 122.103797] RAX: 0000000000000000 RBX:
ffff9020f823c148 RCX: dead000000000122
> <4>[ 122.103831] RDX: ffff9020ece70018 RSI:
ffff9020f823c0c8 RDI: ffff9010ca31c800
> <4>[ 122.103865] RBP: ffffb49a0a6a3b38 R08:
0000000000000000 R09: 0000000000000001
> <4>[ 122.103899] R10: 000000006044f994 R11:
00000000df57fb58 R12: ffff9020f823c000
> <4>[ 122.103933] R13: ffff9020f823c000 R14:
ffff9020f823c0c8 R15: ffff9010d5d20000
> <4>[ 122.103968] FS: 00007f32c83dc780(0000)
GS:ffff9020ff380000(0000) knlGS:0000000000000000
> <4>[ 122.104006] CS: 0010 DS: 0000 ES: 0000
CR0: 0000000080050033
> <4>[ 122.104035] CR2: 0000000000000008 CR3:
0000002036bba005 CR4: 00000000003606e0
> <4>[ 122.104069] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
> <4>[ 122.104103] DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
> <4>[ 122.104137] Call Trace:
> <4>[ 122.104241] vm_update_pds+0x31/0x50
[amdgpu]
> <4>[ 122.104347]
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x2ef/0x690
[amdgpu]
> <4>[ 122.104466]
kfd_process_alloc_gpuvm+0x98/0x190 [amdgpu]
> <4>[ 122.104576]
kfd_process_device_init_vm.part.8+0xf3/0x1f0 [amdgpu]
> <4>[ 122.104688]
kfd_process_device_init_vm+0x24/0x30 [amdgpu]
> <4>[ 122.104794]
kfd_ioctl_acquire_vm+0xa4/0xc0 [amdgpu]
> <4>[ 122.104900] kfd_ioctl+0x277/0x500
[amdgpu]
> <4>[ 122.105001] ?
kfd_ioctl_free_memory_of_gpu+0xc0/0xc0 [amdgpu]
> <4>[ 122.105039] ?
rcu_read_lock_sched_held+0x4f/0x80
> <4>[ 122.105068] ?
kmem_cache_free+0x2ba/0x300
> <4>[ 122.105093] ? vm_area_free+0x18/0x20
> <4>[ 122.105117] ? find_held_lock+0x35/0xa0
> <4>[ 122.105143] do_vfs_ioctl+0xa9/0x6f0
> <4>[ 122.106001] ksys_ioctl+0x75/0x80
> <4>[ 122.106802] ? do_syscall_64+0x17/0x230
> <4>[ 122.107605] __x64_sys_ioctl+0x1a/0x20
> <4>[ 122.108378] do_syscall_64+0x5f/0x230
> <4>[ 122.109118]
entry_SYSCALL_64_after_hwframe+0x49/0xbe
> <4>[ 122.109842] RIP: 0033:0x7f32c6b495d7
>
> Signed-off-by: xinhui pan
<xinhui.pan@xxxxxxx>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3195bc90985a..3c388fdf335c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2619,7 +2619,7 @@ void
amdgpu_vm_bo_invalidate(struct amdgpu_device *adev,
> continue;
> bo_base->moved = true;
>
> - if (bo->tbo.type ==
ttm_bo_type_kernel)
> + if (bo->tbo.type ==
ttm_bo_type_kernel && bo->parent)
Good catch, but that would mean that we move the root PD
to the moved
state which in turn is illegal as well.
Maybe better adjust amdgpu_vm_bo_relocated() to move the
root PD to the
idle state instead.
Christian.
>
amdgpu_vm_bo_relocated(bo_base);
> else if (bo->tbo.base.resv ==
vm->root.base.bo->tbo.base.resv)
> amdgpu_vm_bo_moved(bo_base);