It looks like the multi-level page table changes have been submitted. They're causing problems when we're trying to integrate them into our KFD branch. We resolved the obvious conflicts, and it's working on older ASICs without problems. But we're getting hangs on Vega10.

With my patch to enable UTCL2 interrupts, I'm seeing lots of VM faults (see the log below). The VM_L2_PROTECTION_FAULT_STATUS indicates a WALKER_ERROR (3 = PDE1 value); I've included my decode of the status word after the log. If I set adev->vm_manager.num_level = 1 in gmc_v9_0_vm_init, the problem goes away (basically reverting b98e6b5 "drm/amdgpu: enable four level VMPT for gmc9"); the exact hack is also quoted after the log.

I suspect an issue that's exposed by how the KFD Thunk library manages the shared virtual address space. We typically start at fairly high virtual addresses and reserve the lower 1/4 of our address space for coherent mappings (the aperture-based scheme for pre-gfx9). The address in the fault below is 0x0000001000d80000, so a bit above 64GB, near the start of our non-coherent range. Simple KFD tests that don't use the non-coherent (high) address range seem to be working fine. That tells me that the multi-level page table code has a problem with high addresses; a quick per-level index calculation for the faulting address (also after the log) fits that picture.

I'll keep digging ...

Regards,
  Felix

[ 24.768477] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.777361] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.784204] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.791418] amdgpu 0000:03:00.0: IH ring buffer overflow (0x00083E00, 0x00000740, 0x00003E20)
[ 24.791421] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.800299] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.807154] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.814370] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.823251] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.830098] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.837312] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.846190] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.853056] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.860273] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.869151] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.875994] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.883209] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.892087] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.898933] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.906170] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.915059] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.921910] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.929143] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.938021] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.944874] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.952089] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.960967] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.967810] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 29.610925] gmc_v9_0_process_interrupt: 3402060 callbacks suppressed
[ 29.610926] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 29.628202] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 29.641520] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
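For reference, this is how I'm decoding that fault status word. The field layout is from my reading of the Vega10 register headers, so treat the exact bit positions as an assumption rather than gospel:

/* Quick sketch: decode VM_L2_PROTECTION_FAULT_STATUS = 0x00841157.
 * Field offsets are my reading of the Vega10 register headers;
 * double-check them before relying on this.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t status = 0x00841157;

	printf("MORE_FAULTS:       %u\n", status & 0x1);        /* 1 */
	printf("WALKER_ERROR:      %u\n", (status >> 1) & 0x7); /* 3 = PDE1 value */
	printf("PERMISSION_FAULTS: %u\n", (status >> 4) & 0xf); /* 5 */
	printf("MAPPING_ERROR:     %u\n", (status >> 8) & 0x1); /* 1 */
	return 0;
}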
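For completeness, the workaround is literally just this one-liner at the end of gmc_v9_0_vm_init() (a local hack, not a proposed fix):

	/* HACK: fall back to a single level VMPT; this makes the
	 * Vega10 hangs and VM faults go away.
	 */
	adev->vm_manager.num_level = 1;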
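And here's a quick sanity check of where the faulting address lands in a four-level walk. I'm assuming a 48-bit VA split as 9+9+9+9 bits of indices plus a 12-bit page offset; the real split depends on the block size, so this is only a sketch:

/* Per-level page table indices for the faulting address, assuming
 * 4KB pages and 512 entries (9 bits) per level.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t addr = 0x0000001000d80000ULL; /* 64GB + 13.5MB */

	printf("PDE2 index: %llu\n", (unsigned long long)((addr >> 39) & 0x1ff)); /* 0   */
	printf("PDE1 index: %llu\n", (unsigned long long)((addr >> 30) & 0x1ff)); /* 64  */
	printf("PDE0 index: %llu\n", (unsigned long long)((addr >> 21) & 0x1ff)); /* 6   */
	printf("PTE  index: %llu\n", (unsigned long long)((addr >> 12) & 0x1ff)); /* 384 */
	return 0;
}

If that split is right, the walk needs PDE1 entries at index 64 and up, which only addresses at 64GB and above ever touch. That would line up with the WALKER_ERROR pointing at a PDE1 value, but I haven't confirmed it yet.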
On 17-03-27 01:53 AM, Chunming Zhou wrote:
> *** BLURB HERE ***
> From Vega, ASICs start to support multi-level VMPT; this series implements it.
>
> Tested successfully with 2/3/4 levels.
>
> V2: address Christian's comments.
>
> Max vm size 256TB tested ok.
>
>
> Christian König (10):
>   drm/amdgpu: rename page_directory_fence to last_dir_update
>   drm/amdgpu: add the VM pointer to the amdgpu_pte_update_params as well
>   drm/amdgpu: add num_level to the VM manager
>   drm/amdgpu: generalize page table level
>   drm/amdgpu: handle multi level PD size calculation
>   drm/amdgpu: handle multi level PD during validation
>   drm/amdgpu: handle multi level PD in the LRU
>   drm/amdgpu: handle multi level PD updates V2
>   drm/amdgpu: handle multi level PD during PT updates
>   drm/amdgpu: add alloc/free for multi level PDs V2
>
> Chunming Zhou (5):
>   drm/amdgpu: abstract block size to one function
>   drm/amdgpu: limit block size to one page
>   drm/amdgpu: adapt vm size for multi vmpt
>   drm/amdgpu: set page table depth by num_level
>   drm/amdgpu: enable four level VMPT for gmc9
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |   6 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  67 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 474 +++++++++++++++++++----------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h     |  16 +-
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c   |   3 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c      |   1 +
>  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c      |   1 +
>  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c      |   1 +
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |   7 +
>  drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c    |   2 +-
>  11 files changed, 380 insertions(+), 200 deletions(-)
>