> -----Original Message-----
> From: Kuehling, Felix
> Sent: Tuesday, March 28, 2017 4:15 PM
> To: amd-gfx at lists.freedesktop.org; Koenig, Christian; Zhou,
> David(ChunMing); Deucher, Alexander
> Cc: Russell, Kent
> Subject: Multilevel page tables broken for high addresses
>
> It looks like the multi-level page table changes have been submitted.
> They're causing problems when we're trying to integrate them into our
> KFD branch.
>
> We resolved the obvious conflicts and it's working on older ASICs without
> problems. But we're getting hangs on Vega10. With my patch to enable
> UTCL2 interrupts, I'm seeing lots of VM faults (see below). The
> VM_L2_PROTECTION_FAULT_STATUS indicates a WALKER_ERROR (3 = PDE1 value).
>
> If I set adev->vm_manager.num_level = 1 in gmc_v9_0_vm_init, the problem
> goes away (basically reverting b98e6b5 "drm/amdgpu: enable four level
> VMPT for gmc9").
>
> I suspect the issue is exposed by how the KFD Thunk library manages the
> shared virtual address space. We typically start at fairly high virtual
> addresses and reserve the lower 1/4 of our address space for coherent
> mappings (an aperture-based scheme for pre-gfx9). The address in the
> fault below is 0x0000001000d80000, so a bit above 64GB, near the start
> of our non-coherent range.
>
> Simple KFD tests that don't use the non-coherent (high) address range
> seem to be working fine. That tells me that the multi-level page table
> code has a problem with high addresses.
>
> I'll keep digging ...

Do you have multiple GPUs in the system?  There might be issues since some
of the vm related settings come from global variables.
Alex

> Regards,
>   Felix
>
> [   24.768477] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.777361] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.784204] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.791418] amdgpu 0000:03:00.0: IH ring buffer overflow (0x00083E00, 0x00000740, 0x00003E20)
> [   24.791421] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.800299] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.807154] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.814370] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.823251] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.830098] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.837312] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.846190] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.853056] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.860273] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.869151] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.875994] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.883209] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.892087] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.898933] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.906170] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.915059] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.921910] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.929143] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.938021] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.944874] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.952089] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.960967] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.967810] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   29.610925] gmc_v9_0_process_interrupt: 3402060 callbacks suppressed
> [   29.610926] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   29.628202] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   29.641520] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
>
>
> On 17-03-27 01:53 AM, Chunming Zhou wrote:
> > Starting from Vega, ASICs support multi-level VMPTs; this series
> > implements that.
> >
> > Tested successfully with 2/3/4 levels.
> >
> > V2: address Christian's comments.
> >
> > Max vm size 256TB tested ok.
> >
> >
> > Christian König (10):
> >   drm/amdgpu: rename page_directory_fence to last_dir_update
> >   drm/amdgpu: add the VM pointer to the amdgpu_pte_update_params as well
> >   drm/amdgpu: add num_level to the VM manager
> >   drm/amdgpu: generalize page table level
> >   drm/amdgpu: handle multi level PD size calculation
> >   drm/amdgpu: handle multi level PD during validation
> >   drm/amdgpu: handle multi level PD in the LRU
> >   drm/amdgpu: handle multi level PD updates V2
> >   drm/amdgpu: handle multi level PD during PT updates
> >   drm/amdgpu: add alloc/free for multi level PDs V2
> >
> > Chunming Zhou (5):
> >   drm/amdgpu: abstract block size to one function
> >   drm/amdgpu: limit block size to one page
> >   drm/amdgpu: adapt vm size for multi vmpt
> >   drm/amdgpu: set page table depth by num_level
> >   drm/amdgpu: enable four level VMPT for gmc9
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |   6 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  67 ++--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |   2 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 474 +++++++++++++++++++----------
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h     |  16 +-
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c   |   3 +-
> >  drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c      |   1 +
> >  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c      |   1 +
> >  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c      |   1 +
> >  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |   7 +
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c    |   2 +-
> >  11 files changed, 380 insertions(+), 200 deletions(-)
> >
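
[Editor's note] The faulting address and status quoted in the thread can be
decoded by hand. The sketch below is illustrative only: the WALKER_ERROR bit
position and the 9-bits-per-level address split are assumptions chosen to be
consistent with the values quoted in the mail (a 256TB, 48-bit address space
with four levels and 4KiB pages gives 4 x 9 + 12 = 48), not values taken from
the Vega10 register specification.

```python
# Illustrative decoding of the fault data quoted above.  Field positions
# and the per-level bit split are ASSUMPTIONS for illustration, not the
# authoritative hardware layout.

FAULT_ADDR = 0x0000001000d80000    # faulting page address from the log
FAULT_STATUS = 0x00841157          # VM_L2_PROTECTION_FAULT_STATUS from the log

# Assumed layout: MORE_FAULTS at bit 0, WALKER_ERROR in bits [3:1].
walker_error = (FAULT_STATUS >> 1) & 0x7
print(f"WALKER_ERROR = {walker_error}")          # 3 -> bad PDE1 value, per the mail

# Where the address sits relative to the apertures Felix describes.
print(f"fault address = {FAULT_ADDR / 2**30:.2f} GiB")   # just above 64 GiB

# Assumed split: 48-bit VA, 4 KiB pages, 9 address bits resolved per level.
PAGE_SHIFT = 12
BITS_PER_LEVEL = 9
for level in range(3, -1, -1):                   # root level down to the page table
    shift = PAGE_SHIFT + level * BITS_PER_LEVEL
    idx = (FAULT_ADDR >> shift) & ((1 << BITS_PER_LEVEL) - 1)
    print(f"level {level} (shift {shift:2d}): index {idx}")
```

Under this assumed split, the two upper directory indices for the faulting
address are 0 and 64, whereas an address below 1GB leaves both at 0, which is
at least consistent with Felix's observation that only high addresses walk
into the broken upper directory levels.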