> -----Original Message-----
> From: Kuehling, Felix
> Sent: Tuesday, March 28, 2017 4:15 PM
> To: amd-gfx at lists.freedesktop.org; Koenig, Christian; Zhou,
> David(ChunMing); Deucher, Alexander
> Cc: Russell, Kent
> Subject: Multilevel page tables broken for high addresses
>
> It looks like the multi-level page table changes have been submitted.
> They're causing problems when we're trying to integrate them into our
> KFD branch.
>
> We resolved the obvious conflicts and it's working on older ASICs without
> problems. But we're getting hangs on Vega10. With my patch to enable
> UTCL2 interrupts, I'm seeing lots of VM faults (see below). The
> VM_L2_PROTECTION_FAULT_STATUS indicates a WALKER_ERROR (3 = PDE1 value).
>
> If I set adev->vm_manager.num_level = 1 in gmc_v9_0_vm_init, the problem
> goes away (basically reverting b98e6b5 "drm/amdgpu: enable four level
> VMPT for gmc9").
>
> I suspect the issue is exposed by how the KFD Thunk library manages the
> shared virtual address space. We typically start at fairly high virtual
> addresses and reserve the lower 1/4 of our address space for coherent
> mappings (an aperture-based scheme for pre-gfx9). The address in the
> fault below is 0x0000001000d80000, so a bit above 64GB, near the start
> of our non-coherent range.
>
> Simple KFD tests that don't use the non-coherent (high) address range
> seem to be working fine. That tells me that the multi-level page table
> code has a problem with high addresses.
>
> I'll keep digging ...

Do you have multiple GPUs in the system?  There might be issues since some
of the vm related settings come from global variables.
Alex

> Regards,
>   Felix
>
> [   24.768477] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.777361] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.784204] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.791418] amdgpu 0000:03:00.0: IH ring buffer overflow (0x00083E00, 0x00000740, 0x00003E20)
> [   24.791421] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.800299] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.807154] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.814370] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.823251] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.830098] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.837312] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.846190] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.853056] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.860273] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.869151] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.875994] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.883209] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.892087] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.898933] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.906170] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.915059] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.921910] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.929143] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.938021] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.944874] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   24.952089] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   24.960967] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   24.967810] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [   29.610925] gmc_v9_0_process_interrupt: 3402060 callbacks suppressed
> [   29.610926] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
> [   29.628202] amdgpu 0000:03:00.0:   at page 0x0000001000d80000 from 27
> [   29.641520] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
>
>
> On 17-03-27 01:53 AM, Chunming Zhou wrote:
> > Starting from Vega, ASICs support multi-level VMPTs; this series
> > implements that.
> >
> > Tested successfully with 2/3/4 levels.
> >
> > V2: address Christian's comments.
> >
> > Max vm size 256TB tested ok.
> >
> >
> > Christian König (10):
> >   drm/amdgpu: rename page_directory_fence to last_dir_update
> >   drm/amdgpu: add the VM pointer to the amdgpu_pte_update_params as well
> >   drm/amdgpu: add num_level to the VM manager
> >   drm/amdgpu: generalize page table level
> >   drm/amdgpu: handle multi level PD size calculation
> >   drm/amdgpu: handle multi level PD during validation
> >   drm/amdgpu: handle multi level PD in the LRU
> >   drm/amdgpu: handle multi level PD updates V2
> >   drm/amdgpu: handle multi level PD during PT updates
> >   drm/amdgpu: add alloc/free for multi level PDs V2
> >
> > Chunming Zhou (5):
> >   drm/amdgpu: abstract block size to one function
> >   drm/amdgpu: limit block size to one page
> >   drm/amdgpu: adapt vm size for multi vmpt
> >   drm/amdgpu: set page table depth by num_level
> >   drm/amdgpu: enable four level VMPT for gmc9
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |   6 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  67 ++--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |   2 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 474 +++++++++++++++++++----------
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h     |  16 +-
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c   |   3 +-
> >  drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c      |   1 +
> >  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c      |   1 +
> >  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c      |   1 +
> >  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |   7 +
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c    |   2 +-
> >  11 files changed, 380 insertions(+), 200 deletions(-)
> >
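
[Editor's note] The faulting address and status quoted in the thread can be
decoded by hand. The sketch below is illustrative only: the WALKER_ERROR bit
position and the 9-bits-per-level address split are assumptions chosen to be
consistent with the values quoted in the mail (a 256TB, 48-bit address space
with four levels and 4KiB pages gives 4 x 9 + 12 = 48), not values taken from
the Vega10 register specification.

```python
# Illustrative decoding of the fault data quoted above.  Field positions
# and the per-level bit split are ASSUMPTIONS for illustration, not the
# authoritative hardware layout.

FAULT_ADDR = 0x0000001000d80000    # faulting page address from the log
FAULT_STATUS = 0x00841157          # VM_L2_PROTECTION_FAULT_STATUS from the log

# Assumed layout: MORE_FAULTS at bit 0, WALKER_ERROR in bits [3:1].
walker_error = (FAULT_STATUS >> 1) & 0x7
print(f"WALKER_ERROR = {walker_error}")          # 3 -> bad PDE1 value, per the mail

# Where the address sits relative to the apertures Felix describes.
print(f"fault address = {FAULT_ADDR / 2**30:.2f} GiB")   # just above 64 GiB

# Assumed split: 48-bit VA, 4 KiB pages, 9 address bits resolved per level.
PAGE_SHIFT = 12
BITS_PER_LEVEL = 9
for level in range(3, -1, -1):                   # root level down to the page table
    shift = PAGE_SHIFT + level * BITS_PER_LEVEL
    idx = (FAULT_ADDR >> shift) & ((1 << BITS_PER_LEVEL) - 1)
    print(f"level {level} (shift {shift:2d}): index {idx}")
```

Under this assumed split, the two upper directory indices for the faulting
address are 0 and 64, whereas an address below 1GB leaves both at 0, which is
at least consistent with Felix's observation that only high addresses walk
into the broken upper directory levels.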