It looks like the multi-level page table changes have been submitted. They're causing problems when we're trying to integrate them into our KFD branch. We resolved the obvious conflicts, and it's working on older ASICs without problems. But we're getting hangs on Vega10.

With my patch to enable UTCL2 interrupts, I'm seeing lots of VM faults (see the log below). The VM_L2_PROTECTION_FAULT_STATUS indicates a WALKER_ERROR (3 = PDE1 value); I've included my decode of the status word after the log. If I set adev->vm_manager.num_level = 1 in gmc_v9_0_vm_init, the problem goes away (basically reverting b98e6b5 "drm/amdgpu: enable four level VMPT for gmc9"); the exact hack is also quoted after the log.

I suspect an issue that's exposed by how the KFD Thunk library manages the shared virtual address space. We typically start at fairly high virtual addresses and reserve the lower 1/4 of our address space for coherent mappings (the aperture-based scheme for pre-gfx9). The address in the fault below is 0x0000001000d80000, so a bit above 64GB, near the start of our non-coherent range. Simple KFD tests that don't use the non-coherent (high) address range seem to be working fine. That tells me that the multi-level page table code has a problem with high addresses; a quick per-level index calculation for the faulting address (also after the log) fits that picture.

I'll keep digging ...

Regards,
  Felix

[ 24.768477] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.777361] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.784204] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.791418] amdgpu 0000:03:00.0: IH ring buffer overflow (0x00083E00, 0x00000740, 0x00003E20)
[ 24.791421] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.800299] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.807154] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.814370] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.823251] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.830098] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.837312] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.846190] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.853056] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.860273] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.869151] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.875994] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.883209] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.892087] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.898933] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.906170] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.915059] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.921910] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.929143] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.938021] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.944874] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 24.952089] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 24.960967] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 24.967810] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00841157
[ 29.610925] gmc_v9_0_process_interrupt: 3402060 callbacks suppressed
[ 29.610926] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0 vm_id:8 pas_id:1)
[ 29.628202] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
[ 29.641520] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
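For reference, this is how I'm decoding that fault status word. The field layout is from my reading of the Vega10 register headers, so treat the exact bit positions as an assumption rather than gospel:

/* Quick sketch: decode VM_L2_PROTECTION_FAULT_STATUS = 0x00841157.
 * Field offsets are my reading of the Vega10 register headers;
 * double-check them before relying on this.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t status = 0x00841157;

	printf("MORE_FAULTS:       %u\n", status & 0x1);        /* 1 */
	printf("WALKER_ERROR:      %u\n", (status >> 1) & 0x7); /* 3 = PDE1 value */
	printf("PERMISSION_FAULTS: %u\n", (status >> 4) & 0xf); /* 5 */
	printf("MAPPING_ERROR:     %u\n", (status >> 8) & 0x1); /* 1 */
	return 0;
}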
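For completeness, the workaround is literally just this one-liner at the end of gmc_v9_0_vm_init() (a local hack, not a proposed fix):

	/* HACK: fall back to a single level VMPT; this makes the
	 * Vega10 hangs and VM faults go away.
	 */
	adev->vm_manager.num_level = 1;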
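And here's a quick sanity check of where the faulting address lands in a four-level walk. I'm assuming a 48-bit VA split as 9+9+9+9 bits of indices plus a 12-bit page offset; the real split depends on the block size, so this is only a sketch:

/* Per-level page table indices for the faulting address, assuming
 * 4KB pages and 512 entries (9 bits) per level.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t addr = 0x0000001000d80000ULL; /* 64GB + 13.5MB */

	printf("PDE2 index: %llu\n", (unsigned long long)((addr >> 39) & 0x1ff)); /* 0   */
	printf("PDE1 index: %llu\n", (unsigned long long)((addr >> 30) & 0x1ff)); /* 64  */
	printf("PDE0 index: %llu\n", (unsigned long long)((addr >> 21) & 0x1ff)); /* 6   */
	printf("PTE  index: %llu\n", (unsigned long long)((addr >> 12) & 0x1ff)); /* 384 */
	return 0;
}

If that split is right, the walk needs PDE1 entries at index 64 and up, which only addresses at 64GB and above ever touch. That would line up with the WALKER_ERROR pointing at a PDE1 value, but I haven't confirmed it yet.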
On 17-03-27 01:53 AM, Chunming Zhou wrote:
> *** BLURB HERE ***
> From Vega, ASICs start to support multi-level VMPT; this series implements it.
>
> Tested successfully with 2/3/4 levels.
>
> V2: address Christian's comments.
>
> Max vm size 256TB tested ok.
>
>
> Christian König (10):
>   drm/amdgpu: rename page_directory_fence to last_dir_update
>   drm/amdgpu: add the VM pointer to the amdgpu_pte_update_params as well
>   drm/amdgpu: add num_level to the VM manager
>   drm/amdgpu: generalize page table level
>   drm/amdgpu: handle multi level PD size calculation
>   drm/amdgpu: handle multi level PD during validation
>   drm/amdgpu: handle multi level PD in the LRU
>   drm/amdgpu: handle multi level PD updates V2
>   drm/amdgpu: handle multi level PD during PT updates
>   drm/amdgpu: add alloc/free for multi level PDs V2
>
> Chunming Zhou (5):
>   drm/amdgpu: abstract block size to one function
>   drm/amdgpu: limit block size to one page
>   drm/amdgpu: adapt vm size for multi vmpt
>   drm/amdgpu: set page table depth by num_level
>   drm/amdgpu: enable four level VMPT for gmc9
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |   6 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  67 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 474 +++++++++++++++++++----------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h     |  16 +-
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c   |   3 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c      |   1 +
>  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c      |   1 +
>  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c      |   1 +
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |   7 +
>  drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c    |   2 +-
>  11 files changed, 380 insertions(+), 200 deletions(-)
>