On 2018-01-30 08:51 AM, Christian König wrote:
> On 30.01.2018 at 00:01, Felix Kuehling wrote:
>> This is a very cool patch series. I made some specific comments on some
>> of the patches. But overall this is great.
>
> Thanks, going to comment on some of those on the patches.
>
>> I guess your plan is to start testing the SVM programming model with
>> ATC, and then enable the same programming model without the IOMMU using
>> HMM. That means there could be working and validated tests to run on
>> HMM.
>
> Yes, exactly. Well, I think it would make implementing HMM much easier
> if we could rely on most of the mapping being done by the ATC instead
> of manually crafted GPUVM page tables.
>
>> I think 4.16 will get an IOMMUv2 driver that we worked on for Raven to
>> support PPR on newer IOMMU versions. Without that the GPU is not able to
>> make swapped or new pages resident or trigger a COW.
>
> Uff? So the ATC is actually not able to handle page faults?

It's supposed to. It worked on Carrizo. As I understand it, the new
IOMMUv2 on Raven needs the driver to enable that feature per-device.
That fix is coming in 4.16.

>
>> We're currently still working on one problem with Raven, related to the
>> way GFX9 retries memory accesses. Many PPR requests for the same virtual
>> address can be outstanding (in an IOMMU log buffer). After the first
>> request is handled, the GPU can continue, but the remaining requests are
>> still in the queue. This can result in the IOMMU driver trying to handle
>> a PPR for a page that's already freed by the application, which triggers
>> an invalid PPR callback.
>>
>> An invalid PPR is like the GPU equivalent of a segfault, and KFD
>> implements it like that. With the above behaviour we end up segfaulting
>> applications that didn't do anything wrong. I guess for your
>> implementation it's not a problem because you don't implement that
>> callback yet.
>
> Yeah, that is exactly the same problem I'm currently running into with
> HMM.
>
> The interrupt handling is pipelined (even much more than the ATC
> path), so what can happen is that applications free up some memory
> while we still have stale page faults for that page in the pipeline.
>
> The only valid workaround I can see is to make sure interrupts are
> processed before telling HMM that it can unmap pages, and that is a
> real show-stopper for performance as far as I can see.

For ATC my idea is to use an invalidate_range_start MMU notifier to
wait for pending PPRs to be processed before letting the kernel unmap
pages.

The underlying assumption is this: when an application frees memory,
it must be sure it's not using that memory any more. So we don't
expect any new faults for that address; we only need to wait for
pending faults to flush out of the pipe.

For HMM I think the prescreen interrupt handler stage I added should
help, so you only see one interrupt per faulting address. But you need
to figure out the right moment to clear the fault after updating the
page table.

Regards,
  Felix

>
> Regards,
> Christian.
>
>>
>> Regards,
>>   Felix
>>
>>
>> On 2018-01-26 03:13 PM, Christian König wrote:
>>> That got mixed up with the encode ring function.
>>>
>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c | 19 ++++++++++++++++++-
>>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
>>> index 44c041a1fe68..24ebc3e296a6 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
>>> @@ -880,6 +880,22 @@ static void vcn_v1_0_dec_ring_emit_vm_flush(struct amdgpu_ring *ring,
>>>  	vcn_v1_0_dec_vm_reg_wait(ring, data0, data1, mask);
>>>  }
>>>
>>> +static void vcn_v1_0_dec_ring_emit_wreg(struct amdgpu_ring *ring,
>>> +					uint32_t reg, uint32_t val)
>>> +{
>>> +	struct amdgpu_device *adev = ring->adev;
>>> +
>>> +	amdgpu_ring_write(ring,
>>> +		PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_DATA0), 0));
>>> +	amdgpu_ring_write(ring, reg << 2);
>>> +	amdgpu_ring_write(ring,
>>> +		PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_DATA1), 0));
>>> +	amdgpu_ring_write(ring, val);
>>> +	amdgpu_ring_write(ring,
>>> +		PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_CMD), 0));
>>> +	amdgpu_ring_write(ring, VCN_DEC_CMD_WRITE_REG << 1);
>>> +}
>>> +
>>>  /**
>>>   * vcn_v1_0_enc_ring_get_rptr - get enc read pointer
>>>   *
>>> @@ -1097,7 +1113,7 @@ static const struct amdgpu_ring_funcs vcn_v1_0_dec_ring_vm_funcs = {
>>>  	.pad_ib = amdgpu_ring_generic_pad_ib,
>>>  	.begin_use = amdgpu_vcn_ring_begin_use,
>>>  	.end_use = amdgpu_vcn_ring_end_use,
>>> -	.emit_wreg = vcn_v1_0_enc_ring_emit_wreg,
>>> +	.emit_wreg = vcn_v1_0_dec_ring_emit_wreg,
>>>  };
>>>
>>>  static const struct amdgpu_ring_funcs vcn_v1_0_enc_ring_vm_funcs = {
>>> @@ -1124,6 +1140,7 @@ static const struct amdgpu_ring_funcs vcn_v1_0_enc_ring_vm_funcs = {
>>>  	.pad_ib = amdgpu_ring_generic_pad_ib,
>>>  	.begin_use = amdgpu_vcn_ring_begin_use,
>>>  	.end_use = amdgpu_vcn_ring_end_use,
>>> +	.emit_wreg = vcn_v1_0_enc_ring_emit_wreg,
>>>  };
>>>
>>>  static void vcn_v1_0_set_dec_ring_funcs(struct amdgpu_device *adev)
>