[PATCH 03/14] drm/amdgpu: Add helper to turn an existing VM into a compute VM

felix.kuehling@xxxxxxx (Felix Kuehling) · Mon, 5 Mar 2018 12:22:19 -0500

On 2018-03-05 12:14 PM, Christian KÃ¶nig wrote:
> Am 05.03.2018 um 17:58 schrieb Felix Kuehling:
>> On 2018-03-05 07:17 AM, Christian KÃ¶nig wrote:
>>> Am 04.03.2018 um 03:34 schrieb Felix Kuehling:
>>>> Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>> ---
>>>> Â Â  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 82
>>>> ++++++++++++++++++++++++++++++++++
>>>> Â Â  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |Â  1 +
>>>> Â Â  2 files changed, 83 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> index 5afbc5e..58153ef 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> @@ -2350,6 +2350,8 @@ int amdgpu_vm_init(struct amdgpu_device *adev,
>>>> struct amdgpu_vm *vm,
>>>> Â Â Â Â Â Â  INIT_KFIFO(vm->faults);
>>>> Â Â Â Â Â Â  vm->fault_credit = 16;
>>>> Â Â  +Â Â Â  vm->vm_context = vm_context;
>>> I think we should drop the vm_context parameter and all the related
>>> code in amdgpu_vm_init(). But that can be part of a later cleanup
>>> patch as well.
>> Yep. It will be a prerequisite for sharing VMs between graphics and
>> compute. But then CPU vs SDMA page table update would need to be handled
>> differently, on a per-call basis and probably some synchronization
>> between them.
>>
>> Also, we'd need to fix shadow page table handling with CPU updates, or
>> get rid of shadow page tables. I'm not sure why they're needed, TBH.
>
> The idea behind shadow page tables is that you have the current state
> of the pipeline if the GPU crashes. Since with CPU updates there is no
> pipeline they doesn't make sense at all with them.
>
>>
>>>> +
>>>> Â Â Â Â Â Â  return 0;
>>>> Â Â  Â  error_free_root:
>>>> @@ -2364,6 +2366,86 @@ int amdgpu_vm_init(struct amdgpu_device *adev,
>>>> struct amdgpu_vm *vm,
>>>> Â Â  }
>>>> Â Â  Â  /**
>>>> + * amdgpu_vm_make_compute - Turn a GFX VM into a compute VM
>>>> + *
>>>> + * This only works on GFX VMs that don't have any BOs added and no
>>>> + * page tables allocated yet.
>>>> + *
>>>> + * Changes the following VM parameters:
>>>> + * - vm_context
>>>> + * - use_cpu_for_update
>>>> + * - pte_supports_ats
>>>> + * - pasid (old PASID is released, because compute manages its own
>>>> PASIDs)
>>>> + *
>>>> + * Reinitializes the page directory to reflect the changed ATS
>>>> + * setting. May leave behind an unused shadow BO for the page
>>>> + * directory when switching from SDMA updates to CPU updates.
>>>> + *
>>>> + * Returns 0 for success, -errno for errors.
>>>> + */
>>>> +int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct
>>>> amdgpu_vm *vm)
>>>> +{
>>>> +Â Â Â  bool pte_support_ats = (adev->asic_type == CHIP_RAVEN);
>>>> +Â Â Â  int r;
>>>> +
>>>> +Â Â Â  r = amdgpu_bo_reserve(vm->root.base.bo, true);
>>>> +Â Â Â  if (r)
>>>> +Â Â Â Â Â Â Â  return r;
>>>> +
>>>> +Â Â Â  /* Sanity checks */
>>>> +Â Â Â  if (vm->vm_context == AMDGPU_VM_CONTEXT_COMPUTE) {
>>>> +Â Â Â Â Â Â Â  /* Can happen if ioctl is interrupted by a signal after
>>>> +Â Â Â Â Â Â Â Â  * this function already completed. Just return success.
>>>> +Â Â Â Â Â Â Â Â  */
>>>> +Â Â Â Â Â Â Â  r = 0;
>>>> +Â Â Â Â Â Â Â  goto error;
>>>> +Â Â Â  }
>>> Ok, that is actually a show stopper. An interrupted IOCTL should never
>>> have a visible effect.
>>>
>>> Is that just a theoretical issue or did you run into this in practice?
>> It's theoretical. init_kfd_vm in patch 4 reserves the page directory
>> again, validates it and waits for the validation to complete. Both the
>> reservation and the waiting can be interrupted by signals.
>
> No, both have parameters to control if they are interruptible or not
> for exactly this reason.
>
>> Â  This happens
>> after amdgpu_vm_make_compute has completed.
>>
>> If we wanted to achieve "no visible effect", we'd need to roll back the
>> conversion to a compute VM, which may involve more waiting.
>>
>> We have a similar issue in map_memory_to_gpu. We map on all GPUs and
>> wait for a sync object in the end. If SDMA is used, this allows the SDMA
>> on all GPUs to work concurrently. The wait in the end can be
>> interrupted. Instead of rolling back the mapping on all GPUs (which
>> would require more waiting), we just let the system call return
>> -ERESTARTSYS and make sure the restart doesn't cause any trouble.
>
> If waits are interruptible or not is controllable by parameters. At
> least for importing the VM it is probably ok to disable this.

Sure, I can change that.

>
>> I've been looking for documentation about the expected behaviour for
>> system calls interrupted by signals. All I can find is this statement in
>> Documentation/kernel-hacking/hacking.rst:
>>> After you slept you should check if a signal occurred: the Unix/Linux
>>> way of handling signals is to temporarily exit the system call with the
>>> ``-ERESTARTSYS`` error. The system call entry code will switch back to
>>> user context, process the signal handler and then your system call will
>>> be restarted (unless the user disabled that). So you should be prepared
>>> to process the restart, e.g. if you're in the middle of manipulating
>>> some data structure.
>> The same paragraph is also copied in
>> https://www.kernel.org/doc/htmldocs/kernel-hacking/ioctls.html.
>>
>> The key sentence is "should be prepared to process the restart, e.g. if
>> you're in the middle of manipulating some data structure". Rolling back
>> everything would achieve that, but it's not the only way and not the
>> most efficient way.
>>
>> In this case, I'm handling the restart by checking whether the VM is
>> already a compute VM. So on the restart the conversion to a compute VM
>> becomes a no-op.
>>
>> Similarly, in the map_memory_to_gpu ioctl, mapping on a GPU where the
>> memory is already mapped becomes a no-op, which handles the restart
>> case. In this case we even added a specific test for this condition in
>> kfdtest, where we deliberately interrupt a big mapping operation with a
>> signal.
>
> This approach is a clear no-go.
>
> Changes done before interrupting a system call doesn't necessary need
> to be reverted, but only if they don't have a user visible effect.

First of all, I don't see that requirement document anywhere, except in
emails from you. The documentation that I've been able to find states no
such requirement. Can you point me to where this rule is documented?

Second, the way I see it, a mapping that is not complete has no user
visible effect. Before the mapping, the behaviour of accessing the
mapping is undefined. After an interrupted mapping, the behaviour is
still undefined.

>
> For example: When we map something we need to allocate multiple page
> directories. It is not necessary to release all previous allocated
> ones if an allocation suddenly says that it was interrupted, but only
> because those page directories are not user accessible.
>
> Similar exceptions can be done by avoiding things when the IOCTL was
> interrupted. E.g. in the VA IOCTL we usually directly submit updating
> the page tables, but if that is interrupted we just return success
> instead of restarting the IOCTL and wait with the update until the
> first command submission.
>
> The easiest way to work around this to to add a separate IOCTL which
> waits for VM updates to complete. Should be trivial to use
> vm->last_update for this.

For the cost of an extra system call. That means every GPU mapping
operation now requires two system calls. With Spectre workarounds, that
can be a lot more expensive.

Regards,
Â  Felix

>
> Regards,
> Christian.
>
>>
>> Regards,
>> Â Â  Felix
>>
>>> Regards,
>>> Christian.
>>>
>>>> +Â Â Â  if (!RB_EMPTY_ROOT(&vm->va.rb_root) || vm->root.entries) {
>>>> +Â Â Â Â Â Â Â  r = -EINVAL;
>>>> +Â Â Â Â Â Â Â  goto error;
>>>> +Â Â Â  }
>>>> +
>>>> +Â Â Â  /* Check if PD needs to be reinitialized and do it before
>>>> +Â Â Â Â  * changing any other state, in case it fails.
>>>> +Â Â Â Â  */
>>>> +Â Â Â  if (pte_support_ats != vm->pte_support_ats) {
>>>> +Â Â Â Â Â Â Â  /* TODO: This depends on a patch series by Christian.
>>>> +Â Â Â Â Â Â Â Â  * It's only needed for GFX9 GPUs, which aren't
>>>> +Â Â Â Â Â Â Â Â  * supported by upstream KFD yet.
>>>> +Â Â Â Â Â Â Â  r = amdgpu_vm_clear_bo(adev, vm, vm->root.base.bo,
>>>> +Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  adev->vm_manager.root_level,
>>>> +Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  pte_support_ats);
>>>> +Â Â Â Â Â Â Â  if (r)
>>>> +Â Â Â Â Â Â Â Â Â Â Â  goto error;
>>>> +Â Â Â Â Â Â Â  */
>>>> +Â Â Â  }
>>>> +
>>>> +Â Â Â  /* Update VM state */
>>>> +Â Â Â  vm->vm_context = AMDGPU_VM_CONTEXT_COMPUTE;
>>>> +Â Â Â  vm->use_cpu_for_update = !!(adev->vm_manager.vm_update_mode &
>>>> +Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  AMDGPU_VM_USE_CPU_FOR_COMPUTE);
>>>> +Â Â Â  vm->pte_support_ats = pte_support_ats;
>>>> +Â Â Â  DRM_DEBUG_DRIVER("VM update mode is %s\n",
>>>> +Â Â Â Â Â Â Â Â Â Â Â Â  vm->use_cpu_for_update ? "CPU" : "SDMA");
>>>> +Â Â Â  WARN_ONCE((vm->use_cpu_for_update &
>>>> !amdgpu_vm_is_large_bar(adev)),
>>>> +Â Â Â Â Â Â Â Â Â  "CPU update of VM recommended only for large BAR
>>>> system\n");
>>>> +
>>>> +Â Â Â  if (vm->pasid) {
>>>> +Â Â Â Â Â Â Â  unsigned long flags;
>>>> +
>>>> +Â Â Â Â Â Â Â  spin_lock_irqsave(&adev->vm_manager.pasid_lock, flags);
>>>> +Â Â Â Â Â Â Â  idr_remove(&adev->vm_manager.pasid_idr, vm->pasid);
>>>> +Â Â Â Â Â Â Â  spin_unlock_irqrestore(&adev->vm_manager.pasid_lock, flags);
>>>> +
>>>> +Â Â Â Â Â Â Â  vm->pasid = 0;
>>>> +Â Â Â  }
>>>> +
>>>> +error:
>>>> +Â Â Â  amdgpu_bo_unreserve(vm->root.base.bo);
>>>> +Â Â Â  return r;
>>>> +}
>>>> +
>>>> +/**
>>>> Â Â Â  * amdgpu_vm_free_levels - free PD/PT levels
>>>> Â Â Â  *
>>>> Â Â Â  * @adev: amdgpu device structure
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>> index 40b4e09..7f50a38 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>> @@ -263,6 +263,7 @@ void amdgpu_vm_manager_init(struct amdgpu_device
>>>> *adev);
>>>> Â Â  void amdgpu_vm_manager_fini(struct amdgpu_device *adev);
>>>> Â Â  int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm
>>>> *vm,
>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â  int vm_context, unsigned int pasid);
>>>> +int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct
>>>> amdgpu_vm *vm);
>>>> Â Â  void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm
>>>> *vm);
>>>> Â Â  bool amdgpu_vm_pasid_fault_credit(struct amdgpu_device *adev,
>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  unsigned int pasid);
>