Changes to let KFD use render node VMs

christian.koenig@xxxxxxx (Christian König) · Fri, 2 Mar 2018 19:07:19 +0100

Am 02.03.2018 um 17:38 schrieb Felix Kuehling:
> On 2018-03-02 02:43 AM, Christian KÃ¶nig wrote:
>> Hi Felix,
>>
>> patch #3 looks good to me in general, but why do you need to cache the
>> pd_phys_addr?
>>
>> The BO already does that anyway, so you can just use
>> amdgpu_bo_gpu_addr() for this which also makes some additional checks
>> on debug builds.
> Do you mean amdgpu_bo_gpu_offset? That one requires the reservation to
> be locked unless it's pinned.

Correct, well that's why I suggested it.

> We are caching the PD physical address here to get around a tricky
> circular locking issue we ran into under memory pressure with evictions.
> I don't remember all the details, but I do remember that the deadlock
> involved fences, which aren't tracked by the lock dependency checker. So
> it was particularly tricky to nail down. Avoiding the need to lock the
> page directory reservation for finding out its physical address breaks
> the lock cycle.

OK, so what you do is you get the pd address after you validated the BOs 
and cache it until you need it in the hardware setup?

If yes than that would be valid because we do exactly the same in the 
job during command submission.

>> patch #5 well mixing power management into the VM functions is a clear
>> NAK.
> This part won't be in my initial upstream push for GPUVM.
>
>> That certainly doesn't belong there, but we can have a counter how
>> many compute VMs we have in the manager. amdgpu_vm_make_compute() can
>> then return if this was the first VM to became a compute VM or not.
> We currently trigger compute power profile switching based on the
> existence of compute VMs. It was a convenient hook where amdgpu finds
> out about the existence of compute work. If that's not acceptable we can
> look into moving it elsewhere, possibly using a new KFD2KGD interface.

As I said the VM code can still count how many compute VM there are, the 
issue is it should not make power management decisions based on that.

The caller which triggers the switch of the VM to be a compute VM should 
make that decision.

Christian.

>
>> The rest of the patch looks good to me.
> Thanks,
>  Â  Felix
>
>> Regards,
>> Christian.
>>
>> Am 01.03.2018 um 23:58 schrieb Felix Kuehling:
>>> Hi Christian,
>>>
>>> I have a working patch series against amd-kfg-staging that lets KFD use
>>> VMs from render node FDs, as we discussed. There are two patches in that
>>> series that touch amdgpu_vm.[ch] that I'd like your feedback on before I
>>> commit the changes to amd-kfd-staging and include them in my upstream
>>> patch series for KFD GPUVM support. See attached.
>>>
>>> Thanks,
>>>  Â Â  Felix
>>>