ok , sounds good . Please go ahead to revert the change . I will send out another one for review . Regards shaoyun.liu On 2019-03-20 6:00 a.m., Christian König wrote: > Am 19.03.19 um 19:48 schrieb Liu, Shaoyun: >> As I understand, if we want to implement the logic in bo_add/rmv >> function , I need following conditions are all to be true for a valid >> XGMI request. >> >> 1. check the adev and the bo_adev are different . >> This is correct on rocm implementation .( >> amdgpu_amdkfd_gpuvm_map_memory_to_gpu pass in the correct adev it want >> to mapped to). For gem bo (amdgpu_gem_object_open/close), the adev is >> get from bo directly so it's alway same. Do we need extra adev info >> and how or we don't need to care about the XGMI for graphic side . > > That the GEM code uses the wrong adev sounds like a bug to me which > should be fixed anyway. > > Apart from that I would just add a function > amdgpu_xgmi_same_hive(adev1, adev2) which checks adev1!=adev2 and > adev1->hive_id == adev2->hive_id. > >> 2. check the bo->preferred_domains to be AMDGPU_GEM_DOMAIN_VRAM. >> But as Felix mentioned , this domain can be changed by >> AMDGPU_GEM_OP_SET_PLACEMENT , how to handle this? > > As Felix pointed out BOs can move around for a couple of reasons. So > actually checking if we use XGMI or not can be rather tricky. > > I would just check the hive_id of the involved devices, so that if we > can potentially use XGMI we power it on. > > Only alternative I can see is to have the same complicated handling as > with PRT. > >> Can you explain a little bit more on how to handle the eviction with the >> flag in bo_va structure as you mentioned ? Do you mean we disable the >> eviction for the bo with XGMI request ? > > Just put the is_xgmi flag into the bo_va structure instead of the > mapping structure. > > And when the bo_va structure is remove in amdgpu_vm_bo_rmv you also > decrement the counter. > > Apart from that the handling stays the same, e.g. you increment the > counter during the page table update. > > Regards, > Christian. > >> >> Regards >> shaoyun.liu >> >> >> On 2019-03-19 12:09 p.m., Koenig, Christian wrote: >>> Am 19.03.19 um 16:42 schrieb Liu, Shaoyun: >>>> Thanks Felix . >>>> >>>> We did consider to put the logic into bo_add/bo_rmv, but Felix >>>> pointed >>>> out the object can be migrate from FB to system memory after >>>> allocation >>>> . >>> Yeah, I also considered that and that is actually a really good >>> argument. >>> >>> But even in this case you still only need the flag in the bo_va >>> structure, not the mapping. >>> >>>> I also think of put the logic inside >>>> amdgpu_vm_bo_update_mapping , >>>> but seems that function prefer to take the dma address already been >>>> calculated (change that will cause a huge code reorganize) . That's >>>> the >>>> reason I put the logic before calling into >>>> amdgpu_vm_bo_update_mapping . >>> Both places won't work correctly. The problem is simply that you >>> haven't >>> considered what happens when the VM is destroyed. >>> >>> In this case we should not unmap the XGMI mapping, but rather just >>> throw >>> away the page tables completely. >>> >>> Additional to that the mapping code is often interrupted because the >>> calling process receives a signal. So it can happen that all of that is >>> called multiple times. The is_xgmi variable prevents that, but that's >>> really not straight forward. >>> >>> Se the PRT handling for how complicated that gets when you attach >>> something like this to the mapping. IIRC we have 3 different code paths >>> how a PRT mapping can end up being destroyed including a delayed >>> destroy >>> handler and callback. >>> >>>> As the race condition you described sounds reasonable. let me think >>>> how to fix it . >>> Two possible code pattern usually used for this: >>> >>> A) You have a lock protecting both the counter as well as the >>> operation: >>> >>> lock(); >>> if (increment_and_test_counter()) >>> power_on() >>> unlock(); >>> >>> lock() >>> if (decrement_and_test_counter()) >>> power_off(); >>> unlock(); >>> >>> B) The counter is an atomic and you have a lock protecting the >>> operation: >>> >>> if (atomic_inc_return() == 1) { >>> lock(); >>> power_on(); >>> unlock(); >>> } >>> >>> if (atomic_dec_return() == 0) { >>> lock(); >>> if (double_check_atomic_for_race()) >>> power_off(); >>> unlock(); >>> } >>> >>> The later is more efficient, but also more work to implement. >>> >>> Either way I suggest to move the actual increment/decrement into the >>> xgmi code and only have the signalling in the VM code. >>> >>> Regards, >>> Christian. >>> >>>> Regards >>>> >>>> shaoyun.liu >>>> >>>> On 2019-03-19 9:10 a.m., Kuehling, Felix wrote: >>>>> On 3/19/2019 8:49 AM, Christian König wrote: >>>>>> Yeah, all that is perfectly fine. >>>>>> >>>>>> The problem is Shaoyun didn't put this into the mapping code, but >>>>>> rather into the VM state machine. So this won't work at all (the >>>>>> counter and increment/decrement unbalanced and multiple times). >>>>> We tried to consider all the possible ways that this could go wrong. >>>>> Basically, every time a mapping is updated, we update the is_xgmi >>>>> state >>>>> and update the counter if it changed. Have you seen the counter >>>>> become >>>>> unbalanced? >>>>> >>>>> >>>>>> The correct place to add this is amdgpu_vm_bo_add/amdgpu_vm_bo_rmv. >>>>> I think we considered that. The problem is that a BO can be migrated >>>>> between bo_add and bo_rmv. I found that even bo->preferred_domain can >>>>> change with AMDGPU_GEM_OP_SET_PLACEMENT. So you can't reliably know >>>>> whether to increment your counter, and your counter can become >>>>> unbalanced if a migration or AMDGPU_GEM_OP_SET_PLACEMENT happens >>>>> between >>>>> bo_add and bo_rmv. >>>>> >>>>> Therefore we're trying to check for XGMI mappings every time the >>>>> mapping >>>>> changes and keep track of the state in amdgpu_bo_va_mapping. >>>>> >>>>> >>>>>> Additional to that the approach with the counter doesn't work >>>>>> because >>>>>> you don't have a lock protecting the hw update itself. E.g. while >>>>>> powering down you can add a mapping which needs to power it up again >>>>>> and so powering down and powering up race with each other. >>>>> That's a good point. >>>>> >>>>> Regards, >>>>> Felix >>>>> >>>>> >>>>>> Regards, >>>>>> Christian. >>>>>> >>>>>> Am 19.03.19 um 13:42 schrieb Kuehling, Felix: >>>>>>> We discussed a few different approaches before settling on this >>>>>>> one. >>>>>>> >>>>>>> Maybe it needs some more background. XGMI links are quite power >>>>>>> hungry. >>>>>>> Being able to power them down improves performance for >>>>>>> power-limited >>>>>>> workloads that don't need XGMI. In machine learning, pretty much >>>>>>> all >>>>>>> workloads are power limited on our GPUs, so this is not just a >>>>>>> theoretical thing. The problem is, how do you know whether you need >>>>>>> XGMI? You need to know whether there are P2P memory mappings >>>>>>> involving >>>>>>> XGMI. So the natural place to put that is in the memory mapping >>>>>>> code. >>>>>>> >>>>>>> If you do spot a race condition in the code, let's talk about >>>>>>> how to >>>>>>> fix it. >>>>>>> >>>>>>> Regards, >>>>>>> Felix >>>>>>> >>>>>>> On 3/19/2019 8:07 AM, Christian König wrote: >>>>>>>> This reverts commit c9115f8904eef0f880d3b4f8306f553b1bb1c532. >>>>>>>> >>>>>>>> Adding this to the mapping is complete nonsense and the whole >>>>>>>> implementation looks racy. This patch wasn't thoughtfully reviewed >>>>>>>> and should be reverted for now. >>>>>>>> >>>>>>>> Signed-off-by: Christian König <christian.koenig@xxxxxxx> >>>>>>>> --- >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 --- >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 --- >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 - >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 29 >>>>>>>> +--------------------- >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 15 ----------- >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 2 -- >>>>>>>> 6 files changed, 1 insertion(+), 52 deletions(-) >>>>>>>> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>>>>> index b5720c1610e1..1db192150532 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>>>>> @@ -931,9 +931,6 @@ struct amdgpu_device { >>>>>>>> int asic_reset_res; >>>>>>>> struct work_struct xgmi_reset_work; >>>>>>>> - /* counter of mapped memory through xgmi */ >>>>>>>> - atomic_t xgmi_map_counter; >>>>>>>> - >>>>>>>> bool in_baco_reset; >>>>>>>> }; >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>> index 964a4d3f1f43..206583707124 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>> @@ -2018,9 +2018,6 @@ static void >>>>>>>> amdgpu_device_ip_late_init_func_handler(struct work_struct *work) >>>>>>>> r = amdgpu_device_enable_mgpu_fan_boost(); >>>>>>>> if (r) >>>>>>>> DRM_ERROR("enable mgpu fan boost failed >>>>>>>> (%d).\n", r); >>>>>>>> - >>>>>>>> - /*set to low pstate by default */ >>>>>>>> - amdgpu_xgmi_set_pstate(adev, 0); >>>>>>>> } >>>>>>>> static void amdgpu_device_delay_enable_gfx_off(struct >>>>>>>> work_struct *work) >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>>>> index 6f176bbe4cf2..220a6a7b1bc1 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>>>> @@ -54,7 +54,6 @@ struct amdgpu_bo_va_mapping { >>>>>>>> uint64_t __subtree_last; >>>>>>>> uint64_t offset; >>>>>>>> uint64_t flags; >>>>>>>> - bool is_xgmi; >>>>>>>> }; >>>>>>>> /* User space allocated BO in a VM */ >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>>>>>>> index c5230a9fb7f6..c8f0e4ca05fb 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>>>>>>> @@ -34,7 +34,6 @@ >>>>>>>> #include "amdgpu_trace.h" >>>>>>>> #include "amdgpu_amdkfd.h" >>>>>>>> #include "amdgpu_gmc.h" >>>>>>>> -#include "amdgpu_xgmi.h" >>>>>>>> /** >>>>>>>> * DOC: GPUVM >>>>>>>> @@ -2022,9 +2021,8 @@ int amdgpu_vm_bo_update(struct amdgpu_device >>>>>>>> *adev, >>>>>>>> struct ttm_mem_reg *mem; >>>>>>>> struct drm_mm_node *nodes; >>>>>>>> struct dma_fence *exclusive, **last_update; >>>>>>>> - struct amdgpu_device *bo_adev = adev; >>>>>>>> - bool is_xgmi = false; >>>>>>>> uint64_t flags; >>>>>>>> + struct amdgpu_device *bo_adev = adev; >>>>>>>> int r; >>>>>>>> if (clear || !bo) { >>>>>>>> @@ -2046,10 +2044,6 @@ int amdgpu_vm_bo_update(struct >>>>>>>> amdgpu_device >>>>>>>> *adev, >>>>>>>> if (bo) { >>>>>>>> flags = amdgpu_ttm_tt_pte_flags(adev, >>>>>>>> bo->tbo.ttm, mem); >>>>>>>> bo_adev = amdgpu_ttm_adev(bo->tbo.bdev); >>>>>>>> - if (adev != bo_adev && >>>>>>>> - adev->gmc.xgmi.hive_id && >>>>>>>> - adev->gmc.xgmi.hive_id == bo_adev->gmc.xgmi.hive_id) >>>>>>>> - is_xgmi = true; >>>>>>>> } else { >>>>>>>> flags = 0x0; >>>>>>>> } >>>>>>>> @@ -2068,19 +2062,6 @@ int amdgpu_vm_bo_update(struct >>>>>>>> amdgpu_device >>>>>>>> *adev, >>>>>>>> } >>>>>>>> list_for_each_entry(mapping, &bo_va->invalids, >>>>>>>> list) { >>>>>>>> - if (mapping->is_xgmi != is_xgmi) { >>>>>>>> - if (is_xgmi) { >>>>>>>> - /* Adding an XGMI mapping to the PT */ >>>>>>>> - if (atomic_inc_return(&adev->xgmi_map_counter) >>>>>>>> == 1) >>>>>>>> - amdgpu_xgmi_set_pstate(adev, 1); >>>>>>>> - } else { >>>>>>>> - /* Removing an XGMI mapping from the PT */ >>>>>>>> - if (atomic_dec_return(&adev->xgmi_map_counter) >>>>>>>> == 0) >>>>>>>> - amdgpu_xgmi_set_pstate(adev, 0); >>>>>>>> - } >>>>>>>> - mapping->is_xgmi = is_xgmi; >>>>>>>> - } >>>>>>>> - >>>>>>>> r = amdgpu_vm_bo_split_mapping(adev, exclusive, >>>>>>>> pages_addr, vm, >>>>>>>> mapping, flags, bo_adev, nodes, >>>>>>>> last_update); >>>>>>>> @@ -2298,13 +2279,6 @@ int amdgpu_vm_clear_freed(struct >>>>>>>> amdgpu_device *adev, >>>>>>>> r = amdgpu_vm_bo_update_mapping(adev, NULL, >>>>>>>> NULL, vm, >>>>>>>> mapping->start, mapping->last, >>>>>>>> init_pte_value, 0, &f); >>>>>>>> - >>>>>>>> - if (mapping->is_xgmi) { >>>>>>>> - /* Removing an XGMI mapping from the PT */ >>>>>>>> - if (atomic_dec_return(&adev->xgmi_map_counter) == 0) >>>>>>>> - amdgpu_xgmi_set_pstate(adev, 0); >>>>>>>> - } >>>>>>>> - >>>>>>>> amdgpu_vm_free_mapping(adev, vm, mapping, f); >>>>>>>> if (r) { >>>>>>>> dma_fence_put(f); >>>>>>>> @@ -2501,7 +2475,6 @@ int amdgpu_vm_bo_map(struct amdgpu_device >>>>>>>> *adev, >>>>>>>> mapping->last = eaddr; >>>>>>>> mapping->offset = offset; >>>>>>>> mapping->flags = flags; >>>>>>>> - mapping->is_xgmi = false; >>>>>>>> amdgpu_vm_bo_insert_map(adev, bo_va, mapping); >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c >>>>>>>> index 807440d3edff..fcc4b05c745c 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c >>>>>>>> @@ -200,7 +200,6 @@ struct amdgpu_hive_info >>>>>>>> *amdgpu_get_xgmi_hive(struct amdgpu_device *adev, int lo >>>>>>>> if (lock) >>>>>>>> mutex_lock(&tmp->hive_lock); >>>>>>>> - tmp->pstate = -1; >>>>>>>> mutex_unlock(&xgmi_mutex); >>>>>>>> @@ -322,17 +321,3 @@ void amdgpu_xgmi_remove_device(struct >>>>>>>> amdgpu_device *adev) >>>>>>>> mutex_unlock(&hive->hive_lock); >>>>>>>> } >>>>>>>> } >>>>>>>> - >>>>>>>> -int amdgpu_xgmi_set_pstate(struct amdgpu_device *adev, int >>>>>>>> pstate) >>>>>>>> -{ >>>>>>>> - int ret = 0; >>>>>>>> - struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, >>>>>>>> 0); >>>>>>>> - >>>>>>>> - if (!hive) >>>>>>>> - return 0; >>>>>>>> - >>>>>>>> - if (hive->pstate == pstate) >>>>>>>> - return 0; >>>>>>>> - /* Todo : sent the message to SMU for pstate change */ >>>>>>>> - return ret; >>>>>>>> -} >>>>>>>> \ No newline at end of file >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h >>>>>>>> index 7e1710fcbef2..24a3b0362f98 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h >>>>>>>> @@ -33,13 +33,11 @@ struct amdgpu_hive_info { >>>>>>>> struct kobject *kobj; >>>>>>>> struct device_attribute dev_attr; >>>>>>>> struct amdgpu_device *adev; >>>>>>>> - int pstate; /*0 -- low , 1 -- high , -1 unknown*/ >>>>>>>> }; >>>>>>>> struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct >>>>>>>> amdgpu_device *adev, int lock); >>>>>>>> int amdgpu_xgmi_update_topology(struct amdgpu_hive_info >>>>>>>> *hive, >>>>>>>> struct amdgpu_device *adev); >>>>>>>> int amdgpu_xgmi_add_device(struct amdgpu_device *adev); >>>>>>>> void amdgpu_xgmi_remove_device(struct amdgpu_device *adev); >>>>>>>> -int amdgpu_xgmi_set_pstate(struct amdgpu_device *adev, int >>>>>>>> pstate); >>>>>>>> #endif >>>>> _______________________________________________ >>>>> amd-gfx mailing list >>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx