On Thu, Aug 09, 2018 at 08:27:10PM +0800, Koenig, Christian wrote: > Am 09.08.2018 um 14:25 schrieb Huang Rui: > > On Thu, Aug 09, 2018 at 03:18:55PM +0800, Koenig, Christian wrote: > >> Am 09.08.2018 um 08:18 schrieb Huang Rui: > >>> On Wed, Aug 08, 2018 at 06:47:49PM +0800, Christian König wrote: > >>>> Am 08.08.2018 um 11:59 schrieb Huang Rui: > >>>>> I continue to work for bulk moving that based on the proposal by Christian. > >>>>> > >>>>> Background: > >>>>> amdgpu driver will move all PD/PT and PerVM BOs into idle list. Then move all of > >>>>> them on the end of LRU list one by one. Thus, that cause so many BOs moved to > >>>>> the end of the LRU, and impact performance seriously. > >>>>> > >>>>> Then Christian provided a workaround to not move PD/PT BOs on LRU with below > >>>>> patch: > >>>>> "drm/amdgpu: band aid validating VM PTs" > >>>>> Commit 0bbf32026cf5ba41e9922b30e26e1bed1ecd38ae > >>>>> > >>>>> However, the final solution should bulk move all PD/PT and PerVM BOs on the LRU > >>>>> instead of one by one. > >>>>> > >>>>> Whenever amdgpu_vm_validate_pt_bos() is called and we have BOs which need to be > >>>>> validated we move all BOs together to the end of the LRU without dropping the > >>>>> lock for the LRU. > >>>>> > >>>>> While doing so we note the beginning and end of this block in the LRU list. > >>>>> > >>>>> Now when amdgpu_vm_validate_pt_bos() is called and we don't have anything to do, > >>>>> we don't move every BO one by one, but instead cut the LRU list into pieces so > >>>>> that we bulk move everything to the end in just one operation. > >>>>> > >>>>> Test data: > >>>>> +--------------+-----------------+-----------+---------------------------------------+ > >>>>> | |The Talos |Clpeak(OCL)|BusSpeedReadback(OCL) | > >>>>> | |Principle(Vulkan)| | | > >>>>> +------------------------------------------------------------------------------------+ > >>>>> | | | |0.319 ms(1k) 0.314 ms(2K) 0.308 ms(4K) | > >>>>> | Original | 147.7 FPS | 76.86 us |0.307 ms(8K) 0.310 ms(16K) | > >>>>> +------------------------------------------------------------------------------------+ > >>>>> | Orignial + WA| | |0.254 ms(1K) 0.241 ms(2K) | > >>>>> |(don't move | 162.1 FPS | 42.15 us |0.230 ms(4K) 0.223 ms(8K) 0.204 ms(16K)| > >>>>> |PT BOs on LRU)| | | | > >>>>> +------------------------------------------------------------------------------------+ > >>>>> | Bulk move | 163.1 FPS | 40.52 us |0.244 ms(1K) 0.252 ms(2K) 0.213 ms(4K) | > >>>>> | | | |0.214 ms(8K) 0.225 ms(16K) | > >>>>> +--------------+-----------------+-----------+---------------------------------------+ > >>>>> > >>>>> After test them with above three benchmarks include vulkan and opencl. We can > >>>>> see the visible improvement than original, and even better than original with > >>>>> workaround. > >>>>> > >>>>> Signed-off-by: Christian König <christian.koenig at amd.com> > >>>>> Signed-off-by: Huang Rui <ray.huang at amd.com> > >>>>> --- > >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 22 ++++++++++++++-------- > >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 ++++ > >>>>> 2 files changed, 18 insertions(+), 8 deletions(-) > >>>>> > >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > >>>>> index 9c84770..eda0bb9 100644 > >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > >>>>> @@ -286,6 +286,7 @@ int amdgpu_vm_validate_pt_bos(struct amdgpu_device *adev, struct amdgpu_vm *vm, > >>>>> { > >>>>> struct ttm_bo_global *glob = adev->mman.bdev.glob; > >>>>> struct amdgpu_vm_bo_base *bo_base, *tmp; > >>>>> + bool validated = false; > >>>>> int r = 0; > >>>>> > >>>>> list_for_each_entry_safe(bo_base, tmp, &vm->evicted, vm_status) { > >>>>> @@ -295,14 +296,9 @@ int amdgpu_vm_validate_pt_bos(struct amdgpu_device *adev, struct amdgpu_vm *vm, > >>>>> r = validate(param, bo); > >>>>> if (r) > >>>>> break; > >>>>> - > >>>>> - spin_lock(&glob->lru_lock); > >>>>> - ttm_bo_move_to_lru_tail(&bo->tbo, NULL); > >>>>> - if (bo->shadow) > >>>>> - ttm_bo_move_to_lru_tail(&bo->shadow->tbo, NULL); > >>>>> - spin_unlock(&glob->lru_lock); > >>>>> } > >>>>> > >>>>> + validated = true; > >>>>> if (bo->tbo.type != ttm_bo_type_kernel) { > >>>>> spin_lock(&vm->moved_lock); > >>>>> list_move(&bo_base->vm_status, &vm->moved); > >>>>> @@ -312,6 +308,15 @@ int amdgpu_vm_validate_pt_bos(struct amdgpu_device *adev, struct amdgpu_vm *vm, > >>>>> } > >>>>> } > >>>>> > >>>>> + if (!validated) { > >>>>> + spin_lock(&glob->lru_lock); > >>>>> + ttm_bo_bulk_move_lru_tail(&vm->lru_bulk_move); > >>>>> + spin_unlock(&glob->lru_lock); > >>>>> + return 0; > >>>>> + } > >>>>> + > >>>>> + memset(&vm->lru_bulk_move, 0, sizeof(vm->lru_bulk_move)); > >>>>> + > >>>>> spin_lock(&glob->lru_lock); > >>>>> list_for_each_entry(bo_base, &vm->idle, vm_status) { > >>>>> struct amdgpu_bo *bo = bo_base->bo; > >>>>> @@ -319,9 +324,10 @@ int amdgpu_vm_validate_pt_bos(struct amdgpu_device *adev, struct amdgpu_vm *vm, > >>>>> if (!bo->parent) > >>>>> continue; > >>>>> > >>>>> - ttm_bo_move_to_lru_tail(&bo->tbo, NULL); > >>>>> + ttm_bo_move_to_lru_tail(&bo->tbo, &vm->lru_bulk_move); > >>>>> if (bo->shadow) > >>>>> - ttm_bo_move_to_lru_tail(&bo->shadow->tbo, NULL); > >>>>> + ttm_bo_move_to_lru_tail(&bo->shadow->tbo, > >>>>> + &vm->lru_bulk_move); > >>>> Here is still one problem: The BOs currently evicted are not yet in the > >>>> idle state. > >>>> > >>>> So this must be moved to the end of the state machine, probably best if > >>>> we put that into a separate function. > >>>> > >>> Thanks, so the evicted BOs should be behind the BOs in idle list that in > >>> the group between (first, last) of bulk_move_pos, right? > >> The order doesn't matter. > >> > >>> spin_lock(&glob->lru_lock); > >>> list_for_each_entry(bo_base, &vm->idle, vm_status) { > >>> struct amdgpu_bo *bo = bo_base->bo; > >>> > >>> if (!bo->parent) > >>> continue; > >>> > >>> ttm_bo_move_to_lru_tail(&bo->tbo, &vm->lru_bulk_move); > >>> if (bo->shadow) > >>> ttm_bo_move_to_lru_tail(&bo->shadow->tbo, > >>> &vm->lru_bulk_move); > >>> } > >>> list_for_each_entry_safe(bo_base, tmp, &vm->evicted, vm_status) { > >> That won't work. The BOs get removed from the evicted list when they are > >> validated. > >> > >> You need to walk both the relocated as well as the moved list here. > >> > > Is that because, at the start of the function, we already walk the evicted > > list, and some of BOs are validated and move to moved or relocated list, so > > here we need also walk both of them to make them together for bulk move? > > Yes, exactly. > > Alternatively you can add a new function which is called after command > submission is done to move everything on the LRU. > > At this point all BOs should now be on the idle list again and you only > need to walk that one. Do you mean we move all BOs to idle list again in amdgpu_vm_bo_update of amdgpu_cs_ib_vm_chunk subsequently? But at that time, I see if the BO is evicted, it will be moved to evicted list, not to idle list again. Am I missed something? Thanks, Ray > > Regards, > Christian. > > > > > Thanks, > > Ray > > > >> Christian. > >> > >>> struct amdgpu_bo *bo = bo_base->bo; > >>> if (!bo) > >>> continue; > >>> > >>> ttm_bo_move_to_lru_tail(&bo->tbo, &vm->lru_bulk_move); > >>> if (bo->shadow) > >>> ttm_bo_move_to_lru_tail(&bo->shadow->tbo, > >>> &vm->lru_bulk_move); > >>> } > >>> spin_unlock(&glob->lru_lock); > >>> > >>> Thanks, > >>> Ray > >>> > >>>> Regards, > >>>> Christian. > >>>> > >>>>> } > >>>>> spin_unlock(&glob->lru_lock); > >>>>> > >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > >>>>> index 67a15d4..92725ac 100644 > >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > >>>>> @@ -29,6 +29,7 @@ > >>>>> #include <linux/rbtree.h> > >>>>> #include <drm/gpu_scheduler.h> > >>>>> #include <drm/drm_file.h> > >>>>> +#include <drm/ttm/ttm_bo_driver.h> > >>>>> > >>>>> #include "amdgpu_sync.h" > >>>>> #include "amdgpu_ring.h" > >>>>> @@ -226,6 +227,9 @@ struct amdgpu_vm { > >>>>> > >>>>> /* Some basic info about the task */ > >>>>> struct amdgpu_task_info task_info; > >>>>> + > >>>>> + /* Store positions of group of BOs */ > >>>>> + struct ttm_lru_bulk_move lru_bulk_move; > >>>>> }; > >>>>> > >>>>> struct amdgpu_vm_manager { >