Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Mon, 19 Jul 2021 13:10:00 +0200

On 19.07.21 at 11:42, Liu, Monk wrote:
[AMD Official Use Only]

Besides, I think our current KMD has three types of kernel sdma jobs:
1) adev->mman.entity, which is already a KERNEL priority entity
2) vm->immediate
3) vm->delayed

Do you mean that vm->immediate or vm->delayed are now used for move jobs instead of mman.entity?

No, that's exactly the point. vm->immediate and vm->delayed are not for
kernel paging jobs.

Those are used for userspace page table updates.

I agree that those should probably not be considered guilty, but modifying
their priority is not the right way of doing that.

Regards,
Christian.


Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Liu, Monk
Sent: Monday, July 19, 2021 5:40 PM
To: 'Christian König' <ckoenig.leichtzumerken@xxxxxxxxx>; Chen, JingWen <JingWen.Chen2@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Chen, Horace <Horace.Chen@xxxxxxx>
Subject: RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority

If there are move jobs clashing there, we probably need to fix the bugs in those move jobs.

I believe you also remember that we previously agreed to always trust kernel jobs, especially paging jobs.

Without setting the paging jobs' priority to KERNEL level, how can we keep that protocol? Do you have a better idea?

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Sent: Monday, July 19, 2021 4:25 PM
To: Chen, JingWen <JingWen.Chen2@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Chen, Horace <Horace.Chen@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>
Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority

On 19.07.21 at 07:57, Jingwen Chen wrote:
[Why]
Current vm_pte entities have NORMAL priority, in SRIOV multi-vf use
case, the vf flr happens first and then job time out is found.
There can be several jobs timeout during a very small time slice.
And if the innocent sdma job time out is found before the real bad
job, then the innocent sdma job will be set to guilty as it only has
NORMAL priority. This will lead to a page fault after resubmitting
job.
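
For illustration only, the behaviour described above can be modelled in a few lines of standalone C. This is a toy sketch, not kernel code: the toy_* names, the karma counter and the limit parameter are hypothetical stand-ins, and the sketch simply assumes what this description states, namely that a timed-out job is blamed unless it belongs to a KERNEL priority entity.

```c
/* Toy model only: every toy_* name is hypothetical, not a DRM scheduler API. */
enum toy_priority { TOY_PRIORITY_NORMAL, TOY_PRIORITY_KERNEL };

struct toy_job {
	const char *name;
	enum toy_priority priority;
	int karma;	/* number of times this job was blamed for a timeout */
};

/* Blame a timed-out job, unless it came from a KERNEL priority entity. */
static void toy_increase_karma(struct toy_job *bad)
{
	if (bad->priority != TOY_PRIORITY_KERNEL)
		bad->karma++;
}

/* A job counts as guilty once its karma exceeds the configured limit. */
static int toy_job_is_guilty(const struct toy_job *job, int limit)
{
	return job->karma > limit;
}
```

In this model, a NORMAL priority page table update that times out first becomes guilty with a limit of 0, while a KERNEL priority job in the same situation keeps a karma of 0 and stays eligible for resubmission.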

[How]
sdma should always have KERNEL priority. The kernel job will always be
resubmitted.

I'm not sure if that is a good idea. We intentionally didn't give the page table updates kernel priority, to avoid clashing with the move jobs.

Christian.

Signed-off-by: Jingwen Chen <Jingwen.Chen2@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 358316d6a38c..f7526b67cc5d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
   	INIT_LIST_HEAD(&vm->done);
   
   	/* create scheduler entities for page table updates */
-	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
+	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_KERNEL,
   				  adev->vm_manager.vm_pte_scheds,
   				  adev->vm_manager.vm_pte_num_scheds, NULL);
   	if (r)
   		return r;
   
-	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
+	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
   				  adev->vm_manager.vm_pte_scheds,
   				  adev->vm_manager.vm_pte_num_scheds, NULL);
   	if (r)
