First of all, patches #10 and #11 look like bug fixes to existing code to me, so we should fix those problems before working on anything else.

Patch #10 is Reviewed-by: Christian König <christian.koenig at amd.com>.

Patch #11:

>     list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
>         struct amd_sched_fence *s_fence = s_job->s_fence;
> -       struct fence *fence = sched->ops->run_job(s_job);
> +       struct fence *fence;
>
> +       spin_unlock(&sched->job_list_lock);
> +       fence = sched->ops->run_job(s_job);
>         atomic_inc(&sched->hw_rq_count);
>         if (fence) {
>             s_fence->parent = fence_get(fence);
> @@ -451,6 +453,7 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
>             DRM_ERROR("Failed to run job!\n");
>             amd_sched_process_job(NULL, &s_fence->cb);
>         }
> +       spin_lock(&sched->job_list_lock);
>     }
>     spin_unlock(&sched->job_list_lock);

The problem is that the job might complete while we have dropped the lock. Please use list_for_each_entry_safe here and add a comment explaining why the list can be modified in the meantime (see the sketch at the end of this mail). With that fixed the patch is Reviewed-by: Christian König <christian.koenig at amd.com> as well.

The remaining set looks very good to me as well, but I was rather thinking of a more general approach instead of making it VM PD/PT specific. For example, we also need to back up/restore shaders when a hard GPU reset happens.

So I would suggest the following:

1. We add an optional "shadow" flag so that when a BO is allocated in VRAM we also allocate a shadow BO in GART.

2. We add another "backup" flag that says the BO is backed up from VRAM to GART before the next command submission.

3. We set the shadow flag for VM PD/PT BOs and set the backup flag every time we modify them, so they get backed up on the next CS.

4. We add an IOCTL to allow setting the backup flag from userspace, so that we can trigger another backup even after the first CS.

A rough sketch of what I have in mind is at the end of this mail as well. What do you think?

Regards,
Christian.

On 25.07.2016 at 09:22, Chunming Zhou wrote:
> Since we cannot make sure VRAM is safe after a GPU reset, a page table backup
> is necessary; a shadow page table is a sensible way to recover the page table
> when a GPU reset happens.
> We need to allocate a GTT bo as the shadow of the VRAM bo when creating a page
> table, and keep their contents identical. After a GPU reset, we use SDMA to
> copy the GTT bo content back to the VRAM bo, and the page table is recovered.
>
> Chunming Zhou (13):
>   drm/amdgpu: add pd/pt bo shadow
>   drm/amdgpu: update shadow pt bo while update pt
>   drm/amdgpu: update pd shadow while updating pd
>   drm/amdgpu: implement amdgpu_vm_recover_page_table_from_shadow
>   drm/amdgpu: link all vm clients
>   drm/amdgpu: add vm_list_lock
>   drm/amd: add block entity function
>   drm/amdgpu: recover page tables after gpu reset
>   drm/amdgpu: add vm recover pt fence
>   drm/amd: reset hw count when reset job
>   drm/amd: fix deadlock of job_list_lock
>   drm/amd: wait neccessary dependency before running job
>   drm/amdgpu: fix sched deadoff
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  17 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  12 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  30 ++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c       |   5 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   5 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        | 161 ++++++++++++++++++++++++--
>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.c |  35 +++++-
>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   3 +
>  8 files changed, 250 insertions(+), 18 deletions(-)
>
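
For reference, a minimal sketch of the _safe iteration I mean for patch #11.
The elided middle of the hunk is filled in from the scheduler code as I
remember it, so take this as an illustration rather than the final patch:

/* Sketch only: amd_sched_job_recovery() with job_list_lock dropped around
 * run_job().  Because the lock is released, a finished job can be removed
 * from ring_mirror_list concurrently, so the safe iterator is needed to
 * keep the next pointer valid.
 */
void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
{
    struct amd_sched_job *s_job, *tmp;
    int r;

    spin_lock(&sched->job_list_lock);
    list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
        struct amd_sched_fence *s_fence = s_job->s_fence;
        struct fence *fence;

        /* run_job() may sleep and the job may complete and drop off the
         * mirror list while the lock is not held, hence the _safe
         * iteration above. */
        spin_unlock(&sched->job_list_lock);
        fence = sched->ops->run_job(s_job);
        atomic_inc(&sched->hw_rq_count);

        if (fence) {
            s_fence->parent = fence_get(fence);
            r = fence_add_callback(fence, &s_fence->cb,
                                   amd_sched_process_job);
            if (r == -ENOENT)
                amd_sched_process_job(fence, &s_fence->cb);
            else if (r)
                DRM_ERROR("fence add callback failed (%d)\n", r);
            fence_put(fence);
        } else {
            DRM_ERROR("Failed to run job!\n");
            amd_sched_process_job(NULL, &s_fence->cb);
        }

        spin_lock(&sched->job_list_lock);
    }
    spin_unlock(&sched->job_list_lock);
}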
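
And for the shadow/backup idea, a very rough sketch of points 1-3. The flag
names and values, the new amdgpu_bo members and the helper are all invented
for this mail and are not existing API; the amdgpu_copy_buffer() call is only
meant to show where the SDMA copy would be scheduled and its exact signature
may differ:

/* Rough sketch only -- flags, members and helper are invented here. */

/* New GEM creation flags next to the existing AMDGPU_GEM_CREATE_* ones. */
#define AMDGPU_GEM_CREATE_SHADOW    (1 << 4)  /* also allocate a GART shadow BO */
#define AMDGPU_GEM_CREATE_BACKUP    (1 << 5)  /* copy VRAM->shadow before next CS */

/* New members in struct amdgpu_bo: */
struct amdgpu_bo {
    /* ... existing members ... */
    struct amdgpu_bo    *shadow;          /* GART copy, only with SHADOW set */
    bool                backup_pending;   /* set whenever the BO is modified */
};

/* Called from the CS path for every BO on the list: if a backup was
 * requested, schedule an SDMA copy from VRAM to the GART shadow before the
 * submission runs.  After a GPU reset the copy goes the other way around
 * to restore the VRAM content. */
static int amdgpu_bo_backup_to_shadow(struct amdgpu_device *adev,
                                      struct amdgpu_ring *ring,
                                      struct amdgpu_bo *bo,
                                      struct fence **fence)
{
    if (!bo->shadow || !bo->backup_pending)
        return 0;

    bo->backup_pending = false;

    /* exact copy helper/signature to be decided, this is just the idea */
    return amdgpu_copy_buffer(ring, amdgpu_bo_gpu_offset(bo),
                              amdgpu_bo_gpu_offset(bo->shadow),
                              amdgpu_bo_size(bo), NULL, fence);
}

Point 4 would then just be a small IOCTL that sets backup_pending again from
userspace.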