On 10/19/2018 03:08 AM, Koenig, Christian wrote: > Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky: >> A ring might become unusable after reset, if that the case >> drm_sched_entity_select_rq will choose another, working rq >> to run the job if there is one. >> Also, skip recovery of ring which is not ready. > Well that is not even remotely sufficient. > > If we can't bring up a ring any more after a reset we would need to move > all jobs which where previously scheduled on it and all of its entities > to a different instance. That one should be easy to add inside amdgpu_device_gpu_recover in case ring is dead, we just do the same for all the jobs in sched->ring_mirror_list of the dead ring before doing recovery for them, no? > > What you do here breaks dependencies between jobs and can result in > unforeseen consequences including random memory writes. Can you explain this a bit more ? AFAIK any job dependent on this job will still wait for it's completion before running regardless of did this job moved to a different ring. What am I missing ? > > As far as I can see that can't be done correctly with the current > scheduler design. > > Additional to that when we can't restart one instance of a ring we > usually can't restart all others either. So the whole approach is rather > pointless. From my testing looks like we can, compute ring 0 is dead but IB tests pass on other compute rings. Andrey > > Regards, > Christian. > >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++- >> 1 file changed, 15 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> index d11489e..3124ca1 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, >> else >> r = amdgpu_device_reset(adev); >> >> + /* >> + * After reboot a ring might fail in which case this will >> + * move the job to different rq if possible >> + */ >> + if (job) { >> + drm_sched_entity_select_rq(job->base.entity); >> + if (job->base.entity->rq) { >> + job->base.sched = job->base.entity->rq->sched; >> + } else { >> + job->base.sched = NULL; >> + r = -ENOENT; >> + } >> + } >> + >> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { >> struct amdgpu_ring *ring = adev->rings[i]; >> >> - if (!ring || !ring->sched.thread) >> + if (!ring || !ring->ready || !ring->sched.thread) >> continue; >> >> /* only need recovery sched of the given job's ring > _______________________________________________ > amd-gfx mailing list > amd-gfx@xxxxxxxxxxxxxxxxxxxxx > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx