> Because we can always rely on TDR and HYPERVISOR to detect GPU hang and resubmit malicious jobs or even kick them out later, > and the gpu reset will eventually be invoked, so there is no reason to manually and voluntarily call gpu reset under SRIOV case. Well there is a rather good reason, we detect that something is wrong much faster than waiting for the timeout. But I agree that it was broken before as well and we can fix that later. Please add a code comment that this needs more work. With that fixed feel free to add my rb on it. Christian. Am 08.05.2017 um 11:42 schrieb Liu, Monk: > The VM fault interrupt or illegal instruction will be delivered to GPU no matter it's SR-IOV or bare-metal case, > And I removed them from invoking GPU reset is due to the same reason: > Don't trigger gpu reset for sriov case if possible, always beware that trigger GPU reset under SR-IOV is a heavy cost (need take full access mode on GPU, so all > Other VFs will be paused for a while) > > Because we can always rely on TDR and HYPERVISOR to detect GPU hang and resubmit malicious jobs or even kick them out later, > and the gpu reset will eventually be invoked, so there is no reason to manually and voluntarily call gpu reset under SRIOV case. > > BR Monk > > > -----Original Message----- > From: Christian König [mailto:deathsimple at vodafone.de] > Sent: Monday, May 08, 2017 5:34 PM > To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org > Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in gpu-reset > > Sounds good, but what do we do with the amdgpu_irq_reset_work_func? > > Please note that I find that calling amdgpu_gpu_reset() here is a bad idea in the first place. > > Instead we should consider the scheduler as faulting and let the scheduler handle that as in the same way as a job timeout. > > But I'm not sure if those interrupts are actually send under SRIOV or if the hypervisor handles them somehow. > > Christian. > > Am 08.05.2017 um 11:24 schrieb Liu, Monk: >> I agree with disabling debugfs for amdgpu_reset when SRIOV detected. >> >> -----Original Message----- >> From: Christian König [mailto:deathsimple at vodafone.de] >> Sent: Monday, May 08, 2017 5:20 PM >> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org >> Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in >> gpu-reset >> >>> You know that gpu reset under SR-IOV will have very big impact on all other VFs ... >> Mhm, good argument. But in this case we need to give at least some warning message instead of doing nothing. >> >> Or even better disable creating the amdgpu_reste debugfs file altogether. This way nobody will wonder why using it doesn't trigger anything. >> >> Christian. >> >> Am 08.05.2017 um 11:10 schrieb Liu, Monk: >>> For SR-IOV use case, we call gpu reset under the case we have no choice ... >>> >>> So many places like debug fs shouldn't a good reason to trigger gpu >>> reset >>> >>> You know that gpu reset under SR-IOV will have very big impact on all other VFs ... >>> >>> BR Monk >>> >>> -----Original Message----- >>> From: Christian König [mailto:deathsimple at vodafone.de] >>> Sent: Monday, May 08, 2017 5:08 PM >>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org >>> Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in >>> gpu-reset >>> >>> Am 08.05.2017 um 08:51 schrieb Monk Liu: >>>> because we don't want to do sriov-gpu-reset under certain cases, so >>>> just split those two funtion and don't invoke sr-iov one from >>>> bare-metal one. >>>> >>>> Change-Id: I641126c241e2ee2dfd54e6d16c389b159f99cfe0 >>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 3 ++- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 3 ++- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 +++++- >>>> 5 files changed, 10 insertions(+), 7 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> index 45a60a6..4985a7e 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> @@ -2652,9 +2652,6 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev) >>>> int resched; >>>> bool need_full_reset; >>>> >>>> - if (amdgpu_sriov_vf(adev)) >>>> - return amdgpu_sriov_gpu_reset(adev, true); >>>> - >>>> if (!amdgpu_check_soft_reset(adev)) { >>>> DRM_INFO("No hardware hang detected. Did some blocks stall?\n"); >>>> return 0; >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>>> index 5772ef2..d7523d1 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c >>>> @@ -651,7 +651,8 @@ static int amdgpu_debugfs_gpu_reset(struct seq_file *m, void *data) >>>> struct amdgpu_device *adev = dev->dev_private; >>>> >>>> seq_printf(m, "gpu reset\n"); >>>> - amdgpu_gpu_reset(adev); >>>> + if (!amdgpu_sriov_vf(adev)) >>>> + amdgpu_gpu_reset(adev); >>> Well that is clearly not a good idea. Why do you want to disable the reset here? >>> >>>> >>>> return 0; >>>> } >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c >>>> index 67be795..5bcbea0 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c >>>> @@ -221,7 +221,7 @@ void amdgpu_gem_object_close(struct >>>> drm_gem_object *obj, >>>> >>>> static int amdgpu_gem_handle_lockup(struct amdgpu_device *adev, int r) >>>> { >>>> - if (r == -EDEADLK) { >>>> + if (r == -EDEADLK && !amdgpu_sriov_vf(adev)) { >>> Not a problem of your patch, but that stuff is outdated and should have been removed completely years ago. Going to take care of that. >>> >>>> r = amdgpu_gpu_reset(adev); >>>> if (!r) >>>> r = -EAGAIN; >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >>>> index f8a6c95..49c6e6e 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >>>> @@ -89,7 +89,8 @@ static void amdgpu_irq_reset_work_func(struct work_struct *work) >>>> struct amdgpu_device *adev = container_of(work, struct amdgpu_device, >>>> reset_work); >>>> >>>> - amdgpu_gpu_reset(adev); >>>> + if (!amdgpu_sriov_vf(adev)) >>>> + amdgpu_gpu_reset(adev); >>> Mhm, that disables the reset on an invalid register access or invalid command stream. Is that really what we want? >>> >>> Christian. >>> >>>> } >>>> >>>> /* Disable *all* interrupts */ >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>> index 690ef3d..c7718af 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>> @@ -36,7 +36,11 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job) >>>> job->base.sched->name, >>>> atomic_read(&job->ring->fence_drv.last_seq), >>>> job->ring->fence_drv.sync_seq); >>>> - amdgpu_gpu_reset(job->adev); >>>> + >>>> + if (amdgpu_sriov_vf(job->adev)) >>>> + amdgpu_sriov_gpu_reset(job->adev, true); >>>> + else >>>> + amdgpu_gpu_reset(job->adev); >>>> } >>>> >>>> int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned >>>> num_ibs, >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx at lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx