Sounds feasible, only this step looks need some investigation on hardware side: >in gfx_v9_0_priv_reg_irq() (and all the other IRQ handlers as well) we add functionality to figure out the ring/scheduler which caused the illegal operation. e.g. : gfx_v9_priv_reg_irq(struct amdgpu_device *adev, strut amdgpu_irq_src *source, struct amdgpu_iv_entry *entry){ //in this function, need to figure out which ring cause this error from source/entry parameter } -----Original Message----- From: Koenig, Christian Sent: 2017å¹´11æ??1æ?¥ 16:42 To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org Subject: Re: [PATCH 2/2] drm/amdgpu:cleanup deprecated gpu reset logic Am 01.11.2017 um 04:29 schrieb Liu, Monk: > The thing is triggering gpu_recover() in irq routine give you NULL for the @bad/job parameter, so gpu_recover() actually did nothing meaningful, it just repeat scheduling un-signaled jobs again and again, and finally your GPU is stuck with infinite recovering .... Yeah, completely agree that this isn't a good implementation. But we need to somehow do better than that. > here is my initial thought: > In " gfx_v9_0_priv_reg_irq(struct amdgpu_device *adev, struct amdgpu_iv_entry *entry)" routine: > For lockup_timeout==0 case: we put a new parameter "@sched" to > amd_gpu_recover(), that way although we don't have @bad/job, but with @sched we can still find out the pending job on his scheduler, But that make code ugly, cuz we need to change amdgpu_gpu_recover(adev, job) to amdgpu_gpu_recover(adev, job, sched). Anyway need to find a way to tell gpu_recover() at least which ring has problem. > > For lockup_timeout!=0 case: did nothing and just return since TDR will correct this error and do recover() gracefully. I think waiting for the TDR will take to long. How about this instead: 1. In gfx_v9_0_priv_reg_irq() (and all the other IRQ handlers as well) we add functionality to figure out the ring/scheduler which caused the illegal operation. 2. Then we add a new function amd_sched_job_fault() which gets the scheduler which had a fault as parameter. 3. amd_sched_job_fault() then grabs job_list_lock and takes the first job from the ring_mirror_list. 4. We call cancel_delayed_work() to cancel the timeout of the job. 5. If cancel_delayed_work() returns true (which means the TDR isn't already running anyway) we call schedule_delayed_work() with a zero timeout. This way the actual timeout value is truncated to zero and we run the recovery immediately just in the same way as it would when we have waited. Should be trivial to implement actually. What do you think? Regards, Christian. > > > > -----Original Message----- > From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com] > Sent: 2017å¹´10æ??31æ?¥ 23:01 > To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org > Subject: Re: [PATCH 2/2] drm/amdgpu:cleanup deprecated gpu reset logic > > Am 31.10.2017 um 12:55 schrieb Monk Liu: >> trigger gpu reset/recovery from illegle instruction IRQ is deprecated >> long time ago, we already switch to recover gpu by TDR triggering. >> >> now please set lockup_timeout to non-zero value in driver loading to >> enable TDR. > The patch is ok, but NAK to the general approach. We should have illegal instruction/access alongside the timeout. > > But instead of trying to trigger the reset directly inform the scheduler that the we detected a problem on the engine. The scheduler can then cancel the timeout and do the appropriate things. > > This patch would be ok if you install this new functionality directly after removing the old (and broken) one. > > Regards, > Christian. > >> Change-Id: I45a576a97fd9859e1098e785ce857c2cf5adfba5 >> Signed-off-by: Monk Liu <Monk.Liu at amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 - >> drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 21 --------------------- >> drivers/gpu/drm/amd/amdgpu/cik_sdma.c | 1 - >> drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c | 2 -- >> drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c | 3 --- >> drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 2 -- >> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 -- >> drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c | 1 - >> drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c | 1 - >> drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 1 - >> drivers/gpu/drm/amd/amdgpu/si_dma.c | 1 - >> 11 files changed, 36 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >> index 6e89be5..b3453e3 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >> @@ -1459,7 +1459,6 @@ struct amdgpu_device { >> bool shutdown; >> bool need_dma32; >> bool accel_working; >> - struct work_struct reset_work; >> struct notifier_block acpi_nb; >> struct amdgpu_i2c_chan *i2c_bus[AMDGPU_MAX_I2C_BUS]; >> struct amdgpu_debugfs debugfs[AMDGPU_DEBUGFS_MAX_COMPONENTS]; >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >> index c340774..989b530 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >> @@ -73,23 +73,6 @@ static void amdgpu_hotplug_work_func(struct work_struct *work) >> drm_helper_hpd_irq_event(dev); >> } >> >> -/** >> - * amdgpu_irq_reset_work_func - execute gpu reset >> - * >> - * @work: work struct >> - * >> - * Execute scheduled gpu reset (cayman+). >> - * This function is called when the irq handler >> - * thinks we need a gpu reset. >> - */ >> -static void amdgpu_irq_reset_work_func(struct work_struct *work) -{ >> - struct amdgpu_device *adev = container_of(work, struct amdgpu_device, >> - reset_work); >> - >> - if (!amdgpu_sriov_vf(adev)) >> - amdgpu_gpu_recover(adev, NULL); >> -} >> >> /* Disable *all* interrupts */ >> static void amdgpu_irq_disable_all(struct amdgpu_device *adev) @@ >> -251,14 +234,11 @@ int amdgpu_irq_init(struct amdgpu_device *adev) >> amdgpu_hotplug_work_func); >> } >> >> - INIT_WORK(&adev->reset_work, amdgpu_irq_reset_work_func); >> - >> adev->irq.installed = true; >> r = drm_irq_install(adev->ddev, adev->ddev->pdev->irq); >> if (r) { >> adev->irq.installed = false; >> flush_work(&adev->hotplug_work); >> - cancel_work_sync(&adev->reset_work); >> return r; >> } >> >> @@ -283,7 +263,6 @@ void amdgpu_irq_fini(struct amdgpu_device *adev) >> if (adev->irq.msi_enabled) >> pci_disable_msi(adev->pdev); >> flush_work(&adev->hotplug_work); >> - cancel_work_sync(&adev->reset_work); >> } >> >> for (i = 0; i < AMDGPU_IH_CLIENTID_MAX; ++i) { diff --git >> a/drivers/gpu/drm/amd/amdgpu/cik_sdma.c >> b/drivers/gpu/drm/amd/amdgpu/cik_sdma.c >> index ed26dcb..c8d9bc1 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/cik_sdma.c >> +++ b/drivers/gpu/drm/amd/amdgpu/cik_sdma.c >> @@ -1227,7 +1227,6 @@ static int cik_sdma_process_illegal_inst_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal instruction in SDMA command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c >> b/drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c >> index 9430d48..25b32b7 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c >> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c >> @@ -3443,7 +3443,6 @@ static int gfx_v6_0_priv_reg_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal register access in command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> @@ -3452,7 +3451,6 @@ static int gfx_v6_0_priv_inst_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal instruction in command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c >> b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c >> index 3c2b15a..b0e127d 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c >> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c >> @@ -5020,7 +5020,6 @@ static int gfx_v7_0_priv_reg_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal register access in command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> @@ -5029,8 +5028,6 @@ static int gfx_v7_0_priv_inst_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal instruction in command stream\n"); >> - // XXX soft reset the gfx block only >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c >> b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c >> index a74515a..5fd4996 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c >> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c >> @@ -6795,7 +6795,6 @@ static int gfx_v8_0_priv_reg_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal register access in command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> @@ -6804,7 +6803,6 @@ static int gfx_v8_0_priv_inst_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal instruction in command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c >> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c >> index 9855dc0..dce1960 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c >> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c >> @@ -4136,7 +4136,6 @@ static int gfx_v9_0_priv_reg_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal register access in command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> @@ -4145,7 +4144,6 @@ static int gfx_v9_0_priv_inst_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal instruction in command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c >> b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c >> index 92f8c44..7e2580c 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c >> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c >> @@ -1161,7 +1161,6 @@ static int sdma_v2_4_process_illegal_inst_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal instruction in SDMA command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c >> b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c >> index 52e6bf2..c12f994 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c >> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c >> @@ -1480,7 +1480,6 @@ static int sdma_v3_0_process_illegal_inst_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal instruction in SDMA command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c >> b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c >> index fe78c00..09b7586 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c >> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c >> @@ -1416,7 +1416,6 @@ static int sdma_v4_0_process_illegal_inst_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal instruction in SDMA command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/si_dma.c >> b/drivers/gpu/drm/amd/amdgpu/si_dma.c >> index ee469a9..408e145 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/si_dma.c >> +++ b/drivers/gpu/drm/amd/amdgpu/si_dma.c >> @@ -683,7 +683,6 @@ static int si_dma_process_illegal_inst_irq(struct amdgpu_device *adev, >> struct amdgpu_iv_entry *entry) >> { >> DRM_ERROR("Illegal instruction in SDMA command stream\n"); >> - schedule_work(&adev->reset_work); >> return 0; >> } >> >