Hi Christian,

Trigger's patch intends to fix the double signaling on a dma_fence in the scenario below:

1) gpu_recovery(job = some job) is invoked by the guest TDR ==> the hanging job's fence error is set to -ECANCELED and the fence is fake-signaled.
2) gpu_recovery(job = NULL) is invoked again by the hypervisor or by KFD, but the job from step 1) has not yet run its drm_sched_job_finish routine, which is kicked off via queue_work(), because the second gpu_recovery() is called before drm_sched_job_finish executes. So the job has not been taken off the mirror_list, dma_fence_set_error is called on it a second time, and you get the warning message.

/Monk

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Huang, Trigger
Sent: Monday, November 12, 2018 4:36 PM
To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: RE: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

Hi Christian,

Thanks for the correction. My explanation is below; could you please check it again? Thanks in advance.

The key point is: if a job's fence has already been signaled, calling dma_fence_set_error on that fence triggers a kernel warning call trace:

static inline void dma_fence_set_error(struct dma_fence *fence, int error)
{
	WARN_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)); /* <== here is the warning source */
	WARN_ON(error >= 0 || error < -MAX_ERRNO);

	fence->error = error;
}

Now let's look at a guilty job's processing flow in TDR:

amdgpu_device_gpu_recover
  -> drm_sched_job_recovery
    -> for each job in ring_mirror_list:
       1) dma_fence_set_error(&s_fence->finished, -ECANCELED) if this job is guilty
       2) amdgpu_job_run: the guilty job is skipped and not submitted to the ring
       3) drm_sched_process_job(guilty_job)
          -> drm_sched_fence_finished
            -> dma_fence_signal(&fence->finished)
              -> drm_sched_job_finish_cb
                -> schedule_work(&job->finish_work);

Later, finish_work's callback function, drm_sched_job_finish, is executed by the work queue, and the guilty job is finally deleted from the ring_mirror_list.

But sometimes amdgpu_device_gpu_recover is called again immediately (from a host FLR interrupt or from KFD). At that moment finish_work has not yet been run by the work queue and drm_sched_job_finish has not been called for the guilty job, so the guilty job is still on the ring_mirror_list. Remember, though, that the guilty job's fence was already signaled (by drm_sched_fence_finished), so when the new TDR pass calls dma_fence_set_error on this guilty job again, it produces the kernel warning call trace.

So the root cause is that finish_work is not executed quickly enough for the GPU TDR scenario. Three ideas:

1) Instead of using the system-global work queue, create a dedicated work queue to handle the job-finish work. But this still cannot guarantee that finish_work has run before the next GPU TDR.
2) Wait for all the guilty jobs' finish_work items to complete before returning from drm_sched_job_recovery, but I think this wastes too much time (see the rough sketch below).
3) Delete the guilty job from the ring_mirror_list directly after it is processed in amdgpu_job_run(). amdgpu_job_run is the final entry point for submitting a job to the ring; if something is wrong with the job (finished->error < 0), it should never be submitted again and so should be removed from the recovery list.

I choose 3).
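For completeness, here is a rough, untested sketch of what idea 2) would have looked like. The helper name drm_sched_flush_bad_jobs is hypothetical, and a real implementation would also have to keep each job alive across the unlocked flush_work() (for example by taking a reference), which only adds to the complexity:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/*
 * Hypothetical sketch of idea 2): block until every guilty job's
 * finish_work has executed, so the bad jobs are guaranteed to be off
 * the ring_mirror_list before the next gpu_recover can come in.
 */
static void drm_sched_flush_bad_jobs(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *s_job;

retry:
	spin_lock(&sched->job_list_lock);
	list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
		if (s_job->s_fence->finished.error < 0) {
			/* flush_work() sleeps, so the spinlock must be
			 * dropped first; a reference on s_job would be
			 * needed here to keep it valid across the flush.
			 */
			spin_unlock(&sched->job_list_lock);
			/* Wait for drm_sched_job_finish() to take the job
			 * off the ring_mirror_list and free it.
			 */
			flush_work(&s_job->finish_work);
			goto retry;
		}
	}
	spin_unlock(&sched->job_list_lock);
}

Every flush_work() here blocks the whole recovery path until the work queue has run the corresponding finish_work, which is exactly the extra waiting I want to avoid, so I went with 3) instead.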
I am going to make a new patch, as below, instead of the original one:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e0af44f..aaf697f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -208,6 +208,7 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
 static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 {
 	struct amdgpu_ring *ring = to_amdgpu_ring(sched_job->sched);
+	struct drm_gpu_scheduler *sched = sched_job->sched;
 	struct dma_fence *fence = NULL, *finished;
 	struct amdgpu_job *job;
 	int r;
@@ -223,7 +224,10 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
 
 	if (finished->error < 0) {
-		DRM_INFO("Skip scheduling IBs!\n");
+		DRM_INFO("Skip scheduling IBs and delete job from recovery list!\n");
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&sched_job->node);
+		spin_unlock(&sched->job_list_lock);
 	} else {
 		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
 				       &fence);

Thanks,
Trigger

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Sent: Friday, November 09, 2018 8:18 PM
To: Huang, Trigger <Trigger.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR

On 09.11.18 at 13:10, Trigger Huang wrote:
> A bad job is the one that triggered the TDR. In the recovery process,
> its fence will be signaled and, as a result, the work scheduled behind
> it will delete it from the mirror list. But if the TDR process is
> invoked before that work executes, the call to dma_fence_set_error on
> its fence in the TDR process leads to a kernel warning trace:
>
> [ 143.033605] WARNING: CPU: 2 PID: 53 at ./include/linux/dma-fence.h:437 amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
> [ 143.033606] Modules linked in: amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) amd_iommu_v2 drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 snd_hda_codec_generic crypto_simd glue_helper cryptd snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev snd_seq_device snd_timer snd soundcore binfmt_misc input_leds mac_hid serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 8139too floppy psmouse 8139cp mii i2c_piix4 pata_acpi
> [ 143.033649] CPU: 2 PID: 53 Comm: kworker/2:1 Tainted: G OE 4.15.0-20-generic #21-Ubuntu
> [ 143.033650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [ 143.033653] Workqueue: events drm_sched_job_timedout [amd_sched]
> [ 143.033656] RIP: 0010:amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
> [ 143.033657] RSP: 0018:ffffa9f880fe7d48 EFLAGS: 00010202
> [ 143.033659] RAX: 0000000000000007 RBX: ffff9b98f2b24c00 RCX: ffff9b98efef4f08
> [ 143.033660] RDX: ffff9b98f2b27400 RSI: ffff9b98f2b24c50 RDI: ffff9b98efef4f18
> [ 143.033660] RBP: ffffa9f880fe7d98 R08: 0000000000000001 R09: 00000000000002b6
> [ 143.033661] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b98efef3430
> [ 143.033662] R13: ffff9b98efef4d80 R14: ffff9b98efef4e98 R15: ffff9b98eaf91c00
> [ 143.033663] FS:  0000000000000000(0000) GS:ffff9b98ffd00000(0000) knlGS:0000000000000000
> [ 143.033664] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 143.033665] CR2: 00007fc49c96d470 CR3: 000000001400a005 CR4: 00000000003606e0
> [ 143.033669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 143.033669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 143.033670] Call Trace:
> [ 143.033744]  amdgpu_device_gpu_recover+0x144/0x820 [amdgpu]
> [ 143.033788]  amdgpu_job_timedout+0x9b/0xa0 [amdgpu]
> [ 143.033791]  drm_sched_job_timedout+0xcc/0x150 [amd_sched]
> [ 143.033795]  process_one_work+0x1de/0x410
> [ 143.033797]  worker_thread+0x32/0x410
> [ 143.033799]  kthread+0x121/0x140
> [ 143.033801]  ? process_one_work+0x410/0x410
> [ 143.033803]  ? kthread_create_worker_on_cpu+0x70/0x70
> [ 143.033806]  ret_from_fork+0x35/0x40
>
> So just delete the bad job from the mirror list directly after
> signaling its fence.
>
> Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 18ebbb0..b2832fb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -228,7 +228,10 @@ static void drm_sched_job_finish(struct work_struct *work)
>  
>  	spin_lock(&sched->job_list_lock);
>  	/* remove job from ring_mirror_list */
> -	list_del(&s_job->node);
> +	/* This node may already have been deleted in job recovery */
> +	if (s_job->node.next != &(s_job->node))
> +		list_del_init(&s_job->node);
> +

Well, that doesn't make any sense at all. If the list_head is already empty (pointing to itself), list_del is a no-op.

>  	/* queue TDR for next job */
>  	drm_sched_start_timeout(sched);
>  	spin_unlock(&sched->job_list_lock);
> @@ -394,6 +397,19 @@ void drm_sched_job_recovery(struct drm_gpu_scheduler *sched)
>  			drm_sched_process_job(NULL, &s_fence->cb);
>  		}
>  		spin_lock(&sched->job_list_lock);
> +		/**
> +		 * Normally the 'bad' job will be deleted by its finish_work
> +		 * after its fence is signaled, but sometimes a GPU recovery
> +		 * is invoked before that work has executed (for example,
> +		 * KFD/host triggered a GPU reset while the current one is
> +		 * still on-going). Then the 'bad' job may be processed
> +		 * again, which is unnecessary and also causes warning call
> +		 * traces when 'dma_fence_set_error' is applied to a job
> +		 * whose fence has already been signaled.
> +		 */
> +		if ((s_fence->finished.error < 0)
> +		    && (s_job->node.next != &(s_job->node)))
> +			list_del_init(&s_job->node);

That looks strongly incorrect as well. Can you describe more specifically what you are trying to do here?

Thanks,
Christian.

>  	}
>  	drm_sched_start_timeout(sched);
>  	spin_unlock(&sched->job_list_lock);

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx