Hi Andrey,

I am currently busy with another issue; I will try to get to this next week.

Best wishes
Emily Deng

>-----Original Message-----
>From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>Sent: Friday, November 15, 2019 6:14 AM
>To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Deng, Emily
><Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>Attached.
>
>Emily - can you give it a try ?
>
>Andrey
>
>On 11/14/19 3:12 AM, Christian König wrote:
>>> What about instead of peeking at the job to actually remove it from
>>> ring_mirror_list right there,
>> Also an interesting idea. We would need to protect the mirror list
>> with a lock again, but that should be the lesser evil.
>>
>> Maybe prototype that and see if it works or not.
>>
>> Regards,
>> Christian.
>>
>> Am 13.11.19 um 17:00 schrieb Andrey Grodzovsky:
>>>
>>> On 11/13/19 9:20 AM, Christian König wrote:
>>>> Another more fundamental question: Could we get rid of the timeout
>>>> job at all?
>>>
>>> There is other stuff there besides picking the first unfinished job
>>> which is common for all the drivers - such as freeing the guilty signaled
>>> job and rearming the timeout work timer.
>>>
>>>> I mean we used to give this as parameter to the scheduler callback
>>>> because we had the timeout worker in the job, but that is no longer
>>>> the case.
>>>>
>>>> E.g. in drm_sched_job_timedout() we do the following:
>>>>> job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>>> struct drm_sched_job, node);
>>>>
>>>> Why don't we just remove that here and only get the first job after
>>>> we have stopped the scheduler?
>>>
>>> Should be ok since we have the extra check for __kthread_should_park
>>> in drm_sched_cleanup_jobs which will protect us in this case from a
>>> wakeup of the sched thread and execution of drm_sched_cleanup_jobs
>>> after we already parked it. The problem here is we need the
>>> drm_sched_job to access the private data for each client driver (see
>>> amdgpu_job_timedout for example). What about instead of peeking at
>>> the job to actually remove it from ring_mirror_list right there, go
>>> ahead with it through the reset routine; if it's signaled in the
>>> meanwhile, great - release it, otherwise put it back into
>>> ring_mirror_list in drm_sched_resubmit_jobs.
>>>
>>> Andrey
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 13.11.19 um 15:12 schrieb Andrey Grodzovsky:
>>>>>
>>>>> This is why I asked for a trace with the timer enabled, but since there is
>>>>> a finite number of places we touch the timer, Emily can just put
>>>>> prints there. Also, I wonder if this temporary fix helps her with the
>>>>> issue or not.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 11/13/19 2:36 AM, Christian König wrote:
>>>>>> The question is where do we rearm the timer for this problem to
>>>>>> occur?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 12.11.19 um 20:21 schrieb Andrey Grodzovsky:
>>>>>>>
>>>>>>> I was able to reproduce the crash by using the attached
>>>>>>> simulate_crash.patch - waiting on the guilty job to signal in the reset
>>>>>>> work and artificially rearming the timeout timer just before the
>>>>>>> check for !cancel_delayed_work(&sched->work_tdr) in
>>>>>>> drm_sched_cleanup_jobs - crash log attached in crash.log. This I
>>>>>>> think confirms the theory I described earlier in this thread.
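For reference, the check being raced here sits at the top of drm_sched_cleanup_jobs(). The sketch below is reconstructed from memory of the v5.4-era scheduler, so treat it as illustrative rather than an exact quote of any branch; the added comments mark the window described above, where a timeout handler that is already executing - or that has just re-armed work_tdr - lets cancel_delayed_work() succeed and job freeing proceed concurrently with the GPU reset:

static void drm_sched_cleanup_jobs(struct drm_gpu_scheduler *sched)
{
	unsigned long flags;

	/*
	 * Don't destroy jobs while the timeout worker is running OR thread
	 * is being parked and hence assumed to not touch ring_mirror_list
	 */
	if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	     !cancel_delayed_work(&sched->work_tdr)) ||
	    __kthread_should_park(sched->thread))
		return;

	/*
	 * RACE: cancel_delayed_work() only removes a *pending* work item.
	 * If the timeout handler is already executing and work_tdr has been
	 * re-armed (artificially in simulate_crash.patch, or for real via
	 * drm_sched_fault or the next job's timeout), the cancel above
	 * succeeds and we keep going, freeing jobs the reset path is still
	 * looking at.
	 */
	while (!list_empty(&sched->ring_mirror_list)) {
		struct drm_sched_job *job;

		job = list_first_entry(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
		if (!dma_fence_is_signaled(&job->s_fence->finished))
			break;

		spin_lock_irqsave(&sched->job_list_lock, flags);
		/* remove the job from ring_mirror_list */
		list_del_init(&job->node);
		spin_unlock_irqrestore(&sched->job_list_lock, flags);

		sched->ops->free_job(job);
	}

	/* queue timeout for the next job */
	spin_lock_irqsave(&sched->job_list_lock, flags);
	drm_sched_start_timeout(sched);
	spin_unlock_irqrestore(&sched->job_list_lock, flags);
}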
>>>>>>> >>>>>>> basic_fix.patch handles this by testing whether another timer >>>>>>> already armed ob this scheduler or is there a timeout work in >>>>>>> execution right now (see documentation for work_busy) - obviously >>>>>>> this is not a full solution as this will not protect from races >>>>>>> if for example there is immediate work scheduling such as in >>>>>>> drm_sched_fault - so we probably need to account for this by >>>>>>> making drm_sched_cleanup_jobs (at least in the part where it >>>>>>> iterates ring mirror list and frees jobs) and GPU reset really >>>>>>> mutually exclusive and not like now. >>>>>>> >>>>>>> Andrey >>>>>>> >>>>>>> >>>>>>> On 11/11/19 4:11 PM, Christian König wrote: >>>>>>>> Hi Emily, >>>>>>>> >>>>>>>> you need to print which scheduler instance is freeing the jobs >>>>>>>> and which one is triggering the reset. The TID and PID is >>>>>>>> completely meaningless here since we are called from different >>>>>>>> worker threads and the TID/PID can change on each call. >>>>>>>> >>>>>>>> Apart from that I will look into this a bit deeper when I have >>>>>>>> time. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Christian. >>>>>>>> >>>>>>>> Am 12.11.19 um 07:02 schrieb Deng, Emily: >>>>>>>>> Hi Christian, >>>>>>>>> I add the follow print in function drm_sched_cleanup_jobs. >>>>>>>>> From the log it shows that only use cancel_delayed_work could >>>>>>>>> not avoid to free job when the sched is in reset. But don’t >>>>>>>>> know exactly where it is wrong about the driver. Do you have >>>>>>>>> any suggestion about this? >>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu, >>>>>>>>> pid:%lu\n", current->tgid, current->pid); >>>>>>>>> /* >>>>>>>>> * Don't destroy jobs while the timeout worker is >>>>>>>>> running OR thread >>>>>>>>> * is being parked and hence assumed to not touch >>>>>>>>> ring_mirror_list >>>>>>>>> */ >>>>>>>>> if ((sched->timeout != MAX_SCHEDULE_TIMEOUT && >>>>>>>>> !cancel_delayed_work(&sched->work_tdr))) >>>>>>>>> return; >>>>>>>>> + printk("Emily:drm_sched_cleanup_jobs,tid:%lu, pid:%lu\n", >>>>>>>>> current->tgid, current->pid); >>>>>>>>> Best wishes >>>>>>>>> Emily Deng >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262, >>>>>>>>> pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262, >>>>>>>>> pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262, >>>>>>>>> pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* >ring >>>>>>>>> sdma0 timeout, signaled seq=78585, emitted seq=78587 Nov 12 >>>>>>>>> 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* >>>>>>>>> Process information: process pid 0 thread pid 0, >>>>>>>>> s_job:00000000fe75ab36,tid=15603, pid=15603 Nov 12 12:58:20 >>>>>>>>> ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225413] amdgpu 0000:00:08.0: GPU reset begin! 
>>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262, >>>>>>>>> pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262, >>>>>>>>> pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225428] Emily:amdgpu_job_free_cb,Process information: >>>>>>>>> process pid 0 thread pid 0, s_job:00000000fe75ab36, tid:2262, >>>>>>>>> pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262, >>>>>>>>> pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253, >>>>>>>>> pid:2253 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262, >>>>>>>>> pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262 >>>>>>>>> Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: >>>>>>>>> [11381.225494] Emily:amdgpu_job_free_cb,Process information: >>>>>>>>> process pid 0 thread pid 0, s_job:00000000f086ec84, tid:2262, >>>>>>>>> pid:2262 >>>>>>>>> >-----Original Message----- >>>>>>>>> >From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx> >>>>>>>>> >Sent: Tuesday, November 12, 2019 11:28 AM >>>>>>>>> >To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Deng, Emily >>>>>>>>> ><Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue >>>>>>>>> for tdr >>>>>>>>> > >>>>>>>>> >Thinking more about this claim - we assume here that if >>>>>>>>> cancel_delayed_work >>>>>>>>> >returned true it guarantees that timeout work is not running >>>>>>>>> but, it merely >>>>>>>>> >means there was a pending timeout work which was removed >from >>>>>>>>> >the workqueue before it's timer elapsed and so it didn't have >>>>>>>>> >a >>>>>>>>> chance to be >>>>>>>>> >dequeued and executed, it doesn't cover already executing >>>>>>>>> work. So there is a >>>>>>>>> >possibility where while timeout work started executing another >>>>>>>>> timeout work >>>>>>>>> >already got enqueued (maybe through earlier cleanup jobs or >>>>>>>>> through >>>>>>>>> >drm_sched_fault) and if at this point another >>>>>>>>> drm_sched_cleanup_jobs runs >>>>>>>>> >cancel_delayed_work(&sched->work_tdr) will return true even >>>>>>>>> while there is a >>>>>>>>> >timeout job in progress. >>>>>>>>> >Unfortunately we cannot change cancel_delayed_work to >>>>>>>>> >cancel_delayed_work_sync to flush the timeout work as timeout >>>>>>>>> work itself >>>>>>>>> >waits for schedule thread to be parked again when calling >>>>>>>>> park_thread. 
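A minimal sketch of the work_busy() based check mentioned above as the basic_fix.patch approach (this is not the attached patch, and the helper name below is made up). Unlike cancel_delayed_work(), work_busy() also reports a handler that is currently executing; its kerneldoc notes the result is unsynchronized and only advisory, which matches the caveat that this narrows the race window rather than closing it:

/* Hypothetical helper, sketching the described idea, not the attached patch. */
static bool drm_sched_tdr_idle(struct drm_gpu_scheduler *sched)
{
	if (sched->timeout == MAX_SCHEDULE_TIMEOUT)
		return true;

	/*
	 * WORK_BUSY_PENDING: another timeout is already armed on this
	 * scheduler.  WORK_BUSY_RUNNING: a timeout handler is executing
	 * right now.  In either case job cleanup must not free jobs.
	 */
	return !(work_busy(&sched->work_tdr.work) &
		 (WORK_BUSY_PENDING | WORK_BUSY_RUNNING));
}

drm_sched_cleanup_jobs() would then return early whenever this reports the TDR machinery as busy, instead of relying on cancel_delayed_work() alone.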
>>>>>>>>> > >>>>>>>>> >Andrey >>>>>>>>> > >>>>>>>>> >________________________________________ >>>>>>>>> >From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on >>>>>>>>> behalf of >>>>>>>>> >Koenig, Christian <Christian.Koenig@xxxxxxx> >>>>>>>>> >Sent: 08 November 2019 05:35:18 >>>>>>>>> >To: Deng, Emily; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> >Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue >>>>>>>>> for tdr >>>>>>>>> > >>>>>>>>> >Hi Emily, >>>>>>>>> > >>>>>>>>> >exactly that can't happen. See here: >>>>>>>>> > >>>>>>>>> >> /* Don't destroy jobs while the timeout worker is >>>>>>>>> running */ >>>>>>>>> >> if (sched->timeout != MAX_SCHEDULE_TIMEOUT && >>>>>>>>> >> !cancel_delayed_work(&sched->work_tdr)) >>>>>>>>> >> return NULL; >>>>>>>>> > >>>>>>>>> >We never free jobs while the timeout working is running to >>>>>>>>> prevent exactly >>>>>>>>> >that issue. >>>>>>>>> > >>>>>>>>> >Regards, >>>>>>>>> >Christian. >>>>>>>>> > >>>>>>>>> >Am 08.11.19 um 11:32 schrieb Deng, Emily: >>>>>>>>> >> Hi Christian, >>>>>>>>> >> The drm_sched_job_timedout-> amdgpu_job_timedout call >>>>>>>>> >amdgpu_device_gpu_recover. I mean the main scheduler free the >>>>>>>>> jobs while >>>>>>>>> >in amdgpu_device_gpu_recover, and before calling >drm_sched_stop. >>>>>>>>> >> >>>>>>>>> >> Best wishes >>>>>>>>> >> Emily Deng >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >>> -----Original Message----- >>>>>>>>> >>> From: Koenig, Christian <Christian.Koenig@xxxxxxx> >>>>>>>>> >>> Sent: Friday, November 8, 2019 6:26 PM >>>>>>>>> >>> To: Deng, Emily <Emily.Deng@xxxxxxx>; >>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> >>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue >>>>>>>>> for tdr >>>>>>>>> >>> >>>>>>>>> >>> Hi Emily, >>>>>>>>> >>> >>>>>>>>> >>> well who is calling amdgpu_device_gpu_recover() in this case? >>>>>>>>> >>> >>>>>>>>> >>> When it's not the scheduler we shouldn't have a guilty job >>>>>>>>> in the first place. >>>>>>>>> >>> >>>>>>>>> >>> Regards, >>>>>>>>> >>> Christian. >>>>>>>>> >>> >>>>>>>>> >>> Am 08.11.19 um 11:22 schrieb Deng, Emily: >>>>>>>>> >>>> Hi Chrisitan, >>>>>>>>> >>>> No, I am with the new branch and also has the >>>>>>>>> patch. Even it >>>>>>>>> >>>> are freed by >>>>>>>>> >>> main scheduler, how we could avoid main scheduler to free >>>>>>>>> jobs while >>>>>>>>> >>> enter to function amdgpu_device_gpu_recover? >>>>>>>>> >>>> Best wishes >>>>>>>>> >>>> Emily Deng >>>>>>>>> >>>> >>>>>>>>> >>>> >>>>>>>>> >>>> >>>>>>>>> >>>>> -----Original Message----- >>>>>>>>> >>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx> >>>>>>>>> >>>>> Sent: Friday, November 8, 2019 6:15 PM >>>>>>>>> >>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd- >>>>>>>>> >gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> >>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer >>>>>>>>> issue for tdr >>>>>>>>> >>>>> >>>>>>>>> >>>>> Hi Emily, >>>>>>>>> >>>>> >>>>>>>>> >>>>> in this case you are on an old code branch. >>>>>>>>> >>>>> >>>>>>>>> >>>>> Jobs are freed now by the main scheduler thread and only >>>>>>>>> if no >>>>>>>>> >>>>> timeout handler is running. >>>>>>>>> >>>>> >>>>>>>>> >>>>> See this patch here: >>>>>>>>> >>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e >>>>>>>>> >>>>>> Author: Christian König <christian.koenig@xxxxxxx> >>>>>>>>> >>>>>> Date: Thu Apr 18 11:00:21 2019 -0400 >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> drm/scheduler: rework job destruction >>>>>>>>> >>>>> Regards, >>>>>>>>> >>>>> Christian. 
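Tying this back to Andrey's proposal further up the thread (which the patch he attached on November 15 presumably implements in some form): have the timeout handler take the bad job off ring_mirror_list before the reset starts, so the cleanup path simply never sees it, and either release it or splice it back in drm_sched_resubmit_jobs() once the reset is done. A rough sketch of that direction only - not the attached patch, and the locking is exactly the point Christian raises about needing to protect the mirror list again:

static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job;
	unsigned long flags;

	spin_lock_irqsave(&sched->job_list_lock, flags);
	job = list_first_entry_or_null(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
	if (job)
		/*
		 * Detach it: the reset path now owns this job, and a
		 * concurrent drm_sched_cleanup_jobs() can no longer free
		 * it underneath us.
		 */
		list_del_init(&job->node);
	spin_unlock_irqrestore(&sched->job_list_lock, flags);

	if (job)
		job->sched->ops->timedout_job(job);

	/*
	 * drm_sched_stop()/drm_sched_resubmit_jobs() would then either free
	 * the job (if its hardware fence signaled in the meantime) or put it
	 * back onto ring_mirror_list before the scheduler is restarted.
	 */
}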
>>>>>>>>> >>>>> >>>>>>>>> >>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily: >>>>>>>>> >>>>>> Hi Christian, >>>>>>>>> >>>>>> Please refer to follow log, when it enter to >>>>>>>>> >>>>>>amdgpu_device_gpu_recover >>>>>>>>> >>>>> function, the bad job 000000005086879e is freeing in >>>>>>>>> function >>>>>>>>> >>>>> amdgpu_job_free_cb at the same time, because of the >>>>>>>>> hardware fence >>>>>>>>> >>> signal. >>>>>>>>> >>>>> But amdgpu_device_gpu_recover goes faster, at this case, >>>>>>>>> >>>>> the s_fence is already freed, but job is not freed in time. >>>>>>>>> Then this issue >>>>>>>>> >occurs. >>>>>>>>> >>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] >>>>>>>>> *ERROR* ring >>>>>>>>> >>> sdma0 >>>>>>>>> >>>>>> timeout, signaled seq=2481, emitted seq=2483 [ >>>>>>>>> >>>>>> 449.793202] [drm:amdgpu_job_timedout [amdgpu]] >*ERROR* >>>>>>>>> >>>>>> Process >>>>>>>>> information: >>>>>>>>> >>>>> process pid 0 thread pid 0, s_job:000000005086879e [ >>>>>>>>> 449.794163] >>>>>>>>> >>>>> amdgpu >>>>>>>>> >>>>> 0000:00:08.0: GPU reset begin! >>>>>>>>> >>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process >>>>>>>>> information: >>>>>>>>> >>>>>> process pid 0 thread pid 0, s_job:000000005086879e [ >>>>>>>>> 449.794221] >>>>>>>>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process >>>>>>>>> pid 0 >>>>>>>>> >>>>>> thread pid 0, s_job:0000000066eb74ab [ 449.794222] >>>>>>>>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process >>>>>>>>> pid 0 >>>>>>>>> >>>>>> thread pid 0, s_job:00000000d4438ad9 [ 449.794255] >>>>>>>>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process >>>>>>>>> pid 0 >>>>>>>>> >>>>>> thread pid 0, s_job:00000000b6d69c65 [ 449.794257] >>>>>>>>> >>>>>> Emily:amdgpu_job_free_cb,Process information: process >>>>>>>>> pid 0 >>>>>>>>> >>>>>> thread pid 0, >>>>>>>>> >>>>> s_job:00000000ea85e922 [ 449.794287] >>>>>>>>> >>>>> Emily:amdgpu_job_free_cb,Process >>>>>>>>> >>>>> information: process pid 0 thread pid 0, >>>>>>>>> s_job:00000000ed3a5ac6 [ >>>>>>>>> >>>>> 449.794366] BUG: unable to handle kernel NULL pointer >>>>>>>>> dereference >>>>>>>>> >>>>> at >>>>>>>>> >>>>> 00000000000000c0 [ 449.800818] PGD 0 P4D 0 >[ 449.801040] >>>>>>>>> Oops: >>>>>>>>> >>>>> 0000 [#1] SMP PTI >>>>>>>>> >>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: >>>>>>>>> G OE >>>>>>>>> >>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu >>>>>>>>> >>>>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX >+ >>>>>>>>> PIIX, >>>>>>>>> >>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ >>>>>>>>> >>>>>> 449.802944] >>>>>>>>> >>>>>> Workqueue: events drm_sched_job_timedout [amd_sched] >[ >>>>>>>>> >>>>>> 449.803488] >>>>>>>>> >>> RIP: >>>>>>>>> >>>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu] >>>>>>>>> >>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 >>>>>>>>> 56 ff ff >>>>>>>>> >>>>>> ff >>>>>>>>> >>>>>> 45 85 e4 0f >>>>>>>>> >>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 >>>>>>>>> 8b 40 10 >>>>>>>>> >>>>> <48> 8b >>>>>>>>> >>> 98 >>>>>>>>> >>>>> c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 >>>>>>>>> 48 a8 01 >>>>>>>>> >>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: >>>>>>>>> 00010286 [ >>>>>>>>> >>>>>> 449.806032] RAX: 0000000000000000 RBX: >0000000000000000 >>>>>>>>> RCX: >>>>>>>>> >>>>>> 0000000000000000 [ 449.806625] RDX: ffffb4c7c08f5ac0 >RSI: >>>>>>>>> >>>>>> 0000000fffffffe0 RDI: 0000000000000246 [ 449.807224] >RBP: >>>>>>>>> >>>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: >>>>>>>>> 0000000000000000 [ >>>>>>>>> 
>>>>>> 449.807818] R10: 0000000000000000 R11: >0000000000000148 >>>>>>>>> R12: >>>>>>>>> >>>>>> 0000000000000000 [ 449.808411] R13: ffffb4c7c08f7da0 >R14: >>>>>>>>> >>>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [ 449.809004] FS: >>>>>>>>> >>>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000) >>>>>>>>> >>>>>> knlGS:0000000000000000 [ 449.809674] CS: 0010 DS: 0000 >>>>>>>>> ES: 0000 >>>>>>>>> >CR0: >>>>>>>>> >>>>>> 0000000080050033 [ 449.810153] CR2: 00000000000000c0 >CR3: >>>>>>>>> >>>>>> 000000003cc0a001 CR4: 00000000003606e0 [ 449.810747] >DR0: >>>>>>>>> >>>>> 0000000000000000 DR1: 0000000000000000 DR2: >>>>>>>>> 0000000000000000 [ >>>>>>>>> >>>>> 449.811344] DR3: 0000000000000000 DR6: >00000000fffe0ff0 DR7: >>>>>>>>> >>>>> 0000000000000400 [ 449.811937] Call Trace: >>>>>>>>> >>>>>> [ 449.812206] amdgpu_job_timedout+0x114/0x140 >[amdgpu] >>>>>>>>> >>>>>> [ 449.812635] drm_sched_job_timedout+0x44/0x90 >>>>>>>>> >>>>>> [amd_sched] [ 449.813139] ? >>>>>>>>> >>>>>> amdgpu_cgs_destroy_device+0x10/0x10 >>>>>>>>> [amdgpu] [ >>>>>>>>> >>>>>> 449.813609] ? drm_sched_job_timedout+0x44/0x90 >>>>>>>>> [amd_sched] [ >>>>>>>>> >>>>>> 449.814077] process_one_work+0x1fd/0x3f0 [ 449.814417] >>>>>>>>> >>>>>> worker_thread+0x34/0x410 [ 449.814728] >>>>>>>>> kthread+0x121/0x140 [ >>>>>>>>> >>>>>> 449.815004] ? process_one_work+0x3f0/0x3f0 [ >>>>>>>>> 449.815374] ? >>>>>>>>> >>>>>> kthread_create_worker_on_cpu+0x70/0x70 >>>>>>>>> >>>>>> [ 449.815799] ret_from_fork+0x35/0x40 >>>>>>>>> >>>>>> >>>>>>>>> >>>>>>> -----Original Message----- >>>>>>>>> >>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx> >>>>>>>>> >>>>>>> Sent: Friday, November 8, 2019 5:43 PM >>>>>>>>> >>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd- >>>>>>>>> >>> gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> >>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer >>>>>>>>> issue for >>>>>>>>> >>>>>>> tdr >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily: >>>>>>>>> >>>>>>>> Sorry, please take your time. >>>>>>>>> >>>>>>> Have you seen my other response a bit below? >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>> I can't follow how it would be possible for >>>>>>>>> job->s_fence to be >>>>>>>>> >>>>>>> NULL without the job also being freed. >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>> So it looks like this patch is just papering over some >>>>>>>>> bigger issues. >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>> Regards, >>>>>>>>> >>>>>>> Christian. >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>>> Best wishes >>>>>>>>> >>>>>>>> Emily Deng >>>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> >>>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx> >>>>>>>>> >>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM >>>>>>>>> >>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd- >>>>>>>>> >>>>> gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> >>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer >>>>>>>>> issue for >>>>>>>>> >>>>>>>>> tdr >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily: >>>>>>>>> >>>>>>>>>> Ping..... 
>>>>>>>>> >>>>>>>>> You need to give me at least enough time to wake up >>>>>>>>> >>>>>>>>> :) >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Best wishes >>>>>>>>> >>>>>>>>>> Emily Deng >>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>>>> -----Original Message----- >>>>>>>>> >>>>>>>>>>> From: amd-gfx >>>>>>>>> <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On >>>>>>>>> >>> Behalf >>>>>>>>> >>>>>>>>>>> Of Deng, Emily >>>>>>>>> >>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM >>>>>>>>> >>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; >>>>>>>>> >>>>>>>>>>> amd- gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> >>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null >>>>>>>>> pointer issue >>>>>>>>> >>>>>>>>>>> for tdr >>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>> -----Original Message----- >>>>>>>>> >>>>>>>>>>>> From: Christian König >>>>>>>>> <ckoenig.leichtzumerken@xxxxxxxxx> >>>>>>>>> >>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM >>>>>>>>> >>>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; >>>>>>>>> >>>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> >>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null >>>>>>>>> pointer issue >>>>>>>>> >>>>>>>>>>>> for tdr >>>>>>>>> >>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng: >>>>>>>>> >>>>>>>>>>>>> When the job is already signaled, the s_fence is >>>>>>>>> freed. >>>>>>>>> >>>>>>>>>>>>> Then it will has null pointer in >>>>>>>>> amdgpu_device_gpu_recover. >>>>>>>>> >>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job >>>>>>>>> is destroyed. >>>>>>>>> >>>>>>>>>>>> See drm_sched_job_cleanup(). >>>>>>>>> >>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. >>>>>>>>> But in one >>>>>>>>> >>>>>>>>>>> case, when it enter into the >>>>>>>>> amdgpu_device_gpu_recover, it >>>>>>>>> >>>>>>>>>>> already in drm_sched_job_cleanup, and at this time, >>>>>>>>> it will >>>>>>>>> >>>>>>>>>>> go to free >>>>>>>>> >>>>> job. >>>>>>>>> >>>>>>>>>>> But the amdgpu_device_gpu_recover sometimes is >>>>>>>>> faster. At >>>>>>>>> >>>>>>>>>>> that time, job is not freed, but s_fence is already >>>>>>>>> NULL. >>>>>>>>> >>>>>>>>> No, that case can't happen. See here: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> drm_sched_job_cleanup(s_job); >>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority); >>>>>>>>> >>>>>>>>>> dma_fence_put(job->fence); >>>>>>>>> >>>>>>>>>> amdgpu_sync_free(&job->sync); >>>>>>>>> >>>>>>>>>> amdgpu_sync_free(&job->sched_sync); >>>>>>>>> >>>>>>>>>> kfree(job); >>>>>>>>> >>>>>>>>> The job itself is freed up directly after freeing the >>>>>>>>> reference >>>>>>>>> >>>>>>>>> to the >>>>>>>>> >>>>> s_fence. >>>>>>>>> >>>>>>>>> So you are just papering over a much bigger problem >>>>>>>>> here. This >>>>>>>>> >>>>>>>>> patch is a clear NAK. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> >>>>>>>>> Christian. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>>>> When you see a job without an s_fence then that >>>>>>>>> means the >>>>>>>>> >>>>>>>>>>>> problem is somewhere else. >>>>>>>>> >>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>> >>>>>>>>>>>> Christian. 
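For context on Christian's point here: drm_sched_job_cleanup() is where the s_fence reference is dropped and the pointer cleared, and in amdgpu_job_free_cb() (quoted above) kfree(job) follows right after it. So by the time another thread can observe s_fence == NULL, the job itself is already being torn down, which is why a NULL check in amdgpu_device_gpu_recover() only shrinks the use-after-free window. Reconstructed from memory of the scheduler code of that era, so treat it as a sketch rather than an exact quote:

void drm_sched_job_cleanup(struct drm_sched_job *job)
{
	/*
	 * Drop the scheduler fence and clear the pointer; in the amdgpu
	 * case the caller (amdgpu_job_free_cb) frees the job itself right
	 * after this returns.
	 */
	dma_fence_put(&job->s_fence->finished);
	job->s_fence = NULL;
}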
>>>>>>>>> >>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx> >>>>>>>>> >>>>>>>>>>>>> --- >>>>>>>>> >>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >| 2 >>>>>>>>> >>>>>>>>>>>>> +- drivers/gpu/drm/scheduler/sched_main.c | >>>>>>>>> >>>>>>>>>>>>> 11 >>>>>>>>> ++++++--- >>>>>>>>> >-- >>>>>>>>> >>>>>>>>>>>>> 2 files changed, 7 insertions(+), 6 >>>>>>>>> deletions(-) >>>>>>>>> >>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>> diff --git >>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>> >>>>>>>>>>>>> >b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>> >>>>>>>>>>>>> index e6ce949..5a8f08e 100644 >>>>>>>>> >>>>>>>>>>>>> --- >a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>> >>>>>>>>>>>>> +++ >b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>> >>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int >>>>>>>>> >>> amdgpu_device_gpu_recover(struct >>>>>>>>> >>>>>>>>>>>> amdgpu_device *adev, >>>>>>>>> >>>>>>>>>>>>> * >>>>>>>>> >>>>>>>>>>>>> * job->base holds a reference to >>>>>>>>> parent fence >>>>>>>>> >>>>>>>>>>>>> */ >>>>>>>>> >>>>>>>>>>>>> - if (job && job->base.s_fence->parent && >>>>>>>>> >>>>>>>>>>>>> + if (job && job->base.s_fence && >>>>>>>>> >>>>>>>>>>>>> + job->base.s_fence->parent >>>>>>>>> >>>>>>> && >>>>>>>>> >>>>>>>>>>>>> dma_fence_is_signaled(job->base.s_fence- >>parent)) >>>>>>>>> >>>>>>>>>>>>> job_signaled = true; >>>>>>>>> >>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>> diff --git >>>>>>>>> >>>>>>>>>>>>> a/drivers/gpu/drm/scheduler/sched_main.c >>>>>>>>> >>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c >>>>>>>>> >>>>>>>>>>>>> index 31809ca..56cc10e 100644 >>>>>>>>> >>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>>>>>>>> >>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>>>>>>>> >>>>>>>>>>>>> @@ -334,8 +334,8 @@ void >>>>>>>>> >drm_sched_increase_karma(struct >>>>>>>>> >>>>>>>>>>>> drm_sched_job >>>>>>>>> >>>>>>>>>>>>> *bad) >>>>>>>>> >>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>> spin_lock(&rq->lock); >>>>>>>>> >>>>>>>>>>>>> list_for_each_entry_safe(entity, tmp, >>>>>>>>> >>> &rq- >>>>>>>>> >>>>>>>> entities, >>>>>>>>> >>>>>>>>>>>> list) { >>>>>>>>> >>>>>>>>>>>>> - if >>>>>>>>> (bad->s_fence->scheduled.context >>>>>>>>> >>>>>>> == >>>>>>>>> >>>>>>>>>>>>> - entity->fence_context) { >>>>>>>>> >>>>>>>>>>>>> + if (bad->s_fence && >>>>>>>>> >>>>>>>>>>>>> + (bad->s_fence- >>>>>>>>> >>>>>>>>>>>>> scheduled.context == >>>>>>>>> >>>>>>>>>>>>> + entity->fence_context)) { >>>>>>>>> >>>>>>>>>>>>> if >>>>>>>>> >>>>>>>>>>>>> (atomic_read(&bad- >>>>>>>>> >>>>>>>> karma) > >>>>>>>>> >>>>>>>>>>>>> bad->sched- >>>>>>>>> >>>> hang_limit) >>>>>>>>> >>>>>>>>>>>>> if >>>>>>>>> >>>>>>>>>>>>> (entity- >>>>>>>>> >>>> guilty) @@ -376,7 +376,7 @@ void >>>>>>>>> >>>>>>>>>>>>> drm_sched_stop(struct >>>>>>>>> >>>>>>> drm_gpu_scheduler >>>>>>>>> >>>>>>>>>>>> *sched, struct drm_sched_job *bad) >>>>>>>>> >>>>>>>>>>>>> * This iteration is thread safe as >>>>>>>>> sched thread >>>>>>>>> >>>>>>>>>>>>> is >>>>>>>>> >>> stopped. 
>>>>>>>>> >>>>>>>>>>>>> */ >>>>>>>>> >>>>>>>>>>>>> list_for_each_entry_safe_reverse(s_job, tmp, >>>>>>>>> >>>>>>>>>>>>> &sched- ring_mirror_list, node) { >>>>>>>>> >>>>>>>>>>>>> - if (s_job->s_fence->parent && >>>>>>>>> >>>>>>>>>>>>> + if (s_job->s_fence && >>>>>>>>> s_job->s_fence->parent && >>>>>>>>> >>>>>>>>>>>>> dma_fence_remove_callback(s_job- >>>>>>>>> >>>> s_fence- >>>>>>>>> >>>>>>>> parent, >>>>>>>>> >>>>>>>>>>>>> &s_job->cb)) { >>>>>>>>> >>>>>>>>>>>>> atomic_dec(&sched->hw_rq_count); >>>>>>>>> >>> @@ - >>>>>>>>> >>>>>>> 395,7 >>>>>>>>> >>>>>>>>>>> +395,8 @@ void >>>>>>>>> >>>>>>>>>>>>> drm_sched_stop(struct drm_gpu_scheduler >>>>>>>>> >>>>>>>>>>>> *sched, struct drm_sched_job *bad) >>>>>>>>> >>>>>>>>>>>>> * >>>>>>>>> >>>>>>>>>>>>> * Job is still alive so fence refcount at >>>>>>>>> >>> least 1 >>>>>>>>> >>>>>>>>>>>>> */ >>>>>>>>> >>>>>>>>>>>>> - dma_fence_wait(&s_job->s_fence->finished, >>>>>>>>> >>>>>>> false); >>>>>>>>> >>>>>>>>>>>>> + if (s_job->s_fence) >>>>>>>>> >>>>>>>>>>>>> + dma_fence_wait(&s_job->s_fence- >>>>>>>>> >>>>>>>> finished, >>>>>>>>> >>>>>>>>>>>> false); >>>>>>>>> >>>>>>>>>>>>> /* >>>>>>>>> >>>>>>>>>>>>> * We must keep bad job alive for later >>>>>>>>> >>> use >>>>>>>>> >>>>>>> during @@ >>>>>>>>> >>>>>>>>>>>> -438,7 >>>>>>>>> >>>>>>>>>>>>> +439,7 @@ void drm_sched_start(struct >>>>>>>>> drm_gpu_scheduler >>>>>>>>> >>>>> *sched, >>>>>>>>> >>>>>>>>>>>>> +bool >>>>>>>>> >>>>>>>>>>>> full_recovery) >>>>>>>>> >>>>>>>>>>>>> * GPU recovers can't run in parallel. >>>>>>>>> >>>>>>>>>>>>> */ >>>>>>>>> >>>>>>>>>>>>> list_for_each_entry_safe(s_job, tmp, >>>>>>>>> >>>>>>>>>>>>>&sched->ring_mirror_list, >>>>>>>>> >>>>>>>>>>>>> node) >>>>>>>>> >>>>>>>>>>>> { >>>>>>>>> >>>>>>>>>>>>> - struct dma_fence *fence = >>>>>>>>> s_job->s_fence->parent; >>>>>>>>> >>>>>>>>>>>>> + struct dma_fence *fence = >>>>>>>>> s_job->s_fence ? >>>>>>>>> >>>>>>>>>>>>> + s_job- >>>>>>>>> >>>>>>>> s_fence- >>>>>>>>> >>>>>>>>>>>>> parent : >>>>>>>>> >>>>>>>>>>>>> +NULL; >>>>>>>>> >>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>> atomic_inc(&sched->hw_rq_count); >>>>>>>>> >>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>> >_______________________________________________ >>>>>>>>> >>>>>>>>>>> amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> >>>>>>>>>>> >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>>> <https://lists.freedesktop.org/mailman/listinfo/amd-gfx> >>>>>>>>> > >>>>>>>>> >_______________________________________________ >>>>>>>>> >amd-gfx mailing list >>>>>>>>> >amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> >https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> amd-gfx mailing list >>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>> >>>> >>> >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx