Hi Christian,
I added the following print in the function drm_sched_cleanup_jobs.
From the log it shows that using cancel_delayed_work alone cannot
prevent the job from being freed while the scheduler is in reset. But I
don't know exactly where the driver goes wrong. Do you have any
suggestion about this?
+ printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu,
pid:%lu\n", current->tgid, current->pid);
/*
* Don't destroy jobs while the timeout worker is
running OR thread
* is being parked and hence assumed to not touch
ring_mirror_list
*/
if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
!cancel_delayed_work(&sched->work_tdr)))
return;
+ printk("Emily:drm_sched_cleanup_jobs,tid:%lu, pid:%lu\n",
current->tgid, current->pid);
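In case it helps narrow this down, the same check could also log whether the early return was taken and why; here is an untested sketch of the same condition, just reshuffled so the result of cancel_delayed_work() shows up in the log (the local variable name is made up):

+	bool cancelled;
+
+	cancelled = sched->timeout == MAX_SCHEDULE_TIMEOUT ||
+		    cancel_delayed_work(&sched->work_tdr);
+	printk("Emily:drm_sched_cleanup_jobs:begin,tid:%lu, pid:%lu, cancelled:%d\n",
+	       current->tgid, current->pid, cancelled);
+	if (!cancelled)
+		return;

This keeps the original short-circuit behaviour (cancel_delayed_work() is only called when a timeout is configured), but makes it visible whether the job-freeing path was entered because the timeout work was successfully cancelled.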
Best wishes
Emily Deng
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695091] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695104] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695105] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695107] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11380.695107] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.222954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=78585, emitted seq=78587
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.224275] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:00000000fe75ab36,tid=15603, pid=15603
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225413] amdgpu 0000:00:08.0: GPU reset begin!
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225417] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225425] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225425] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225428] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000fe75ab36, tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225429] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225430] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225473] Emily:drm_sched_cleanup_jobs:begin,tid:2253, pid:2253
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225486] Emily:drm_sched_cleanup_jobs:begin,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225489] Emily:drm_sched_cleanup_jobs,tid:2262, pid:2262
Nov 12 12:58:20 ubuntu-drop-August-2018-rc2-gpu0-vf02 kernel: [11381.225494] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000f086ec84, tid:2262, pid:2262
>-----Original Message-----
>From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>Sent: Tuesday, November 12, 2019 11:28 AM
>To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Deng, Emily
><Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>Thinking more about this claim - we assume here that if cancel_delayed_work
>returned true it guarantees that the timeout work is not running. But it merely
>means there was a pending timeout work which was removed from the
>workqueue before its timer elapsed, so it didn't have a chance to be
>dequeued and executed; it doesn't cover already executing work. So there is a
>possibility that while a timeout work started executing, another timeout work
>already got enqueued (maybe through earlier cleanup jobs or through
>drm_sched_fault), and if at this point another drm_sched_cleanup_jobs runs,
>cancel_delayed_work(&sched->work_tdr) will return true even while there is a
>timeout job in progress.
>Unfortunately we cannot change cancel_delayed_work to
>cancel_delayed_work_sync to flush the timeout work, as the timeout work itself
>waits for the scheduler thread to be parked again when calling park_thread.
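To convince myself of these cancel_delayed_work() semantics, the same window can be reproduced with a plain delayed work item, independent of the scheduler. A minimal, hypothetical module sketch (untested illustration only, names made up, not driver code):

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/delay.h>

static struct delayed_work tdr_like_work;

/* Handler that re-arms itself, the way the TDR work can be re-armed
 * while an earlier instance is still executing. */
static void tdr_like_func(struct work_struct *work)
{
	/* Re-queue: a new pending instance now exists while this
	 * instance keeps running. */
	schedule_delayed_work(&tdr_like_work, msecs_to_jiffies(100));

	/* Simulate a long-running timeout handler (GPU reset). */
	msleep(1000);
}

static int __init cancel_race_init(void)
{
	bool cancelled;

	INIT_DELAYED_WORK(&tdr_like_work, tdr_like_func);
	schedule_delayed_work(&tdr_like_work, 0);

	msleep(10);	/* let the first instance start executing */

	/* This only removes the re-armed, still-pending instance; it does
	 * not wait for the instance that is already running, yet it still
	 * returns true. */
	cancelled = cancel_delayed_work(&tdr_like_work);
	pr_info("cancel_delayed_work() = %d while the handler is still running\n",
		cancelled);
	return 0;
}

static void __exit cancel_race_exit(void)
{
	cancel_delayed_work_sync(&tdr_like_work);
}

module_init(cancel_race_init);
module_exit(cancel_race_exit);
MODULE_LICENSE("GPL");

If that reading is right, it matches the window described above: the check in drm_sched_cleanup_jobs can see cancel_delayed_work() return true while a timeout handler is still in flight.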
>
>Andrey
>
>________________________________________
>From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of
>Koenig, Christian <Christian.Koenig@xxxxxxx>
>Sent: 08 November 2019 05:35:18
>To: Deng, Emily; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>Hi Emily,
>
>exactly that can't happen. See here:
>
>>	/* Don't destroy jobs while the timeout worker is running */
>>	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>	    !cancel_delayed_work(&sched->work_tdr))
>>		return NULL;
>
>We never free jobs while the timeout worker is running to prevent exactly
>that issue.
>
>Regards,
>Christian.
>
>On 08.11.19 at 11:32, Deng, Emily wrote:
>> Hi Christian,
>> The drm_sched_job_timedout -> amdgpu_job_timedout path calls
>> amdgpu_device_gpu_recover. I mean the main scheduler frees the jobs while
>> we are in amdgpu_device_gpu_recover, and before calling drm_sched_stop.
>>
>> Best wishes
>> Emily Deng
>>
>>
>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>> Sent: Friday, November 8, 2019 6:26 PM
>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>> Hi Emily,
>>>
>>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>
>>> When it's not the scheduler we shouldn't have a guilty job in the first place.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 08.11.19 at 11:22, Deng, Emily wrote:
>>>> Hi Christian,
>>>> No, I am on the new branch and also have the patch. Even if they
>>>> are freed by the main scheduler, how could we avoid the main
>>>> scheduler freeing jobs while we are in the function
>>>> amdgpu_device_gpu_recover?
>>>> Best wishes
>>>> Emily Deng
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>
>>>>> Hi Emily,
>>>>>
>>>>> in this case you are on an old code branch.
>>>>>
>>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>> timeout handler is running.
>>>>>
>>>>> See this patch here:
>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>> Author: Christian König <christian.koenig@xxxxxxx>
>>>>>> Date: Thu Apr 18 11:00:21 2019 -0400
>>>>>>
>>>>>> drm/scheduler: rework job destruction
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> On 08.11.19 at 11:11, Deng, Emily wrote:
>>>>>> Hi Christian,
>>>>>> Please refer to the following log. When it enters the
>>>>>> amdgpu_device_gpu_recover function, the bad job 000000005086879e is
>>>>>> being freed in the function amdgpu_job_free_cb at the same time,
>>>>>> because the hardware fence signaled. But amdgpu_device_gpu_recover
>>>>>> goes faster; in this case, the s_fence is already freed, but the job
>>>>>> is not freed in time. Then this issue occurs.
>>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>>> [ 449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>> [ 449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>> [ 449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>>>>> [ 449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>>>>> [ 449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>>>>> [ 449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>>>>> [ 449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>>>>> [ 449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>>> [ 449.800818] PGD 0 P4D 0
>>>>>> [ 449.801040] Oops: 0000 [#1] SMP PTI
>>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>> [ 449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>> [ 449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>>> [ 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>> [ 449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>>> [ 449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>>> [ 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>>> [ 449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>>> [ 449.809004] FS: 0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>>> [ 449.809674] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [ 449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>>> [ 449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>> [ 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>> [ 449.811937] Call Trace:
>>>>>> [ 449.812206] amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>>> [ 449.812635] drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>> [ 449.813139] ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>>> [ 449.813609] ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>> [ 449.814077] process_one_work+0x1fd/0x3f0
>>>>>> [ 449.814417] worker_thread+0x34/0x410
>>>>>> [ 449.814728] kthread+0x121/0x140
>>>>>> [ 449.815004] ? process_one_work+0x3f0/0x3f0
>>>>>> [ 449.815374] ? kthread_create_worker_on_cpu+0x70/0x70
>>>>>> [ 449.815799] ret_from_fork+0x35/0x40
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>
>>>>>>> On 08.11.19 at 10:39, Deng, Emily wrote:
>>>>>>>> Sorry, please take your time.
>>>>>>> Have you seen my other response a bit below?
>>>>>>>
>>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>>> NULL without the job also being freed.
>>>>>>>
>>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Best wishes
>>>>>>>> Emily Deng
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>>>>
>>>>>>>>> On 08.11.19 at 09:52, Deng, Emily wrote:
>>>>>>>>>> Ping.....
>>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>> Emily Deng
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf
>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-
>>>>>>>>>>> gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>> for tdr
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@xxxxxxx>;
>>>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>> for tdr
>>>>>>>>>>>>
>>>>>>>>>>>> On 07.11.19 at 11:25, Emily Deng wrote:
>>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>>>>>>>>>>> Then it will have a null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in one
>>>>>>>>>>> case, when it enters amdgpu_device_gpu_recover, it is already
>>>>>>>>>>> in drm_sched_job_cleanup, and at this time it will go on to
>>>>>>>>>>> free the job. But amdgpu_device_gpu_recover is sometimes
>>>>>>>>>>> faster. At that time, the job is not freed yet, but the s_fence
>>>>>>>>>>> is already NULL.
>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>
>>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>>>>
>>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>> dma_fence_put(job->fence);
>>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>> kfree(job);
>>>>>>>>> The job itself is freed up directly after freeing the
>>>>>>>>> reference to the s_fence.
>>>>>>>>> So you are just papering over a much bigger problem here. This
>>>>>>>>> patch is a clear NAK.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 11 ++++++-----
>>>>>>>>>>>>> 2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>>  	 *
>>>>>>>>>>>>>  	 * job->base holds a reference to parent fence
>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>>  	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>  		job_signaled = true;
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>>
>>>>>>>>>>>>>  			spin_lock(&rq->lock);
>>>>>>>>>>>>>  			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>>>  					if (atomic_read(&bad->karma) >
>>>>>>>>>>>>>  					    bad->sched->hang_limit)
>>>>>>>>>>>>>  						if (entity->guilty)
>>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>  	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>  	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>>  		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>>  					      &s_job->cb)) {
>>>>>>>>>>>>>  			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>  			 *
>>>>>>>>>>>>>  			 * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>>  			 */
>>>>>>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>> +			if (s_job->s_fence)
>>>>>>>>>>>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>
>>>>>>>>>>>>>  			/*
>>>>>>>>>>>>>  			 * We must keep bad job alive for later use during
>>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>>>>  	 * GPU recovers can't run in parallel.
>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>  	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>>>>
>>>>>>>>>>>>>  		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>_______________________________________________
>amd-gfx mailing list
>amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx