On 28.06.2016 at 12:05, zhoucm1 wrote:
>
> On 2016-06-28 18:03, Christian König wrote:
>> On 28.06.2016 at 11:33, zhoucm1 wrote:
>>>
>>> On 2016-06-28 17:36, Christian König wrote:
>>>> On 28.06.2016 at 09:27, Huang Rui wrote:
>>>>> On Tue, Jun 28, 2016 at 03:04:18PM +0800, Chunming Zhou wrote:
>>>>>> ring_mirror_list is only used in kthread context, so there is no
>>>>>> need for a spinlock; otherwise a deadlock happens during kthread_park.
>>>>>>
>>>>> Yes, in process context we prefer a mutex, because it avoids hogging
>>>>> the CPU while waiting.
>>>>>
>>>>> Reviewed-by: Huang Rui <ray.huang at amd.com>
>>>>
>>>> NAK, the patch won't apply because I changed the irqsave spin lock to
>>>> a normal one quite a while ago. I'm not sure whether Alex has picked
>>>> up that patch yet, though.
>>>>
>>>> You shouldn't use a mutex here when you don't have a reason to do so.
>>>> Spin locks have less overhead, and we don't expect any contention here.
>>>>
>>>> Apart from that, how would this cause a deadlock with kthread_park?
>>> You can apply another patch to drop the hw ring, then run glxgears and
>>> hang the gfx ring with the attached patch; then you will see the
>>> deadlock info.
>>
>> Please provide the deadlock info. Since the spin lock only protects the
>> linked list, we should be able to easily avoid any lock inversion that
>> could lead to a deadlock.
>
> With the latest code you can easily reproduce it with the steps above;
> after switching to a mutex the deadlock info disappears:

That looks like a known issue which should already be fixed. Don't you have
commit 3ebd43f5696147c6a4f808ded4ce1233496c7fd6 "drm/amdgpu: stop trying to
schedule() with a spin held" in your branch?

Regards,
Christian.
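For readers following along: the commit Christian refers to addresses the
general rule that nothing which can sleep (kthread_park(), schedule(),
cancel_work_sync(), ...) may be called while a spinlock is held, because every
other CPU that tries to take the same lock spins until the watchdog fires,
exactly as in the soft-lockup log quoted below. A minimal sketch of the broken
and the fixed pattern, with hypothetical function names rather than the actual
amdgpu recovery code:

/*
 * Hypothetical, simplified illustration -- not the actual amdgpu code.
 * kthread_park() waits for the scheduler thread to park itself and can
 * therefore sleep; doing that with job_list_lock held leaves every other
 * thread that wants the lock spinning, which is what the soft lockup in
 * amd_sched_job_finish() below shows.
 */
static void recover_ring_broken(struct amd_gpu_scheduler *sched)
{
	spin_lock(&sched->job_list_lock);
	kthread_park(sched->thread);    /* may sleep while the lock is held */
	/* ... walk sched->ring_mirror_list and requeue the jobs ... */
	spin_unlock(&sched->job_list_lock);
}

static void recover_ring_fixed(struct amd_gpu_scheduler *sched)
{
	kthread_park(sched->thread);    /* sleep first, without the lock */

	spin_lock(&sched->job_list_lock);
	/* ... walk sched->ring_mirror_list and requeue the jobs ... */
	spin_unlock(&sched->job_list_lock);
}

Whichever lock flavor is chosen, the rule stays the same: nothing that can
sleep may run inside the spinlock's critical section.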
>
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.282892] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kworker/7:0:58]
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283106] Modules linked in: amdgpu(OE) ttm(OE) drm_kms_helper(OE) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) nfsv3(E) bnep(E) rfcomm(E) bluetooth(E) rpcsec_gss_krb5(E) nfsv4(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) binfmt_misc(E) snd_hda_codec_realtek(E) snd_hda_codec_hdmi(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) coretemp(E) snd_pcm(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) joydev(E) kvm_intel(E) snd_seq_device(E) snd_timer(E) kvm(E) snd(E) soundcore(E) irqbypass(E) gpio_ich(E) mxm_wmi(E) serio_raw(E) shpchp(E) wmi(E) i7core_edac(E) edac_core(E) lpc_ich(E) parport_pc(E) mac_hid(E) ppdev(E) lp(E) parport(E) hid_generic(E) usbhid(E) hid(E) firewire_ohci(E) 8139too(E) pata_acpi(E) firewire_core(E) 8139cp(E) crc_itu_t(E) r8169(E) mii(E) pata_jmicron(E) ahci(E) libahci(E)
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283134] CPU: 7 PID: 58 Comm: kworker/7:0 Tainted: G IOE 4.5.0-custom #7
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283136] Hardware name: Gigabyte Technology Co., Ltd. X58A-UD3R/X58A-UD3R, BIOS FE 12/23/2010
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283182] Workqueue: events amd_sched_job_finish [amdgpu]
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283183] task: ffff8801a8490000 ti: ffff8801a848c000 task.ti: ffff8801a848c000
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283184] RIP: 0010:[<ffffffff810c1131>] [<ffffffff810c1131>] native_queued_spin_lock_slowpath+0x171/0x190
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283189] RSP: 0018:ffff8801a848fdc8 EFLAGS: 00000202
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283190] RAX: 0000000000000101 RBX: ffff8800c5ba1030 RCX: 0000000000000001
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283191] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff8800c42a4518
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283192] RBP: ffff8801a848fdc8 R08: 0000000000000101 R09: ffff8801a93d6300
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283193] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8800c42a4468
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283194] R13: ffff8800c42a4518 R14: ffff8800c5ba1000 R15: ffff8800c5ba1030
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283195] FS: 0000000000000000(0000) GS:ffff8801a93c0000(0000) knlGS:0000000000000000
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283196] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283197] CR2: 00007efc3bc21e50 CR3: 0000000001c0c000 CR4: 00000000000006e0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283198] Stack:
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283199] ffff8801a848fdd8 ffffffff81173ef2 ffff8801a848fde8 ffffffff817a19c0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283200] ffff8801a848fe18 ffffffffa06937fb ffff8801a75deb40 ffff8801a93d6300
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283202] ffff8801a93dae00 00000000000001c0 ffff8801a848fe60 ffffffff81090bb0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283203] Call Trace:
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283206] [<ffffffff81173ef2>] queued_spin_lock_slowpath+0xb/0xf
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283209] [<ffffffff817a19c0>] _raw_spin_lock+0x20/0x30
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283234] [<ffffffffa06937fb>] amd_sched_job_finish+0x2b/0xc0 [amdgpu]
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283237] [<ffffffff81090bb0>] process_one_work+0x150/0x3f0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283239] [<ffffffff8109136b>] worker_thread+0x12b/0x4b0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283241] [<ffffffff81091240>] ? rescuer_thread+0x340/0x340
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283242] [<ffffffff81096b22>] kthread+0xd2/0xf0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283244] [<ffffffff81096a50>] ? kthread_park+0x50/0x50
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283246] [<ffffffff817a210f>] ret_from_fork+0x3f/0x70
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283247] [<ffffffff81096a50>] ? kthread_park+0x50/0x50
>
> Regards,
> David Zhou
>
>> BTW: The debug patch you attached is not such a good idea either, because
>> you modify the gfx ring from outside the worker thread.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>> David Zhou
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>>> Change-Id: I906022297015faf14a0ddb5f62a728af3e5f9448
>>>>>> Signed-off-by: Chunming Zhou <David1.Zhou at amd.com>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 12 +++++-------
>>>>>>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  2 +-
>>>>>>  2 files changed, 6 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> index b53cf58..3373d97 100644
>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> @@ -331,10 +331,9 @@ static void amd_sched_job_finish(struct work_struct *work)
>>>>>>  	struct amd_sched_job *s_job = container_of(work, struct amd_sched_job,
>>>>>>  						   finish_work);
>>>>>>  	struct amd_gpu_scheduler *sched = s_job->sched;
>>>>>> -	unsigned long flags;
>>>>>>  	/* remove job from ring_mirror_list */
>>>>>> -	spin_lock_irqsave(&sched->job_list_lock, flags);
>>>>>> +	mutex_lock(&sched->job_list_lock);
>>>>>>  	list_del_init(&s_job->node);
>>>>>>  	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
>>>>>>  		struct amd_sched_job *next;
>>>>>> @@ -348,7 +347,7 @@ static void amd_sched_job_finish(struct work_struct *work)
>>>>>>  		if (next)
>>>>>>  			schedule_delayed_work(&next->work_tdr, sched->timeout);
>>>>>>  	}
>>>>>> -	spin_unlock_irqrestore(&sched->job_list_lock, flags);
>>>>>> +	mutex_unlock(&sched->job_list_lock);
>>>>>>  	sched->ops->free_job(s_job);
>>>>>>  }
>>>>>> @@ -362,15 +361,14 @@ static void amd_sched_job_finish_cb(struct fence *f, struct fence_cb *cb)
>>>>>>  static void amd_sched_job_begin(struct amd_sched_job *s_job)
>>>>>>  {
>>>>>>  	struct amd_gpu_scheduler *sched = s_job->sched;
>>>>>> -	unsigned long flags;
>>>>>> -	spin_lock_irqsave(&sched->job_list_lock, flags);
>>>>>> +	mutex_lock(&sched->job_list_lock);
>>>>>>  	list_add_tail(&s_job->node, &sched->ring_mirror_list);
>>>>>>  	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>  	    list_first_entry_or_null(&sched->ring_mirror_list,
>>>>>>  				     struct amd_sched_job, node) == s_job)
>>>>>>  		schedule_delayed_work(&s_job->work_tdr, sched->timeout);
>>>>>> -	spin_unlock_irqrestore(&sched->job_list_lock, flags);
>>>>>> +	mutex_unlock(&sched->job_list_lock);
>>>>>>  }
>>>>>>  static void amd_sched_job_timedout(struct work_struct *work)
>>>>>> @@ -564,7 +562,7 @@ int amd_sched_init(struct amd_gpu_scheduler *sched,
>>>>>>  	init_waitqueue_head(&sched->wake_up_worker);
>>>>>>  	init_waitqueue_head(&sched->job_scheduled);
>>>>>>  	INIT_LIST_HEAD(&sched->ring_mirror_list);
>>>>>> -	spin_lock_init(&sched->job_list_lock);
>>>>>> +	mutex_init(&sched->job_list_lock);
>>>>>>  	atomic_set(&sched->hw_rq_count, 0);
>>>>>>  	if (atomic_inc_return(&sched_fence_slab_ref) == 1) {
>>>>>>  		sched_fence_slab = kmem_cache_create(
>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> index 221a515..5675906 100644
>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> @@ -132,7 +132,7 @@ struct amd_gpu_scheduler {
>>>>>>  	atomic_t			hw_rq_count;
>>>>>>  	struct task_struct		*thread;
>>>>>>  	struct list_head		ring_mirror_list;
>>>>>> -	spinlock_t			job_list_lock;
>>>>>> +	struct mutex			job_list_lock;
>>>>>>  };
>>>>>>  int amd_sched_init(struct amd_gpu_scheduler *sched,
>>>>>> --
>>>>>> 1.9.1
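The patch itself is mechanical: every spin_lock_irqsave()/spin_unlock_irqrestore()
pair on job_list_lock becomes mutex_lock()/mutex_unlock(), and the field changes
from spinlock_t to struct mutex. The trade-off being argued can be summarised in
a small, self-contained sketch using a hypothetical stand-in structure (not the
scheduler code): a spinlock is cheaper and usable from atomic context, but its
holder must never sleep; a mutex may only be taken in process context, but its
holder is allowed to sleep.

/* Hypothetical stand-in for the scheduler's mirror list, for illustration only. */
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>

struct mirror_list {
	struct list_head head;
	spinlock_t	 slock;	/* cheap, usable in IRQ context, holder must not sleep */
	struct mutex	 mlock;	/* process context only, holder may sleep              */
};

static void mirror_list_init(struct mirror_list *ml)
{
	INIT_LIST_HEAD(&ml->head);
	spin_lock_init(&ml->slock);
	mutex_init(&ml->mlock);
}

/* With a mutex, the critical section is allowed to sleep. */
static void mirror_list_del_mutex(struct mirror_list *ml, struct list_head *node)
{
	mutex_lock(&ml->mlock);
	list_del_init(node);
	mutex_unlock(&ml->mlock);
}

/* With a spinlock, the critical section must stay short and non-sleeping. */
static void mirror_list_del_spin(struct mirror_list *ml, struct list_head *node)
{
	spin_lock(&ml->slock);
	list_del_init(node);
	spin_unlock(&ml->slock);
}

Christian's objection, as stated above, is that for a short list operation with
no expected contention and nothing sleeping inside the critical section, the
plain spinlock is the cheaper tool; the reported lockup comes from sleeping
while the lock is held elsewhere, not from the lock type itself.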
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx