On 28.06.2016 at 12:05, zhoucm1 wrote:
>
> On 2016-06-28 18:03, Christian König wrote:
>> On 28.06.2016 at 11:33, zhoucm1 wrote:
>>>
>>> On 2016-06-28 17:36, Christian König wrote:
>>>> On 28.06.2016 at 09:27, Huang Rui wrote:
>>>>> On Tue, Jun 28, 2016 at 03:04:18PM +0800, Chunming Zhou wrote:
>>>>>> ring_mirror_list is only used in kthread context, so there is no
>>>>>> need for a spinlock; otherwise a deadlock happens during kthread_park.
>>>>>>
>>>>> Yes, in process context we prefer a mutex, because it avoids hogging
>>>>> the CPU while waiting.
>>>>>
>>>>> Reviewed-by: Huang Rui <ray.huang at amd.com>
>>>>
>>>> NAK, the patch won't apply because I changed the irqsave spin lock to
>>>> a normal one quite a while ago. I'm not sure whether Alex has picked
>>>> up that patch yet, though.
>>>>
>>>> You shouldn't use a mutex here when you don't have a reason to do so.
>>>> Spin locks have less overhead, and we don't expect any contention here.
>>>>
>>>> Apart from that, how would this cause a deadlock with kthread_park?
>>> You can apply another patch to drop the hw ring, then run glxgears and
>>> hang the gfx ring with the attached patch; then you will see the
>>> deadlock info.
>>
>> Please provide the deadlock info. Since the spin lock only protects the
>> linked list, we should be able to easily avoid any lock inversion that
>> could lead to a deadlock.
>
> With the latest code you can easily reproduce it with the steps above;
> after switching to a mutex the deadlock info disappears:

That looks like a known issue which should already be fixed. Don't you have
commit 3ebd43f5696147c6a4f808ded4ce1233496c7fd6 "drm/amdgpu: stop trying to
schedule() with a spin held" in your branch?

Regards,
Christian.
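For readers following along: the commit Christian refers to addresses the
general rule that nothing which can sleep (kthread_park(), schedule(),
cancel_work_sync(), ...) may be called while a spinlock is held, because every
other CPU that tries to take the same lock spins until the watchdog fires,
exactly as in the soft-lockup log quoted below. A minimal sketch of the broken
and the fixed pattern, with hypothetical function names rather than the actual
amdgpu recovery code:

/*
 * Hypothetical, simplified illustration -- not the actual amdgpu code.
 * kthread_park() waits for the scheduler thread to park itself and can
 * therefore sleep; doing that with job_list_lock held leaves every other
 * thread that wants the lock spinning, which is what the soft lockup in
 * amd_sched_job_finish() below shows.
 */
static void recover_ring_broken(struct amd_gpu_scheduler *sched)
{
	spin_lock(&sched->job_list_lock);
	kthread_park(sched->thread);    /* may sleep while the lock is held */
	/* ... walk sched->ring_mirror_list and requeue the jobs ... */
	spin_unlock(&sched->job_list_lock);
}

static void recover_ring_fixed(struct amd_gpu_scheduler *sched)
{
	kthread_park(sched->thread);    /* sleep first, without the lock */

	spin_lock(&sched->job_list_lock);
	/* ... walk sched->ring_mirror_list and requeue the jobs ... */
	spin_unlock(&sched->job_list_lock);
}

Whichever lock flavor is chosen, the rule stays the same: nothing that can
sleep may run inside the spinlock's critical section.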
>
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.282892] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kworker/7:0:58]
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283106] Modules linked in: amdgpu(OE) ttm(OE) drm_kms_helper(OE) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) nfsv3(E) bnep(E) rfcomm(E) bluetooth(E) rpcsec_gss_krb5(E) nfsv4(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) binfmt_misc(E) snd_hda_codec_realtek(E) snd_hda_codec_hdmi(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) coretemp(E) snd_pcm(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) joydev(E) kvm_intel(E) snd_seq_device(E) snd_timer(E) kvm(E) snd(E) soundcore(E) irqbypass(E) gpio_ich(E) mxm_wmi(E) serio_raw(E) shpchp(E) wmi(E) i7core_edac(E) edac_core(E) lpc_ich(E) parport_pc(E) mac_hid(E) ppdev(E) lp(E) parport(E) hid_generic(E) usbhid(E) hid(E) firewire_ohci(E) 8139too(E) pata_acpi(E) firewire_core(E) 8139cp(E) crc_itu_t(E) r8169(E) mii(E) pata_jmicron(E) ahci(E) libahci(E)
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283134] CPU: 7 PID: 58 Comm: kworker/7:0 Tainted: G IOE 4.5.0-custom #7
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283136] Hardware name: Gigabyte Technology Co., Ltd. X58A-UD3R/X58A-UD3R, BIOS FE 12/23/2010
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283182] Workqueue: events amd_sched_job_finish [amdgpu]
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283183] task: ffff8801a8490000 ti: ffff8801a848c000 task.ti: ffff8801a848c000
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283184] RIP: 0010:[<ffffffff810c1131>] [<ffffffff810c1131>] native_queued_spin_lock_slowpath+0x171/0x190
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283189] RSP: 0018:ffff8801a848fdc8 EFLAGS: 00000202
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283190] RAX: 0000000000000101 RBX: ffff8800c5ba1030 RCX: 0000000000000001
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283191] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff8800c42a4518
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283192] RBP: ffff8801a848fdc8 R08: 0000000000000101 R09: ffff8801a93d6300
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283193] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8800c42a4468
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283194] R13: ffff8800c42a4518 R14: ffff8800c5ba1000 R15: ffff8800c5ba1030
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283195] FS: 0000000000000000(0000) GS:ffff8801a93c0000(0000) knlGS:0000000000000000
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283196] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283197] CR2: 00007efc3bc21e50 CR3: 0000000001c0c000 CR4: 00000000000006e0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283198] Stack:
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283199] ffff8801a848fdd8 ffffffff81173ef2 ffff8801a848fde8 ffffffff817a19c0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283200] ffff8801a848fe18 ffffffffa06937fb ffff8801a75deb40 ffff8801a93d6300
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283202] ffff8801a93dae00 00000000000001c0 ffff8801a848fe60 ffffffff81090bb0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283203] Call Trace:
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283206] [<ffffffff81173ef2>] queued_spin_lock_slowpath+0xb/0xf
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283209] [<ffffffff817a19c0>] _raw_spin_lock+0x20/0x30
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283234] [<ffffffffa06937fb>] amd_sched_job_finish+0x2b/0xc0 [amdgpu]
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283237] [<ffffffff81090bb0>] process_one_work+0x150/0x3f0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283239] [<ffffffff8109136b>] worker_thread+0x12b/0x4b0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283241] [<ffffffff81091240>] ? rescuer_thread+0x340/0x340
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283242] [<ffffffff81096b22>] kthread+0xd2/0xf0
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283244] [<ffffffff81096a50>] ? kthread_park+0x50/0x50
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283246] [<ffffffff817a210f>] ret_from_fork+0x3f/0x70
> Jun 28 14:34:25 zhoucm1-System-Product-Name kernel: [ 96.283247] [<ffffffff81096a50>] ? kthread_park+0x50/0x50
>
> Regards,
> David Zhou
>
>> BTW: The debug patch you attached is not such a good idea either, because
>> you modify the gfx ring from outside the worker thread.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>> David Zhou
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>>> Change-Id: I906022297015faf14a0ddb5f62a728af3e5f9448
>>>>>> Signed-off-by: Chunming Zhou <David1.Zhou at amd.com>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 12 +++++-------
>>>>>>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  2 +-
>>>>>>  2 files changed, 6 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> index b53cf58..3373d97 100644
>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> @@ -331,10 +331,9 @@ static void amd_sched_job_finish(struct work_struct *work)
>>>>>>  	struct amd_sched_job *s_job = container_of(work, struct amd_sched_job,
>>>>>>  						   finish_work);
>>>>>>  	struct amd_gpu_scheduler *sched = s_job->sched;
>>>>>> -	unsigned long flags;
>>>>>>  	/* remove job from ring_mirror_list */
>>>>>> -	spin_lock_irqsave(&sched->job_list_lock, flags);
>>>>>> +	mutex_lock(&sched->job_list_lock);
>>>>>>  	list_del_init(&s_job->node);
>>>>>>  	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
>>>>>>  		struct amd_sched_job *next;
>>>>>> @@ -348,7 +347,7 @@ static void amd_sched_job_finish(struct work_struct *work)
>>>>>>  		if (next)
>>>>>>  			schedule_delayed_work(&next->work_tdr, sched->timeout);
>>>>>>  	}
>>>>>> -	spin_unlock_irqrestore(&sched->job_list_lock, flags);
>>>>>> +	mutex_unlock(&sched->job_list_lock);
>>>>>>  	sched->ops->free_job(s_job);
>>>>>>  }
>>>>>> @@ -362,15 +361,14 @@ static void amd_sched_job_finish_cb(struct fence *f, struct fence_cb *cb)
>>>>>>  static void amd_sched_job_begin(struct amd_sched_job *s_job)
>>>>>>  {
>>>>>>  	struct amd_gpu_scheduler *sched = s_job->sched;
>>>>>> -	unsigned long flags;
>>>>>> -	spin_lock_irqsave(&sched->job_list_lock, flags);
>>>>>> +	mutex_lock(&sched->job_list_lock);
>>>>>>  	list_add_tail(&s_job->node, &sched->ring_mirror_list);
>>>>>>  	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>>  	    list_first_entry_or_null(&sched->ring_mirror_list,
>>>>>>  				     struct amd_sched_job, node) == s_job)
>>>>>>  		schedule_delayed_work(&s_job->work_tdr, sched->timeout);
>>>>>> -	spin_unlock_irqrestore(&sched->job_list_lock, flags);
>>>>>> +	mutex_unlock(&sched->job_list_lock);
>>>>>>  }
>>>>>>  static void amd_sched_job_timedout(struct work_struct *work)
>>>>>> @@ -564,7 +562,7 @@ int amd_sched_init(struct amd_gpu_scheduler *sched,
>>>>>>  	init_waitqueue_head(&sched->wake_up_worker);
>>>>>>  	init_waitqueue_head(&sched->job_scheduled);
>>>>>>  	INIT_LIST_HEAD(&sched->ring_mirror_list);
>>>>>> -	spin_lock_init(&sched->job_list_lock);
>>>>>> +	mutex_init(&sched->job_list_lock);
>>>>>>  	atomic_set(&sched->hw_rq_count, 0);
>>>>>>  	if (atomic_inc_return(&sched_fence_slab_ref) == 1) {
>>>>>>  		sched_fence_slab = kmem_cache_create(
>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> index 221a515..5675906 100644
>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>>>> @@ -132,7 +132,7 @@ struct amd_gpu_scheduler {
>>>>>>  	atomic_t			hw_rq_count;
>>>>>>  	struct task_struct		*thread;
>>>>>>  	struct list_head		ring_mirror_list;
>>>>>> -	spinlock_t			job_list_lock;
>>>>>> +	struct mutex			job_list_lock;
>>>>>>  };
>>>>>>  int amd_sched_init(struct amd_gpu_scheduler *sched,
>>>>>> --
>>>>>> 1.9.1
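The patch itself is mechanical: every spin_lock_irqsave()/spin_unlock_irqrestore()
pair on job_list_lock becomes mutex_lock()/mutex_unlock(), and the field changes
from spinlock_t to struct mutex. The trade-off being argued can be summarised in
a small, self-contained sketch using a hypothetical stand-in structure (not the
scheduler code): a spinlock is cheaper and usable from atomic context, but its
holder must never sleep; a mutex may only be taken in process context, but its
holder is allowed to sleep.

/* Hypothetical stand-in for the scheduler's mirror list, for illustration only. */
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>

struct mirror_list {
	struct list_head head;
	spinlock_t	 slock;	/* cheap, usable in IRQ context, holder must not sleep */
	struct mutex	 mlock;	/* process context only, holder may sleep              */
};

static void mirror_list_init(struct mirror_list *ml)
{
	INIT_LIST_HEAD(&ml->head);
	spin_lock_init(&ml->slock);
	mutex_init(&ml->mlock);
}

/* With a mutex, the critical section is allowed to sleep. */
static void mirror_list_del_mutex(struct mirror_list *ml, struct list_head *node)
{
	mutex_lock(&ml->mlock);
	list_del_init(node);
	mutex_unlock(&ml->mlock);
}

/* With a spinlock, the critical section must stay short and non-sleeping. */
static void mirror_list_del_spin(struct mirror_list *ml, struct list_head *node)
{
	spin_lock(&ml->slock);
	list_del_init(node);
	spin_unlock(&ml->slock);
}

Christian's objection, as stated above, is that for a short list operation with
no expected contention and nothing sleeping inside the critical section, the
plain spinlock is the cheaper tool; the reported lockup comes from sleeping
while the lock is held elsewhere, not from the lock type itself.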
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx