Re: [PATCH] drm/amdgpu: grab extra fence reference for drm_sched_job_add_dependency

On 05.01.23 at 02:44, Mikhail Gavrilov wrote:
On Tue, Jan 3, 2023 at 7:26 PM Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
On Tue, Jan 3, 2023 at 3:34 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
I assume that this was already upstreamed while I was on sick leave?
Yes.

Alex

What about commit 2fdb8a8f07c2f1353770a324fd19b8114e4329ac ?

That one should be fixed by:

commit 9f1ecfc5dcb47a7ca37be47b0eaca0f37f1ae93d
Author: Dmitry Osipenko <dmitry.osipenko@xxxxxxxxxxxxx>
Date:   Wed Nov 23 03:13:03 2022 +0300

    drm/scheduler: Fix lockup in drm_sched_entity_kill()

    The drm_sched_entity_kill() is invoked twice by drm_sched_entity_destroy()
    while userspace process is exiting or being killed. First time it's invoked
    when sched entity is flushed and second time when entity is released. This
    causes a lockup within wait_for_completion(entity_idle) due to how completion
    API works.

    Calling wait_for_completion() more times than complete() was invoked is a
    error condition that causes lockup because completion internally uses
    counter for complete/wait calls. The complete_all() must be used instead
    in such cases.

    This patch fixes lockup of Panfrost driver that is reproducible by killing
    any application in a middle of 3d drawing operation.

    Fixes: 2fdb8a8f07c2 ("drm/scheduler: rework entity flush, kill and fini")
    Signed-off-by: Dmitry Osipenko <dmitry.osipenko@xxxxxxxxxxxxx>
    Reviewed-by: Christian König <christian.koenig@xxxxxxx>
    Link: https://patchwork.freedesktop.org/patch/msgid/20221123001303.533968-1-dmitry.osipenko@xxxxxxxxxxxxx

Regards,
Christian.
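
For readers following along, the counting behaviour the commit message
describes can be sketched in user space. This is only a simplified model,
not the kernel implementation: the names completion_model, model_complete(),
model_complete_all(), model_try_wait() and second_wait_would_block() are
made up for illustration, and a wait that would block in the kernel is
represented here by a false return value. It assumes, as the commit message
says, that a completion keeps an internal counter for complete/wait calls.

```c
/* User-space model of a completion counter. Hypothetical names, not kernel code. */
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct completion_model {
    unsigned int done;  /* number of wake-up tokens still available */
};

static inline void model_init(struct completion_model *c)
{
    c->done = 0;
}

/* Models complete(): add one token, enough for exactly one waiter. */
static inline void model_complete(struct completion_model *c)
{
    if (c->done != UINT_MAX)
        c->done++;
}

/* Models complete_all(): saturate the counter so every waiter passes. */
static inline void model_complete_all(struct completion_model *c)
{
    c->done = UINT_MAX;
}

/*
 * Non-blocking stand-in for wait_for_completion(): returns false in the
 * situation where the real call would sleep until someone signals.
 */
static inline bool model_try_wait(struct completion_model *c)
{
    if (c->done == 0)
        return false;       /* no token left: a real waiter would block */
    if (c->done != UINT_MAX)
        c->done--;          /* consume one token */
    return true;
}

/* One signal followed by two waits, as in the flush-then-release case. */
static inline bool second_wait_would_block(bool use_complete_all)
{
    struct completion_model c;
    bool first;

    model_init(&c);
    if (use_complete_all)
        model_complete_all(&c);
    else
        model_complete(&c);

    first = model_try_wait(&c);  /* first wait always finds a token */
    assert(first);
    (void)first;

    return !model_try_wait(&c);  /* true means the second wait would hang */
}
```

In this model second_wait_would_block(false) reports that the second wait
would hang, while second_wait_would_block(true) reports that it would not,
which matches the reasoning given for switching to complete_all().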

I checked twice and I'm sure that this commit is the reason why I
can't terminate some games (and other processes).
Demonstration: https://youtu.be/O0AfjiMdFGw
I also attached a full kernel log.

INFO: task ZAT.exe:4745 blocked for more than 122 seconds.
       Tainted: G        W    L
6.1.0-rc1-13-2fdb8a8f07c2f1353770a324fd19b8114e4329ac+ #18
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:ZAT.exe         state:D stack:12608 pid:4745  ppid:1      flags:0x20004006
Call Trace:
  <TASK>
  __schedule+0x4c5/0x1740
  schedule+0x5d/0xe0
  schedule_timeout+0xf0/0x130
  __wait_for_common+0xa9/0x1f0
  ? usleep_range_state+0x90/0x90
  drm_sched_entity_kill.part.0+0x4d/0x210 [gpu_sched]
  drm_sched_entity_flush+0xa0/0x260 [gpu_sched]
  amdgpu_ctx_mgr_entity_flush+0x83/0xd0 [amdgpu]
  amdgpu_flush+0x25/0x40 [amdgpu]
  filp_close+0x31/0x70
  put_files_struct+0x78/0xf0
  do_exit+0x364/0xc30
  ? sched_clock_cpu+0xb/0xc0
  do_group_exit+0x33/0xa0
  get_signal+0xb41/0xb50
  arch_do_signal_or_restart+0x44/0x7a0
  exit_to_user_mode_prepare+0x17b/0x250
  syscall_exit_to_user_mode+0x16/0x50
  __do_fast_syscall_32+0x94/0xf0
2132]: Reached target exit.target - Exit the Session.
1]: user@1000.service: Killing process 4402 (reaper) with signal SIGKILL.
1]: user@1000.service: Killing process 4745 (ZAT.exe) with signal SIGKILL.
1]: Started plymouth-reboot.service - Show Plymouth Reboot Screen.
: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295
subj=system_u:system_r:init_t:s0 msg='unit=plymouth-reboot
comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=?
terminal=? res=succe>
1]: plymouth-switch-root-initramfs.service - Tell Plymouth To Jump To
initramfs was skipped because of an unmet condition check
(ConditionPathExists=/run/initramfs/bin/sh).
INFO: task ZAT.exe:4745 blocked for more than 122 seconds.
       Tainted: G        W    L
6.1.0-rc1-13-2fdb8a8f07c2f1353770a324fd19b8114e4329ac+ #18
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:ZAT.exe         state:D stack:12608 pid:4745  ppid:1      flags:0x20004006
Call Trace:
  <TASK>
  __schedule+0x4c5/0x1740
  schedule+0x5d/0xe0
  schedule_timeout+0xf0/0x130
  __wait_for_common+0xa9/0x1f0
  ? usleep_range_state+0x90/0x90
  drm_sched_entity_kill.part.0+0x4d/0x210 [gpu_sched]
  drm_sched_entity_flush+0xa0/0x260 [gpu_sched]
  amdgpu_ctx_mgr_entity_flush+0x83/0xd0 [amdgpu]
  amdgpu_flush+0x25/0x40 [amdgpu]
  filp_close+0x31/0x70
  put_files_struct+0x78/0xf0
  do_exit+0x364/0xc30
  ? sched_clock_cpu+0xb/0xc0
  do_group_exit+0x33/0xa0
  get_signal+0xb41/0xb50
  arch_do_signal_or_restart+0x44/0x7a0
  exit_to_user_mode_prepare+0x17b/0x250
  syscall_exit_to_user_mode+0x16/0x50
  __do_fast_syscall_32+0x94/0xf0
  ? __do_fast_syscall_32+0x94/0xf0
  ? lockdep_hardirqs_on+0x7d/0x100
  ? __do_fast_syscall_32+0x94/0xf0
  ? __do_fast_syscall_32+0x94/0xf0
  do_fast_syscall_32+0x2f/0x70
  entry_SYSCALL_compat_after_hwframe+0x62/0x6a
RIP: 0023:0xf7f6b579
RSP: 002b:00000000e8dffd40 EFLAGS: 00200282 ORIG_RAX: 00000000000000f0
RAX: fffffffffffffe00 RBX: 00000000f0b54dcc RCX: 0000000000000189
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000ffffffff R08: 00000000e8dffd40 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000200282 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  </TASK>

Showing all locks held in the system:
1 lock held by rcu_tasks_kthre/11:
  #0: ffffffffae368a20 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at:
rcu_tasks_one_gp+0x2b/0x3e0
1 lock held by rcu_tasks_rude_/12:
  #0: ffffffffae368760 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at:
rcu_tasks_one_gp+0x2b/0x3e0
1 lock held by rcu_tasks_trace/13:
  #0: ffffffffae368460 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3},
at: rcu_tasks_one_gp+0x2b/0x3e0
1 lock held by khungtaskd/182:
  #0: ffffffffae369520 (rcu_read_lock){....}-{1:2}, at:
debug_show_all_locks+0x15/0x16b
2 locks held by kworker/25:1/215:
1 lock held by systemd-journal/852:
1 lock held by ZAT.exe/4745:
  #0: ffff9b087c337cf8 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x3a/0xd0 [amdgpu]

=============================================
1]: user@1000.service: Processes still around after final SIGKILL.
Entering failed mode.
1]: user@1000.service: Failed with result 'timeout'.
1]: Stopped user@1000.service - User Manager for UID 1000.





