On 10/3/23 15:53, Harry Wentland wrote:
On 2023-09-21 10:15, Sebastian Andrzej Siewior wrote:
This is a revert of the commit mentioned below while it is not wrong, as
in the kernel will explode, having migrate_disable() here it is
complete waste of resources.
Additionally commit message is plain wrong the review tag does not make
Not sure I follow what's unhelpful about the review tag with
0c316556d1249 ("drm/amd/display: Disable migration to ensure consistency of per-CPU variable")
I do wish the original patch showed the splat it's attempting
to fix. It apparently made a difference for something, whether
inadvertently or not. I wish I knew what that "something" was.
I did some digging, and it seems like the intention of that patch was to
fix the following splat:
WARNING: CPU: 5 PID: 1062 at
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/dc_fpu.c:71
dc_assert_fp_enabled+0x1a/0x30 [amdgpu]
[...]
CPU: 5 PID: 1062 Comm: Xorg Tainted: G OE
5.15.0-56-generic #62-Ubuntu
Hardware name: ASUS System Product Name/ROG STRIX Z590-F GAMING WIFI,
BIOS 1202 10/27/2021
RIP: 0010:dc_assert_fp_enabled+0x1a/0x30 [amdgpu]
Code: ff 45 31 f6 0f 0b e9 ca fe ff ff e8 90 1c 1f f7 48 c7 c0 00 30 03
00 65 48 03 05 b1 aa 86 3f 8b 00 85 c0 7e 05 c3 cc cc cc cc <0f> 0b c3
cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
RSP: 0000:ffffb89b82a8f118 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8c271cd00000 RCX: 0000000000000000
RDX: ffff8c2708025000 RSI: ffff8c270e8c0000 RDI: ffff8c271cd00000
RBP: ffffb89b82a8f1d0 R08: 0000000000000000 R09: 7fffff6affffffff
R10: ffffb89b82a8f240 R11: 0000000000000000 R12: 0000000000000002
R13: ffff8c271cd00000 R14: ffff8c270e8c0000 R15: ffff8c2708025000
FS: 00007f0570019a80(0000) GS:ffff8c2e3fb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005594858a0058 CR3: 000000010e204001 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
? dcn20_populate_dml_pipes_from_context+0x47/0x1730 [amdgpu]
? __kmalloc_node+0x2cb/0x3a0
dcn32_populate_dml_pipes_from_context+0x2b/0x450 [amdgpu]
dcn32_internal_validate_bw+0x15f9/0x2670 [amdgpu]
dcn32_find_dummy_latency_index_for_fw_based_mclk_switch+0xd0/0x310
[amdgpu]
dcn32_calculate_wm_and_dlg_fpu+0xe6/0x1e50 [amdgpu]
dcn32_calculate_wm_and_dlg+0x46/0x70 [amdgpu]
dcn32_validate_bandwidth+0x1b7/0x3e0 [amdgpu]
dc_validate_global_state+0x32c/0x560 [amdgpu]
dc_validate_with_context+0x6e6/0xd80 [amdgpu]
dc_commit_streams+0x21b/0x500 [amdgpu]
dc_commit_state+0xf3/0x150 [amdgpu]
amdgpu_dm_atomic_commit_tail+0x60d/0x3120 [amdgpu]
? dcn32_internal_validate_bw+0xcf8/0x2670 [amdgpu]
? fill_plane_buffer_attributes+0x1e5/0x560 [amdgpu]
? dcn32_validate_bandwidth+0x1e0/0x3e0 [amdgpu]
? kfree+0x1f7/0x250
? dcn32_validate_bandwidth+0x1e0/0x3e0 [amdgpu]
? dc_validate_global_state+0x32c/0x560 [amdgpu]
? __cond_resched+0x1a/0x50
? __wait_for_common+0x3e/0x150
? fill_plane_buffer_attributes+0x1e5/0x560 [amdgpu]
? usleep_range_state+0x90/0x90
? wait_for_completion_timeout+0x1d/0x30
commit_tail+0xc2/0x170 [drm_kms_helper]
? drm_atomic_helper_swap_state+0x20f/0x370 [drm_kms_helper]
drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
amdgpu_dm_atomic_commit+0x11/0x20 [amdgpu]
drm_atomic_commit+0x47/0x60 [drm]
drm_mode_obj_set_property_ioctl+0x16b/0x420 [drm]
? mutex_lock+0x13/0x50
? drm_mode_createblob_ioctl+0xf6/0x130 [drm]
? drm_mode_obj_find_prop_id+0x90/0x90 [drm]
drm_ioctl_kernel+0xb0/0x100 [drm]
drm_ioctl+0x268/0x4b0 [drm]
? drm_mode_obj_find_prop_id+0x90/0x90 [drm]
? ktime_get_mono_fast_ns+0x4a/0xa0
amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
__x64_sys_ioctl+0x92/0xd0
do_syscall_64+0x59/0xc0
? do_user_addr_fault+0x1e7/0x670
? do_syscall_64+0x69/0xc0
? exit_to_user_mode_prepare+0x37/0xb0
? irqentry_exit_to_user_mode+0x9/0x20
? irqentry_exit+0x1d/0x30
? exc_page_fault+0x89/0x170
entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f05704a2aff
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89
44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0
3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
RSP: 002b:00007ffc8c45a3f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffc8c45a480 RCX: 00007f05704a2aff
RDX: 00007ffc8c45a480 RSI: 00000000c01864ba RDI: 000000000000000e
RBP: 00000000c01864ba R08: 0000000000000077 R09: 00000000cccccccc
R10: 00007f05705a22f0 R11: 0000000000000246 R12: 0000000000000004
R13: 000000000000000e R14: 000000000000000f R15: 00007ffc8c45a8a8
</TASK>
---[ end trace 4deab30bb69df00f ]---
Harry
it any better. The migrate_disable() interface has a fat comment
describing it and it includes the word "undesired" in the headline which
should tickle people to read it before using it.
Initially I assumed it is worded too harsh but now I beg to differ.
The reviewer of the original commit, even not understanding what
migrate_disable() does should ask the following:
- migrate_disable() is added only to the CONFIG_X86 block and it claims
to protect fpu_recursion_depth. Why are the other the architectures
excluded?
- migrate_disable() is added after fpu_recursion_depth was modified.
Shouldn't it be added before the modification or referencing takes
place?
Moving on.
Disabling preemption DOES prevent CPU migration. A task, that can not be
pushed away from the CPU by the scheduler (due to disabled preemption)
can not be pushed or migrated to another CPU.
Disabling migration DOES NOT ensure consistency of per-CPU variables. It
only ensures that the task acts always on the same per-CPU variable. The
task remains preemptible meaning multiple tasks can access the same
per-CPU variable. This in turn leads to inconsistency for the statement
*pcpu -= 1;
with two tasks on one CPU and a preemption point during the RMW
operation:
Task A Task B
read pcpu to reg # 0
inc reg # 0 -> 1
read pcpu to reg # 0
inc reg # 0 -> 1
write reg to pcpu # 1
write reg to pcpu # 1
At the end pcpu reads 1 but should read 2 instead. Boom.
get_cpu_ptr() already contains a preempt_disable() statement. That means
that the per-CPU variable can only be referenced by a single task which
is currently running. The only inconsistency that can occur if the
variable is additionally accessed from an interrupt.
Remove migrate_disable/enable() from dc_fpu_begin/end().
Cc: Tianci Yin <tianci.yin@xxxxxxx>
Cc: Aurabindo Pillai <aurabindo.pillai@xxxxxxx>
Fixes: 0c316556d1249 ("drm/amd/display: Disable migration to ensure consistency of per-CPU variable")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
---
drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
index 172aa10a8800f..86f4c0e046548 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
@@ -91,7 +91,6 @@ void dc_fpu_begin(const char *function_name, const int line)
if (*pcpu == 1) {
#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
- migrate_disable();
kernel_fpu_begin();
#elif defined(CONFIG_PPC64)
if (cpu_has_feature(CPU_FTR_VSX_COMP)) {
@@ -132,7 +131,6 @@ void dc_fpu_end(const char *function_name, const int line)
if (*pcpu <= 0) {
#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
kernel_fpu_end();
- migrate_enable();
#elif defined(CONFIG_PPC64)
if (cpu_has_feature(CPU_FTR_VSX_COMP)) {
disable_kernel_vsx();
--
Hamza