On 6/21/23 14:51, Mathieu Desnoyers wrote:
> On 6/21/23 12:36, Swapnil Sapkal wrote:
>> Hello Mathieu,
>>
>> [...]
>>
>>> I suspect the regression is caused by the mm_count cache line bouncing.
>>>
>>> Please try with this additional patch applied:
>>> https://lore.kernel.org/lkml/20230515143536.114960-1-mathieu.desnoyers@xxxxxxxxxxxx/
>>
>> Thanks for the suggestion. I tried out with the patch you suggested. I am
>> seeing improvement in hackbench numbers with mm_count padding. But this is
>> not matching with what we achieved through reverting the new mm_cid patch.
>>
>> Below are the results on the 1 Socket 4th Generation EPYC Processor
>> (1 x 96C/192T):
>>
>> Threads:
>>
>> Test:             Base (v6.4-rc1)   Base + new_mmcid_reverted   Base + mm_count_padding
>>  1-groups:         5.23 (0.00 pct)          4.61 (11.85 pct)           5.11 (2.29 pct)
>>  2-groups:         4.99 (0.00 pct)          4.72 (5.41 pct)            5.00 (-0.20 pct)
>>  4-groups:         5.96 (0.00 pct)          4.87 (18.28 pct)           5.86 (1.67 pct)
>>  8-groups:         6.58 (0.00 pct)          5.44 (17.32 pct)           6.20 (5.77 pct)
>> 16-groups:        11.48 (0.00 pct)          8.07 (29.70 pct)          10.68 (6.96 pct)
>>
>> Processes:
>>
>> Test:             Base (v6.4-rc1)   Base + new_mmcid_reverted   Base + mm_count_padding
>>  1-groups:         5.19 (0.00 pct)          4.90 (5.58 pct)            5.19 (0.00 pct)
>>  2-groups:         5.44 (0.00 pct)          5.39 (0.91 pct)            5.39 (0.91 pct)
>>  4-groups:         5.69 (0.00 pct)          5.64 (0.87 pct)            5.64 (0.87 pct)
>>  8-groups:         6.08 (0.00 pct)          6.01 (1.15 pct)            6.04 (0.65 pct)
>> 16-groups:        10.87 (0.00 pct)         10.83 (0.36 pct)           10.93 (-0.55 pct)
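>>
>> For context, the padding patch essentially keeps mm_count on its own cache
>> line, along the lines of the sketch below (a simplified illustration with
>> made-up struct and field names, not the actual patch):
>>
>> #include <linux/cache.h>   /* ____cacheline_aligned_in_smp */
>> #include <linux/types.h>   /* atomic_t */
>>
>> struct mm_struct_like {
>>         /* hot, mostly-read fields (illustrative) */
>>         unsigned long pgd_like;
>>         unsigned long flags_like;
>>
>>         /*
>>          * Written from many CPUs via mmgrab()/mmdrop()-style atomics;
>>          * isolate it so those writes stop invalidating the cache line
>>          * shared with the fields above.
>>          */
>>         atomic_t mm_count ____cacheline_aligned_in_smp;
>>
>>         /* push whatever follows onto the next cache line as well */
>>         unsigned long other_hot_fields ____cacheline_aligned_in_smp;
>> };
>>
>> With mm_count isolated like this, the frequent reference count updates no
>> longer bounce the line holding the neighbouring hot fields.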
>>
>> The ibs profile shows that the function __switch_to_asm() is at the top in
>> the baseline run and is not seen with the mm_count padding patch. I am
>> attaching the full ibs profile data for all the 3 runs:
>>
>> # Base (v6.4-rc1)
>>
>> Threads:
>> Total time: 11.486 [sec]
>>
>>    5.15%  sched-messaging  [kernel.vmlinux]  [k] __switch_to_asm
>>    4.31%  sched-messaging  [kernel.vmlinux]  [k] copyout
>>    4.29%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>    4.22%  sched-messaging  [kernel.vmlinux]  [k] copyin
>>    3.92%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
>>    2.91%  sched-messaging  [kernel.vmlinux]  [k] __schedule
>>    2.34%  swapper          [kernel.vmlinux]  [k] __switch_to_asm
>>    2.10%  sched-messaging  [kernel.vmlinux]  [k] prepare_to_wait_event
>>    2.10%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
>>    2.07%  sched-messaging  [kernel.vmlinux]  [k] finish_task_switch.isra.0
>>    2.00%  sched-messaging  [kernel.vmlinux]  [k] pipe_write
>>    1.82%  sched-messaging  [kernel.vmlinux]  [k] check_preemption_disabled
>>    1.73%  sched-messaging  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
>>    1.52%  sched-messaging  [kernel.vmlinux]  [k] __entry_text_start
>>    1.49%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>    1.45%  sched-messaging  libc.so.6         [.] write
>>    1.44%  swapper          [kernel.vmlinux]  [k] native_sched_clock
>>    1.38%  sched-messaging  [kernel.vmlinux]  [k] psi_group_change
>>    1.38%  sched-messaging  [kernel.vmlinux]  [k] pipe_read
>>    1.37%  sched-messaging  libc.so.6         [.] read
>>    1.06%  sched-messaging  [kernel.vmlinux]  [k] vfs_read
>>    1.01%  swapper          [kernel.vmlinux]  [k] psi_group_change
>>    1.00%  sched-messaging  [kernel.vmlinux]  [k] update_curr
>>
>> # Base + mm_count_padding
>>
>> Threads:
>> Total time: 11.384 [sec]
>>
>>    4.43%  sched-messaging  [kernel.vmlinux]  [k] copyin
>>    4.39%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>    4.07%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
>>    4.07%  sched-messaging  [kernel.vmlinux]  [k] copyout
>>    2.49%  sched-messaging  [kernel.vmlinux]  [k] entry_SYSCALL_64
>>    2.37%  sched-messaging  [kernel.vmlinux]  [k] update_cfs_group
>>    2.19%  sched-messaging  [kernel.vmlinux]  [k] pipe_write
>>    2.00%  sched-messaging  [kernel.vmlinux]  [k] check_preemption_disabled
>>    1.93%  swapper          [kernel.vmlinux]  [k] update_load_avg
>>    1.81%  sched-messaging  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
>>    1.69%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
>>    1.58%  sched-messaging  libc.so.6         [.] write
>>    1.53%  sched-messaging  [kernel.vmlinux]  [k] psi_group_change
>>    1.50%  sched-messaging  libc.so.6         [.] read
>>    1.50%  sched-messaging  [kernel.vmlinux]  [k] pipe_read
>>    1.39%  sched-messaging  [kernel.vmlinux]  [k] update_load_avg
>>    1.39%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>    1.30%  sched-messaging  [kernel.vmlinux]  [k] update_curr
>>    1.28%  swapper          [kernel.vmlinux]  [k] psi_group_change
>>    1.16%  sched-messaging  [kernel.vmlinux]  [k] vfs_read
>>    1.12%  sched-messaging  [kernel.vmlinux]  [k] vfs_write
>>    1.10%  sched-messaging  [kernel.vmlinux]  [k] entry_SYSRETQ_unsafe_stack
>>    1.09%  sched-messaging  [kernel.vmlinux]  [k] __switch_to_asm
>>    1.08%  sched-messaging  [kernel.vmlinux]  [k] do_syscall_64
>>    1.06%  sched-messaging  [kernel.vmlinux]  [k] select_task_rq_fair
>>    1.03%  swapper          [kernel.vmlinux]  [k] update_cfs_group
>>    1.00%  swapper          [kernel.vmlinux]  [k] rb_insert_color
>>
>> # Base + reverted_new_mm_cid
>>
>> Threads:
>> Total time: 7.847 [sec]
>>
>>   12.14%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>    8.86%  swapper          [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>    6.13%  sched-messaging  [kernel.vmlinux]  [k] copyin
>>    5.54%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
>>    3.59%  sched-messaging  [kernel.vmlinux]  [k] copyout
>>    2.61%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>    2.48%  sched-messaging  [kernel.vmlinux]  [k] pipe_write
>>    2.33%  sched-messaging  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
>>    2.01%  sched-messaging  [kernel.vmlinux]  [k] check_preemption_disabled
>>    1.96%  sched-messaging  [kernel.vmlinux]  [k] __entry_text_start
>>    1.91%  sched-messaging  libc.so.6         [.] write
>>    1.77%  sched-messaging  libc.so.6         [.] read
>>    1.64%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
>>    1.58%  sched-messaging  [kernel.vmlinux]  [k] pipe_read
>>    1.52%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
>>    1.38%  sched-messaging  [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
>>    1.35%  sched-messaging  [kernel.vmlinux]  [k] vfs_write
>>    1.28%  sched-messaging  [kernel.vmlinux]  [k] entry_SYSRETQ_unsafe_stack
>>    1.28%  sched-messaging  [kernel.vmlinux]  [k] vfs_read
>>    1.25%  sched-messaging  [kernel.vmlinux]  [k] do_syscall_64
>>    1.22%  sched-messaging  [kernel.vmlinux]  [k] __fget_light
>>    1.18%  sched-messaging  [kernel.vmlinux]  [k] mutex_lock
>>    1.12%  sched-messaging  [kernel.vmlinux]  [k] file_update_time
>>    1.04%  sched-messaging  [kernel.vmlinux]  [k] _copy_from_iter
>>    1.01%  sched-messaging  [kernel.vmlinux]  [k] current_time
>>
>> So with the reverted new_mm_cid patch, we are seeing a lot of time being
>> spent in native_queued_spin_lock_slowpath and yet, hackbench finishes
>> faster.
>>
>> I will keep digging into this; please let me know if you have any pointers
>> for me.
>
> Do you have CONFIG_SECURITY_APPARMOR=y ?
> Can you try without ?
>
> I notice that apparmor_file_permission appears near the top of your
> profiles, and apparmor uses an internal aa_buffers_lock spinlock, which
> could possibly explain the top hits for native_queued_spin_lock_slowpath.
> My current suspicion is that the raw spinlock that was taken by
> "Base + reverted_new_mm_cid" changed the contention pattern on the
> apparmor lock enough to speed things up by pure accident.
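
For the "without apparmor" run, rebuilding with CONFIG_SECURITY_APPARMOR=n
would be the surest way; booting the same kernel with apparmor=0 on the
command line may also be enough to take it out of the picture.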
If apparmor happens to be the culprit here, we should have a hard look at this commit:
commit df323337e50 ("apparmor: Use a memory pool instead per-CPU caches")

which turned a per-cpu cache into a global memory pool protected by a
spinlock. It may benefit RT, but it does not appear to be so great at
scaling.
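
To illustrate the scaling concern, here is a rough sketch of the two
allocation schemes (simplified, not apparmor's actual code; the struct and
function names below are made up):

#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct scratch_buf {
	struct list_head list;
	char data[1024];
};

/*
 * Global pool: every CPU serializes on pool_lock, so with all 192 hardware
 * threads doing file permission checks the lock and its cache line become
 * a contention point, which is what native_queued_spin_lock_slowpath in
 * the profiles would reflect.
 */
static DEFINE_SPINLOCK(pool_lock);
static LIST_HEAD(pool);

static struct scratch_buf *pool_get(void)
{
	struct scratch_buf *buf = NULL;

	spin_lock(&pool_lock);
	if (!list_empty(&pool)) {
		buf = list_first_entry(&pool, struct scratch_buf, list);
		list_del(&buf->list);
	}
	spin_unlock(&pool_lock);
	return buf;	/* NULL means the caller falls back to allocating */
}

/*
 * Per-CPU cache (the scheme the commit above replaced): the fast path only
 * touches CPU-local data, so there is no shared lock or cache line to
 * contend on. Preemption handling is omitted here.
 */
static DEFINE_PER_CPU(struct scratch_buf *, cpu_buf);

static struct scratch_buf *percpu_get(void)
{
	return this_cpu_xchg(cpu_buf, NULL);	/* NULL means take the slow path */
}

Whether the real aa_buffers_lock behaves this badly here is still a guess,
but it would be consistent with native_queued_spin_lock_slowpath showing up
prominently in all three profiles.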

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com