Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

[snip]

Hi Mel and Qais,

I was able to synthesize results from some experiments which I conducted
on my machine. You can find them below with descriptions.

1. Description of the configuration and hardware

My machine is an HP server: 2 sockets, 24 CPUs, x86 64-bit
(4 NUMA nodes, AMD Opteron 6174, L2 512KB/cpu, L3 6MB/node, RAM 40GB/node).

The results presented here come from openSUSE 15.1 (apart from the last
experiment), with kernels built from the distro config.
Kernel tag: v5.7-rc7.
I built 3 kernels based on the distro config:
a) v5.7-rc7-base - default kernel build (no uclamp)
b) v5.7-rc7-ucl-tsk - base kernel + CONFIG_UCLAMP_TASK
c) v5.7-rc7-ucl-tsk-grp - base kernel + CONFIG_UCLAMP_TASK & CONFIG_UCLAMP_TASK_GROUP

2. Experiments

I have been using mmtests with the configuration you recommended.
I stressed the system in different scenarios to check whether a
regression can be observed and under what circumstances.
The descriptions below cover these different angles of attack during
mmtests: with or without NUMA pinning, with or without perf, with
tracing, etc.
I have also looked more closely at the suspected functions,
activate_task and deactivate_task, which appear in the experiment
descriptions.

2.1. Experiment with netperf and two kernels

These tests were run without forced numactl settings (all CPUs
allowed). As can be seen, the kernel with uclamp task performs worse
for UDP, but somehow better for TCP.

UDP tests results:
netperf-udp
                          ./v5.7-rc7-base       ./v5.7-rc7-ucl-tsk
Hmean     send-64          62.15 (   0.00%)       59.65 *  -4.02%*
Hmean     send-128        122.88 (   0.00%)      119.37 *  -2.85%*
Hmean     send-256        244.85 (   0.00%)      234.26 *  -4.32%*
Hmean     send-1024       919.24 (   0.00%)      880.67 *  -4.20%*
Hmean     send-2048      1689.45 (   0.00%)     1647.54 *  -2.48%*
Hmean     send-3312      2542.36 (   0.00%)     2485.23 *  -2.25%*
Hmean     send-4096      2935.69 (   0.00%)     2861.09 *  -2.54%*
Hmean     send-8192      4800.35 (   0.00%)     4680.09 *  -2.51%*
Hmean     send-16384     7473.66 (   0.00%)     7349.60 *  -1.66%*
Hmean     recv-64          62.15 (   0.00%)       59.65 *  -4.03%*
Hmean     recv-128        122.88 (   0.00%)      119.37 *  -2.85%*
Hmean     recv-256        244.84 (   0.00%)      234.26 *  -4.32%*
Hmean     recv-1024       919.24 (   0.00%)      880.67 *  -4.20%*
Hmean     recv-2048      1689.44 (   0.00%)     1647.54 *  -2.48%*
Hmean     recv-3312      2542.36 (   0.00%)     2485.23 *  -2.25%*
Hmean     recv-4096      2935.69 (   0.00%)     2861.09 *  -2.54%*
Hmean     recv-8192      4800.35 (   0.00%)     4678.15 *  -2.55%*
Hmean     recv-16384     7473.63 (   0.00%)     7349.52 *  -1.66%*

TCP test results:
netperf-tcp
                       ./v5.7-rc7-base    ./v5.7-rc7-ucl-tsk
Hmean     64         756.44 (   0.00%)      881.17 *  16.49%*
Hmean     128       1425.09 (   0.00%)     1558.70 *   9.38%*
Hmean     256       2292.65 (   0.00%)     2508.72 *   9.42%*
Hmean     1024      5068.70 (   0.00%)     5612.17 *  10.72%*
Hmean     2048      6506.81 (   0.00%)     6739.87 *   3.58%*
Hmean     3312      7232.42 (   0.00%)     7735.86 *   6.96%*
Hmean     4096      7597.95 (   0.00%)     7698.76 *   1.33%*
Hmean     8192      8402.80 (   0.00%)     8540.36 *   1.64%*
Hmean     16384     8841.60 (   0.00%)     9068.70 *   2.57%*
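
For anyone reading the tables above for the first time: the percentage
next to each Hmean value appears to be the relative difference against
the baseline column, where a positive number means higher throughput.
Below is a minimal sketch of that arithmetic. The function name
relative_diff and the per-iteration samples are hypothetical and only
illustrate the calculation; this is not the mmtests code itself.

```python
from statistics import harmonic_mean


def relative_diff(baseline: float, candidate: float) -> float:
    """Relative difference of candidate vs baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Hypothetical per-iteration throughput samples (not measured data),
# only to show how one Hmean cell would be computed from several runs.
samples = [61.0, 62.0, 63.5]
hmean = harmonic_mean(samples)

# Values taken from the tables above: UDP send-64 and TCP 64.
udp_send_64 = relative_diff(62.15, 59.65)
tcp_64 = relative_diff(756.44, 881.17)

print(f"Hmean of samples: {hmean:.2f}")
print(f"UDP send-64: {udp_send_64:+.2f}%")
print(f"TCP 64:      {tcp_64:+.2f}%")
```

With the table values this reproduces the reported -4.02% for UDP
send-64 and +16.49% for TCP 64.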

Using perf on a similar workload:
The perf difference in activate_task and deactivate_task is not
small.
v5.7-rc7-base
     0.62%  netperf          [kernel.kallsyms]        [k] activate_task
     0.06%  netserver        [kernel.kallsyms]        [k] deactivate_task

v5.7-rc7-ucl-tsk
     3.43%  netperf          [kernel.kallsyms]        [k] activate_task
     2.39%  netserver        [kernel.kallsyms]        [k] deactivate_task

It's a starting point, just to align with others who also see some
regression.

2.2. Experiment with many tests of a single netperf-udp 64B and tracing

I have tried to measure the suspected functions, which have been
mentioned many times. Here are the measurements for 'activate_task' and
'deactivate_task':
number of hits, total computation time, and average time of one call.
These values were captured during a single netperf-udp 64B test,
repeated many times. The tables below show processed statistics for
experiments conducted with 3 different kernels. The number of times the
test was repeated on each kernel is shown in the row called 'count'.
This is the output of the pandas DataFrame describe() function. In case
of confusion with the labels, please check the web for some
tutorials.
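
As a rough illustration of how tables in this format can be produced,
here is a minimal sketch. It assumes the per-run totals for one function
have already been collected into a pandas DataFrame with one row per
test run. The numbers are made up for the example and are not measured
data, and the s^2_us column is left out for brevity.

```python
import pandas as pd

# One row per test run for a single function (made-up values,
# not taken from the measurements in this mail).
runs = pd.DataFrame(
    {
        "Hit": [100, 400, 900, 3000, 50000],
        "Time_us": [150.0, 500.0, 800.0, 1500.0, 40000.0],
    }
)

# Average time of one call within each run.
runs["Avg_us"] = runs["Time_us"] / runs["Hit"]

# Request the same percentile rows as in the tables.
summary = runs.describe(percentiles=[0.5, 0.75, 0.9, 0.95, 0.99])
print(summary.round(2))
```

The 'count' row is the number of runs, and the percentage rows are
percentiles computed across runs, so a large gap between 90% and 99%
means that only a few runs had very high totals.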

stats: fprof.base (basic kernel v5.7-rc7 nouclamp)
activate_task
               Hit    Time_us  Avg_us  s^2_us
count       138.00     138.00  138.00  138.00
mean     20,387.44  14,587.33    1.15    0.53
std     114,980.19  81,427.51    0.42    0.23
min         110.00     181.68    0.32    0.00
50%         411.00     461.55    1.32    0.54
75%         881.75     760.08    1.47    0.66
90%       2,885.60   1,302.03    1.61    0.80
95%      55,318.05  41,273.41    1.66    0.92
99%     501,660.04 358,939.04    1.77    1.09
max   1,131,457.00 798,097.30    1.80    1.42
deactivate_task
               Hit    Time_us  Avg_us  s^2_us
count       138.00     138.00  138.00  138.00
mean     81,828.83  39,991.61    0.81    0.28
std     260,130.01 126,386.89    0.28    0.14
min          97.00      92.35    0.26    0.00
50%         424.00     340.35    0.94    0.30
75%       1,062.25     684.98    1.05    0.37
90%     330,657.50 168,320.94    1.11    0.46
95%     748,920.70 359,498.23    1.15    0.51
99%   1,094,614.76 528,459.50    1.21    0.56
max   1,630,473.00 789,476.50    1.25    0.60

stats: fprof.uclamp_tsk (kernel v5.7-rc7 + uclamp tasks)
activate_task
               Hit      Time_us  Avg_us  s^2_us
count       113.00       113.00  113.00  113.00
mean     23,006.46    24,133.29    1.36    0.64
std     161,171.74   170,299.61    0.45    0.24
min          98.00       173.13    0.44    0.08
50%         369.00       575.96    1.55    0.62
75%         894.00       883.71    1.69    0.74
90%       1,941.20     1,221.70    1.77    0.90
95%       3,187.40     1,627.21    1.85    1.14
99%     431,604.88   437,291.66    1.92    1.35
max   1,631,657.00 1,729,488.00    2.16    1.35
deactivate_task
               Hit      Time_us  Avg_us  s^2_us
count       113.00       113.00  113.00  113.00
mean    108,067.93    86,020.56    1.00    0.35
std     310,429.35   246,938.68    0.33    0.15
min          89.00       102.46    0.33    0.00
50%         430.00       495.87    1.14    0.35
75%       1,361.00       823.63    1.24    0.44
90%     437,528.40   345,051.10    1.34    0.53
95%     886,978.60   696,796.74    1.40    0.58
99%   1,345,052.40 1,086,567.76    1.44    0.68
max   1,391,534.00 1,116,053.00    1.63    0.80

stats: fprof.uclamp_tsk_grp (kernel v5.7-rc7 + uclamp tasks + uclamp task group)
activate_task
               Hit      Time_us  Avg_us  s^2_us
count       273.00       273.00  273.00  273.00
mean     15,958.34    16,471.84    1.58    0.67
std     105,096.88   108,322.03    0.43    0.32
min           3.00         4.96    0.41    0.00
50%         245.00       400.23    1.70    0.64
75%         384.00       565.53    1.85    0.78
90%       1,602.00     1,069.08    1.95    0.95
95%       3,403.00     1,573.74    2.01    1.13
99%     589,484.56   604,992.57    2.11    1.75
max   1,035,866.00 1,096,975.00    2.40    3.08
deactivate_task
               Hit      Time_us  Avg_us  s^2_us
count       273.00       273.00  273.00  273.00
mean     94,607.02    63,433.12    1.02    0.34
std     325,130.91   216,844.92    0.28    0.16
min           2.00         2.79    0.29    0.00
50%         244.00       291.49    1.11    0.36
75%         496.00       448.72    1.19    0.43
90%     120,304.60    82,964.94    1.25    0.55
95%     945,480.60   626,793.58    1.33    0.60
99%   1,485,959.96 1,010,615.72    1.40    0.68
max   2,120,682.00 1,403,280.00    1.80    1.11

As you can see, the data is distributed differently, with
higher 'Hit' and 'Time_us' values at around the 0.95 quantile for
kernels with uclamp.

2.3. Experiment forcing test tasks to run in the same NUMA node

This experiment checks whether forcing all test tasks onto a single
NUMA node makes a difference.

netperf-udp
                          ./v5.7-rc7             ./v5.7-rc7             ./v5.7-rc7
                          base-numa0          ucl-tsk-numa0      ucl-tsk-grp-numa0
Hmean     send-64          60.99 (   0.00%)       61.19 *   0.32%*       64.58 *   5.88%*
Hmean     send-128        121.92 (   0.00%)      121.37 *  -0.45%*      128.26 *   5.20%*
Hmean     send-256        240.74 (   0.00%)      240.87 *   0.06%*      253.86 *   5.45%*
Hmean     send-1024       905.17 (   0.00%)      908.43 *   0.36%*      955.59 *   5.57%*
Hmean     send-2048      1669.18 (   0.00%)     1681.30 *   0.73%*     1752.39 *   4.99%*
Hmean     send-3312      2496.30 (   0.00%)     2510.48 *   0.57%*     2602.42 *   4.25%*
Hmean     send-4096      2914.13 (   0.00%)     2932.19 *   0.62%*     3028.83 *   3.94%*
Hmean     send-8192      4744.81 (   0.00%)     4762.90 *   0.38%*     4916.24 *   3.61%*
Hmean     send-16384     7489.47 (   0.00%)     7514.17 *   0.33%*     7570.39 *   1.08%*
Hmean     recv-64          60.98 (   0.00%)       61.18 *   0.34%*       64.54 *   5.85%*
Hmean     recv-128        121.86 (   0.00%)      121.29 *  -0.47%*      128.26 *   5.26%*
Hmean     recv-256        240.65 (   0.00%)      240.79 *   0.06%*      253.74 *   5.44%*
Hmean     recv-1024       904.65 (   0.00%)      908.20 *   0.39%*      955.58 *   5.63%*
Hmean     recv-2048      1669.18 (   0.00%)     1680.89 *   0.70%*     1752.39 *   4.99%*
Hmean     recv-3312      2495.08 (   0.00%)     2509.68 *   0.59%*     2601.31 *   4.26%*
Hmean     recv-4096      2911.66 (   0.00%)     2931.46 *   0.68%*     3028.83 *   4.02%*
Hmean     recv-8192      4738.70 (   0.00%)     4762.27 *   0.50%*     4911.90 *   3.66%*
Hmean     recv-16384     7485.81 (   0.00%)     7513.41 *   0.37%*     7569.91 *   1.12%*

netperf-tcp
                     ./v5.7-rc7             ./v5.7-rc7             ./v5.7-rc7
                     base-numa0          ucl-tsk-numa0      ucl-tsk-grp-numa0
Hmean     64         762.29 (   0.00%)      826.48 *   8.42%*      768.86 *   0.86%*
Hmean     128       1418.94 (   0.00%)     1573.76 *  10.91%*     1444.04 *   1.77%*
Hmean     256       2302.76 (   0.00%)     2518.75 *   9.38%*     2315.00 *   0.53%*
Hmean     1024      5076.92 (   0.00%)     5351.65 *   5.41%*     5061.19 *  -0.31%*
Hmean     2048      6493.42 (   0.00%)     6645.99 *   2.35%*     6493.79 *   0.01%*
Hmean     3312      7229.76 (   0.00%)     7373.29 *   1.99%*     7208.45 *  -0.29%*
Hmean     4096      7604.00 (   0.00%)     7656.45 *   0.69%*     7574.14 *  -0.39%*
Hmean     8192      8456.24 (   0.00%)     8495.95 *   0.47%*     8387.04 *  -0.82%*
Hmean     16384     8835.74 (   0.00%)     8775.17 *  -0.69%*     8837.48 *   0.02%*

Perf values of the suspected functions for each kernel, for a test
similar to the one above (pinned to NUMA node 0), show higher overhead
in these functions with uclamp, as usual.
 base
     0.57%  netperf          [kernel.kallsyms]        [k] activate_task
     0.11%  netserver        [kernel.kallsyms]        [k] deactivate_task
 ucl-tsk
     3.44%  netperf          [kernel.kallsyms]        [k] activate_task
     2.49%  netserver        [kernel.kallsyms]        [k] deactivate_task
 ucl-tsk-grp
     2.47%  netperf          [kernel.kallsyms]        [k] activate_task
     1.30%  netserver        [kernel.kallsyms]        [k] deactivate_task

This shows there is more work in the related functions, but somehow the
machine is able to handle it, and the performance results are even
better with uclamp.

2.4. Experiment with one netperf-udp and perf tool

Repeating the netperf-udp 64B experiment a few times, one test run
each, with the base kernel vs the uclamp task group kernel, I could
observe the following in perf:
87bln vs 100bln cycles
~0.8-0.9k vs ~2.6M context-switches
~73bln vs 76-77bln instr
task-clock stays the same: ~48s
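
To put these counters into perspective, here is a small back-of-envelope
sketch using only the figures quoted above. Taking the midpoint of each
quoted range is an assumption, and attributing all of the extra cycles
and instructions to the extra context switches is a deliberate
simplification, so treat the output as an order-of-magnitude estimate.

```python
# Counters quoted above; midpoints of the quoted ranges are an assumption.
base = {"cycles": 87e9, "ctx_switches": 0.85e3, "instructions": 73e9}
uclamp = {"cycles": 100e9, "ctx_switches": 2.6e6, "instructions": 76.5e9}

# Difference between the uclamp task group kernel and the base kernel.
extra = {key: uclamp[key] - base[key] for key in base}

# Simplification: charge every extra cycle and instruction
# to the extra context switches.
cycles_per_switch = extra["cycles"] / extra["ctx_switches"]
instr_per_switch = extra["instructions"] / extra["ctx_switches"]

# Instructions per cycle for each kernel.
ipc_base = base["instructions"] / base["cycles"]
ipc_uclamp = uclamp["instructions"] / uclamp["cycles"]

print(f"extra cycles per extra context switch: {cycles_per_switch:,.0f}")
print(f"extra instr per extra context switch:  {instr_per_switch:,.0f}")
print(f"IPC base: {ipc_base:.2f}, IPC uclamp: {ipc_uclamp:.2f}")
```

Under these assumptions each extra context switch costs roughly 5,000
cycles, and the instructions-per-cycle ratio drops with uclamp.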

2.5. Ubuntu server and distro kernel experiments

Here are some results from a different distro, to check whether the
regression can be observed there as well.
This experiment uses a different kernel and a different distro,
Ubuntu Server 18.04, but the same machine.
The results for the kernel with uclamp task + task group (last column)
might look really bad.
After processing the results from experiment 2.2, I convinced myself
that I might just have hit a worse use case during these 5 iterations of
the 'netperf-udp send-128' test: very bad task bouncing.
Apart from that, worse performance results can be observed in general.

                 ./v5.6-custom-nouclamp      ./v5.6-custom-uct     ./v5.6-custom-uctg
Hmean     send-64          99.43 (   0.00%)       94.40 *  -5.06%*       90.19 *  -9.29%*
Hmean     send-128        198.81 (   0.00%)      180.91 *  -9.01%*      137.80 * -30.69%*
Hmean     send-256        393.12 (   0.00%)      341.89 * -13.03%*      332.72 * -15.36%*
Hmean     send-1024      1052.48 (   0.00%)      961.17 *  -8.68%*      961.64 *  -8.63%*
Hmean     send-2048      1935.68 (   0.00%)     1803.86 *  -6.81%*     1755.36 *  -9.32%*
Hmean     send-3312      2983.04 (   0.00%)     2806.50 *  -5.92%*     2802.44 *  -6.05%*
Hmean     send-4096      3558.37 (   0.00%)     3348.70 *  -5.89%*     3373.92 *  -5.18%*
Hmean     send-8192      5335.23 (   0.00%)     5227.89 *  -2.01%*     5277.22 *  -1.09%*
Hmean     send-16384     7552.66 (   0.00%)     7374.27 *  -2.36%*     7388.90 *  -2.17%*

3. Some hypotheses and summary

These 1.5M extra ctx-switches might cause +3-4bln instr,
which could consume an extra 13bln cycles.
Tasks are jumping across the CPUs more often.
Context switches happen more frequently.
The functions 'activate_task' and 'deactivate_task' have a worse
total hit count or total computation time in the same netperf-udp test.
This also makes their average time worse. It might be because of the
pressure on caches and branch prediction. Surprisingly, the machine can
handle a higher amount of task bouncing when the tasks are pinned to a
single NUMA node.

I hope this helps you investigate the issue further and find a
solution. IMHO, having this uclamp option behind a static key is a good
idea.
Thank you Mel for your help with my machine configuration and setup.

Regards,
Lukasz Luba




