On 04/28/2016 02:57 PM, Daniel Wagner wrote:
> As one can see above in the swait_stat output, the fork() path is
> using completion. A histogram of a fork bomb (1000 forks) benchmark
> shows a slight performance drop of 4%.
>
> [wagi@handman completion-test-5 (master)]$ cat forky-4.6.0-rc4.txt | perl histo -min 0.12 -max 0.20 -int 0.01 -stars -scale 10
> # NumSamples = 1000; Max = 0.208; Min = 0.123
> # Mean = 0.146406; Variance = 0.000275351163999956; SD = 0.0165937085668019
> # Each * represents a count of 10
> 0.1200 - 0.1300 [   113]: ************
> 0.1300 - 0.1400 [   324]: *********************************
> 0.1400 - 0.1500 [   219]: **********************
> 0.1500 - 0.1600 [   139]: **************
> 0.1600 - 0.1700 [    94]: **********
> 0.1700 - 0.1800 [    54]: ******
> 0.1800 - 0.1900 [    37]: ****
> 0.1900 - 0.2000 [    18]: **
>
> [wagi@handman completion-test-5 (master)]$ cat forky-4.6.0-rc4-00001-g0a16067.txt | perl histo -min 0.12 -max 0.20 -int 0.01 -stars -scale 10
> # NumSamples = 1000; Max = 0.207; Min = 0.121
> # Mean = 0.152056; Variance = 0.000295474863999994; SD = 0.0171893823042014
> # Each * represents a count of 10
> 0.1200 - 0.1300 [    17]: **
> 0.1300 - 0.1400 [   282]: *****************************
> 0.1400 - 0.1500 [   240]: ************************
> 0.1500 - 0.1600 [   158]: ****************
> 0.1600 - 0.1700 [   114]: ************
> 0.1700 - 0.1800 [    94]: **********
> 0.1800 - 0.1900 [    66]: *******
> 0.1900 - 0.2000 [    25]: ***
> 0.2000 - 0.2100 [     1]: *

I redid the above test and changed my fork bomb to this:

	for (i = 0; i < MAX_CHILDREN; i++) {
		switch (fork()) {
		case -1:
			exit(1);
		case 0:
			_exit(0);
		}
	}

	for (i = 0; i < MAX_CHILDREN; i++) {
		do {
			pid = waitpid(-1, &status, WUNTRACED);
			if (pid < 0 && errno != ECHILD)
				exit(1);
		} while (!WIFEXITED(status) && !WIFSIGNALED(status));
	}

Obviously, fork is not a very good benchmark since we might end up in
memory allocation paths etc.
The distribution I get from the baseline and this patch look very similar:

[wagi@handman completion (master)]$ cat results/forky-4.6.0-rc4.txt | perl histo -min 0.09 -max 0.11 -int 0.001 -stars -scale 100
0.0910 - 0.0920 [     3]: *
0.0920 - 0.0930 [     8]: *
0.0930 - 0.0940 [    52]: *
0.0940 - 0.0950 [   404]: *****
0.0950 - 0.0960 [  1741]: ******************
0.0960 - 0.0970 [  2221]: ***********************
0.0970 - 0.0980 [  1612]: *****************
0.0980 - 0.0990 [  1346]: **************
0.0990 - 0.1000 [  1223]: *************
0.1000 - 0.1010 [   724]: ********
0.1010 - 0.1020 [   362]: ****
0.1020 - 0.1030 [   186]: **
0.1030 - 0.1040 [    71]: *
0.1040 - 0.1050 [    29]: *
0.1050 - 0.1060 [    12]: *
0.1060 - 0.1070 [     4]: *
0.1080 - 0.1090 [     2]: *

[wagi@handman completion (master)]$ cat results/forky-4.6.0-rc4-00001-gc4c770c.txt | perl histo -min 0.09 -max 0.11 -int 0.001 -stars -scale 100
0.0930 - 0.0940 [     3]: *
0.0940 - 0.0950 [     9]: *
0.0950 - 0.0960 [    25]: *
0.0960 - 0.0970 [    77]: *
0.0970 - 0.0980 [   324]: ****
0.0980 - 0.0990 [  1503]: ****************
0.0990 - 0.1000 [  2247]: ***********************
0.1000 - 0.1010 [  1708]: ******************
0.1010 - 0.1020 [  1486]: ***************
0.1020 - 0.1030 [  1215]: *************
0.1030 - 0.1040 [   729]: ********
0.1040 - 0.1050 [   368]: ****
0.1050 - 0.1060 [   197]: **
0.1060 - 0.1070 [    65]: *
0.1070 - 0.1080 [    32]: *
0.1080 - 0.1090 [     7]: *
0.1090 - 0.1100 [     2]: *

A t-test (which determines whether two sets of data are significantly
different) returns a p value of 0 (< 1%), so we reject the null
hypothesis of equal averages. That means we have a 0.3% decrease in
performance compared with the baseline.

> Compiling a kernel 100 times results in the following statistics
> gathered by 'time make -j200'
>
> user
>                                     mean         std          var        max        min
> kernbech-4.6.0-rc4                 9.126      0.2919      0.08523       9.92       8.55
> kernbech-4.6.0-rc4-00001-g0...      9.24 -1.25%  0.2768 5.17%  0.07664 10.07%  10.11 -1.92%  8.44 1.29%
>
> system
>                                     mean         std          var        max        min
> kernbech-4.6.0-rc4             1.676e+03       2.409        5.804  1.681e+03  1.666e+03
> kernbech-4.6.0-rc4-00001-g0... 1.675e+03 0.07%  2.433 -1.01%  5.922 -2.03%  1.682e+03 -0.03%  1.67e+03 -0.20%
>
> elapsed
>                                     mean         std          var        max        min
> kernbech-4.6.0-rc4             2.303e+03       26.67        711.1  2.357e+03  2.232e+03
> kernbech-4.6.0-rc4-00001-g0... 2.298e+03 0.23%  28.75 -7.83%  826.8 -16.26%  2.348e+03 0.38%  2.221e+03 0.49%
>
> CPU
>                                     mean         std          var        max        min
> kernbech-4.6.0-rc4             4.418e+03        48.9    2.391e+03  4.565e+03  4.347e+03
> kernbech-4.6.0-rc4-00001-g0... 4.424e+03 -0.15%  55.73 -13.98%  3.106e+03 -29.90%  4.572e+03 -0.15%  4.356e+03 -0.21%
>
> While the mean is slightly lower, the var and std increase quite
> noticeably.

The idea behind doing the kernel builds is that I wanted to see whether
there is an observable impact from a real workload. The above numbers
are hard to interpret, though if you only look at the elapsed time you
see it takes slightly longer.

I repeated this test with 500 runs and the numbers I get are the same
as above. So at least it is a consistent and repeatable experiment.

Obviously, I tried to micro benchmark what's going on, but so far I
haven't had any luck.
A kernel module which has two threads doing a ping-pong completion
test. A typical trace looks like this:

 trigger-2376  [000]   218.982609: sched_waking:         comm=waiter/0 pid=2375 prio=120 target_cpu=000
 trigger-2376  [000]   218.982609: sched_stat_runtime:   comm=trigger pid=2376 runtime=1355 [ns] vruntime=40692621118 [ns]
 trigger-2376  [000]   218.982609: sched_wakeup:         waiter/0:2375 [120] success=1 CPU:000
 trigger-2376  [000]   218.982610: rcu_utilization:      Start context switch
 trigger-2376  [000]   218.982610: rcu_utilization:      End context switch
 trigger-2376  [000]   218.982610: sched_stat_runtime:   comm=trigger pid=2376 runtime=1072 [ns] vruntime=40692622190 [ns]
 trigger-2376  [000]   218.982611: sched_switch:         trigger:2376 [120] S ==> waiter/0:2375 [120]
waiter/0-2375  [000]   218.982611: latency_complete:     latency=2285
waiter/0-2375  [000]   218.982611: sched_waking:         comm=trigger pid=2376 prio=120 target_cpu=000
waiter/0-2375  [000]   218.982611: sched_stat_runtime:   comm=waiter/0 pid=2375 runtime=1217 [ns] vruntime=40692622747 [ns]
waiter/0-2375  [000]   218.982612: sched_wakeup:         trigger:2376 [120] success=1 CPU:000
waiter/0-2375  [000]   218.982612: rcu_utilization:      Start context switch
waiter/0-2375  [000]   218.982612: rcu_utilization:      End context switch
waiter/0-2375  [000]   218.982612: sched_stat_runtime:   comm=waiter/0 pid=2375 runtime=1099 [ns] vruntime=40692623846 [ns]
waiter/0-2375  [000]   218.982613: sched_switch:         waiter/0:2375 [120] S ==> trigger:2376 [120]

I have plotted the latency_complete events (the time it takes from
complete() until the waiter is running):

https://www.monom.org/data/completion/completion-latency.png

The stats for the above plot are:

[wagi@handman results (master)]$ csvstat-3 completion-latency-4.6.0-rc4.txt
  1.
805
	<class 'int'>
	Nulls: False
	Min: 643
	Max: 351709
	Sum: 3396063015
	Mean: 715.6573082933786
	Median: 706.0
	Standard Deviation: 385.24467795803787
	Unique values: 4662
	5 most frequent values:
		697:	121547
		703:	120730
		693:	112609
		699:	112543
		701:	112370

Row count: 4745376

[wagi@handman results (master)]$ csvstat-3 completion-latency-4.6.0-rc4-00001-gc4c770c.txt
  1. 4949
	<class 'int'>
	Nulls: False
	Min: 660
	Max: 376049
	Sum: 3417112614
	Mean: 710.0990997187752
	Median: 696
	Standard Deviation: 500.7461712849926
	Unique values: 4930
	5 most frequent values:
		693:	188698
		689:	165564
		692:	158333
		688:	156896
		684:	155032

Row count: 4812163

In short, I haven't figured out yet why the kernel builds get slightly
slower. The first idea, that the fork path is the problem, is not
provable with the fork bomb, at least if it is executed in a tight
loop.

cheers,
daniel

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html