On Fri, Dec 13, 2013 at 01:16:41PM -0800, Linus Torvalds wrote:
> On Fri, Dec 13, 2013 at 12:01 PM, Mel Gorman <mgorman@xxxxxxx> wrote:
> >
> > ebizzy
> >                   3.13.0-rc3             3.4.69          3.13.0-rc3          3.13.0-rc3
> > thread               vanilla            vanilla     altershift-v2r1         nowalk-v2r7
> > Mean   1   7377.91 (  0.00%)  6812.38 ( -7.67%)  7784.45 (  5.51%)  7804.08 (  5.78%)
> > Mean   2   8262.07 (  0.00%)  8276.75 (  0.18%)  9437.49 ( 14.23%)  9450.88 ( 14.39%)
> > Mean   3   7895.00 (  0.00%)  8002.84 (  1.37%)  8875.38 ( 12.42%)  8914.60 ( 12.91%)
> > Mean   4   7658.74 (  0.00%)  7824.83 (  2.17%)  8509.10 ( 11.10%)  8399.43 (  9.67%)
> > Mean   5   7275.37 (  0.00%)  7678.74 (  5.54%)  8208.94 ( 12.83%)  8197.86 ( 12.68%)
> > Mean   6   6875.50 (  0.00%)  7597.18 ( 10.50%)  7755.66 ( 12.80%)  7807.51 ( 13.56%)
> > Mean   7   6722.48 (  0.00%)  7584.75 ( 12.83%)  7456.93 ( 10.93%)  7480.74 ( 11.28%)
> > Mean   8   6559.55 (  0.00%)  7591.51 ( 15.73%)  6879.01 (  4.87%)  6881.86 (  4.91%)
>
> Hmm. Do you have any idea why 3.4.69 still seems to do better at
> higher thread counts?
>
> No complaints about this patch-series, just wondering..
>

Good question. I had insufficient data to answer it quickly and test
modifications were required to even start answering it. The following is
based on tests from a different machine that happened to complete first.

Short answer -- There appears to be a second bug where 3.13-rc3 is less
fair to threads getting time on the CPU. Sometimes this means it can
produce better benchmark results and other times worse. Which is better
depends on the workload and a bit of luck.

The long answer is incomplete and dull. First, the cost of the affected
paths *appears* to be higher in 3.13-rc3, even with the series applied,
but 3.4.69 was not necessarily better. The following are test results
based on Alex Shi's microbenchmark, which was posted around the time of
the original series. It has been slightly patched to work around a bug
where a global variable is accessed improperly by the threads, causing
it to hang. It reports the cost of accessing memory for each thread.
Presumably the cost would be higher if we were flushing TLB entries that
are currently hot. Lower values are better.

tlbflush micro benchmark
                    3.13.0-rc3          3.13.0-rc3              3.4.69
                       vanilla         nowalk-v2r7             vanilla
Min    1      7.00 (  0.00%)      6.00 ( 14.29%)      5.00 ( 28.57%)
Min    2      8.00 (  0.00%)      6.00 ( 25.00%)      4.00 ( 50.00%)
Min    3     13.00 (  0.00%)     11.00 ( 15.38%)      9.00 ( 30.77%)
Min    4     17.00 (  0.00%)     19.00 (-11.76%)     15.00 ( 11.76%)
Mean   1     11.28 (  0.00%)     10.66 (  5.48%)      5.17 ( 54.13%)
Mean   2     11.42 (  0.00%)     11.52 ( -0.85%)      9.04 ( 20.82%)
Mean   3     23.43 (  0.00%)     21.64 (  7.64%)     10.92 ( 53.39%)
Mean   4     35.33 (  0.00%)     34.17 (  3.28%)     19.55 ( 44.67%)
Range  1      6.00 (  0.00%)      7.00 (-16.67%)      4.00 ( 33.33%)
Range  2     23.00 (  0.00%)     36.00 (-56.52%)     19.00 ( 17.39%)
Range  3     15.00 (  0.00%)     17.00 (-13.33%)     10.00 ( 33.33%)
Range  4     29.00 (  0.00%)     26.00 ( 10.34%)      9.00 ( 68.97%)
Stddev 1      1.01 (  0.00%)      1.12 ( 10.53%)      0.57 (-43.70%)
Stddev 2      1.83 (  0.00%)      3.03 ( 66.06%)      6.83 (274.00%)
Stddev 3      2.82 (  0.00%)      3.28 ( 16.44%)      1.21 (-57.14%)
Stddev 4      6.65 (  0.00%)      6.32 ( -5.00%)      1.58 (-76.24%)
Max    1     13.00 (  0.00%)     13.00 (  0.00%)      9.00 ( 30.77%)
Max    2     31.00 (  0.00%)     42.00 (-35.48%)     23.00 ( 25.81%)
Max    3     28.00 (  0.00%)     28.00 (  0.00%)     19.00 ( 32.14%)
Max    4     46.00 (  0.00%)     45.00 (  2.17%)     24.00 ( 47.83%)

It runs the benchmark for a number of threads up to the number of CPUs
in the system (4 in this case). For each number of threads it runs 320
iterations. In each iteration a random range of entries between 0 and
256 is selected to be unmapped and flushed. Care is taken so there is a
good spread of the sizes selected between 0 and 256; it is meant to give
a rough estimate of the average performance.
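To make the structure concrete, each iteration is doing something along
the lines of the sketch below. This is an illustration only, not Alex's
actual code; the names, the use of MADV_DONTNEED instead of a real
munmap(), and the timing details are my own simplifications.

/*
 * Illustrative sketch only, NOT Alex Shi's benchmark. One thread times
 * how long it takes to touch every page in a region while the main
 * thread repeatedly zaps a random-sized range of it, forcing the kernel
 * to flush the TLB entries covering that range.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE_SIZE	4096
#define ENTRIES		256
#define ITERATIONS	320

static char *region;
static volatile int done;	/* simplified signalling, enough for a sketch */

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Accessor thread: report the average cost of walking the region */
static void *accessor(void *arg)
{
	long long passes = 0, total = 0;

	while (!done) {
		long long start = now_ns();

		for (int i = 0; i < ENTRIES; i++)
			region[i * PAGE_SIZE] = 1;	/* touch one byte per page */
		total += now_ns() - start;
		passes++;
	}
	if (passes)
		printf("average ns per pass: %lld\n", total / passes);
	return NULL;
}

int main(void)
{
	pthread_t tid;
	struct timespec delay = { 0, 1000000 };	/* 1ms between zaps */

	region = mmap(NULL, ENTRIES * PAGE_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return 1;
	memset(region, 1, ENTRIES * PAGE_SIZE);

	pthread_create(&tid, NULL, accessor, NULL);

	for (int i = 0; i < ITERATIONS; i++) {
		/* Zap a random number of pages. MADV_DONTNEED makes the
		 * kernel clear the PTEs and flush the TLB entries for the
		 * range, similar in effect to unmapping it. */
		int pages = 1 + rand() % ENTRIES;

		madvise(region, pages * PAGE_SIZE, MADV_DONTNEED);
		nanosleep(&delay, NULL);
	}

	done = 1;
	pthread_join(tid, NULL);
	munmap(region, ENTRIES * PAGE_SIZE);
	return 0;
}

The real benchmark also varies the number of accessing threads (1 to 4
here) and takes care that the range sizes are well spread, which the
naive rand() above does not guarantee.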
Access times were simply much better with 3.4.69 but I do not have
profiles that might tell us why. What is very interesting is the CPU
time and elapsed time for the test:

              3.13.0-rc3    3.13.0-rc3        3.4.69
                 vanilla   nowalk-v2r7       vanilla
User              179.36        165.25         97.29
System            153.59        155.07        128.32
Elapsed          1439.52       1437.69       2802.01

Note that 3.4.69 took much longer to complete the test. The duration of
the test depends on how long it takes for a thread to do the unmapping.
If the unmapping thread gets more time on the CPU, it completes the test
faster and interferes more with the other threads' performance (hence
the higher access cost), but this is not necessarily a good result. It
could indicate a fairness issue where the accessing threads are being
starved by the unmapping thread. That is not necessarily the case, it's
just one possibility.

To see what thread fairness looked like, I looked again at ebizzy. This
is the overall performance:

ebizzy
                   3.13.0-rc3           3.13.0-rc3               3.4.69
                      vanilla          nowalk-v2r7              vanilla
Mean   1   6366.88 (  0.00%)   6741.00 (  5.88%)   6658.32 (  4.58%)
Mean   2   6917.56 (  0.00%)   7952.29 ( 14.96%)   8120.79 ( 17.39%)
Mean   3   6231.78 (  0.00%)   6846.08 (  9.86%)   7174.98 ( 15.14%)
Mean   4   5887.91 (  0.00%)   6503.12 ( 10.45%)   6903.05 ( 17.24%)
Mean   5   5680.77 (  0.00%)   6185.83 (  8.89%)   6549.15 ( 15.29%)
Mean   6   5692.87 (  0.00%)   6249.48 (  9.78%)   6442.21 ( 13.16%)
Mean   7   5846.76 (  0.00%)   6344.94 (  8.52%)   6279.13 (  7.40%)
Mean   8   5974.57 (  0.00%)   6406.28 (  7.23%)   6265.29 (  4.87%)
Range  1    174.00 (  0.00%)    202.00 (-16.09%)    806.00 (-363.22%)
Range  2    286.00 (  0.00%)    979.00 (-242.31%)  1255.00 (-338.81%)
Range  3    530.00 (  0.00%)    583.00 (-10.00%)    626.00 (-18.11%)
Range  4    592.00 (  0.00%)    691.00 (-16.72%)    630.00 ( -6.42%)
Range  5    567.00 (  0.00%)    417.00 ( 26.46%)    584.00 ( -3.00%)
Range  6    588.00 (  0.00%)    353.00 ( 39.97%)    439.00 ( 25.34%)
Range  7    477.00 (  0.00%)    284.00 ( 40.46%)    343.00 ( 28.09%)
Range  8    408.00 (  0.00%)    182.00 ( 55.39%)    237.00 ( 41.91%)
Stddev 1     31.59 (  0.00%)     32.94 ( -4.27%)    154.26 (-388.34%)
Stddev 2     56.95 (  0.00%)    136.79 (-140.19%)   194.45 (-241.43%)
Stddev 3    132.28 (  0.00%)    101.02 ( 23.63%)    106.60 ( 19.41%)
Stddev 4    140.93 (  0.00%)    136.11 (  3.42%)    138.26 (  1.90%)
Stddev 5    118.58 (  0.00%)     86.74 ( 26.85%)    111.73 (  5.77%)
Stddev 6    109.64 (  0.00%)     77.49 ( 29.32%)     95.52 ( 12.87%)
Stddev 7    103.91 (  0.00%)     51.44 ( 50.50%)     54.43 ( 47.62%)
Stddev 8     67.79 (  0.00%)     31.34 ( 53.76%)     53.08 ( 21.69%)

3.4.69 is still kicking a lot of ass there even though it's slower at
higher numbers of threads in this particular test. I had hacked ebizzy
to report the performance of each thread, not just the overall result,
and worked out the difference in performance between the threads,
roughly as sketched below.
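The spread reported in the next table is the gap between the fastest and
slowest thread within a run. The per-run arithmetic is no more than the
following; this is an illustration of the calculation, not the actual
ebizzy modification, and the names are made up.

/* Given the records/s each thread managed, the spread is the gap
 * between the best and worst performer. 0 means perfect fairness. */
unsigned int thread_spread(const unsigned int *records, int nthreads)
{
	unsigned int min = records[0], max = records[0];

	for (int i = 1; i < nthreads; i++) {
		if (records[i] < min)
			min = records[i];
		if (records[i] > max)
			max = records[i];
	}
	return max - min;
}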
In a completely fair test you would expect the performance of each
thread to be identical, so the spread would be 0:

ebizzy thread spread
                  3.13.0-rc3           3.13.0-rc3               3.4.69
                     vanilla          nowalk-v2r7              vanilla
Mean   1      0.00 (  0.00%)      0.00 (  0.00%)      0.00 (  0.00%)
Mean   2      0.34 (  0.00%)      0.30 (-11.76%)      0.07 (-79.41%)
Mean   3      1.29 (  0.00%)      0.92 (-28.68%)      0.29 (-77.52%)
Mean   4      7.08 (  0.00%)     42.38 (498.59%)      0.22 (-96.89%)
Mean   5    193.54 (  0.00%)    483.41 (149.77%)      0.41 (-99.79%)
Mean   6    151.12 (  0.00%)    198.22 ( 31.17%)      0.42 (-99.72%)
Mean   7    115.38 (  0.00%)    160.29 ( 38.92%)      0.58 (-99.50%)
Mean   8    108.65 (  0.00%)    138.96 ( 27.90%)      0.44 (-99.60%)
Range  1      0.00 (  0.00%)      0.00 (  0.00%)      0.00 (  0.00%)
Range  2      5.00 (  0.00%)      6.00 ( 20.00%)      2.00 (-60.00%)
Range  3     10.00 (  0.00%)     17.00 ( 70.00%)      9.00 (-10.00%)
Range  4    256.00 (  0.00%)   1001.00 (291.02%)      5.00 (-98.05%)
Range  5    456.00 (  0.00%)   1226.00 (168.86%)      6.00 (-98.68%)
Range  6    298.00 (  0.00%)    294.00 ( -1.34%)      8.00 (-97.32%)
Range  7    192.00 (  0.00%)    220.00 ( 14.58%)      7.00 (-96.35%)
Range  8    171.00 (  0.00%)    163.00 ( -4.68%)      8.00 (-95.32%)
Stddev 1      0.00 (  0.00%)      0.00 (  0.00%)      0.00 (  0.00%)
Stddev 2      0.72 (  0.00%)      0.85 (-17.99%)      0.29 ( 59.72%)
Stddev 3      1.42 (  0.00%)      1.90 (-34.22%)      1.12 ( 21.19%)
Stddev 4     33.83 (  0.00%)    127.26 (-276.15%)     0.79 ( 97.65%)
Stddev 5     92.08 (  0.00%)    225.01 (-144.35%)     1.06 ( 98.85%)
Stddev 6     64.82 (  0.00%)     69.43 ( -7.11%)      1.28 ( 98.02%)
Stddev 7     36.66 (  0.00%)     49.19 (-34.20%)      1.18 ( 96.79%)
Stddev 8     30.79 (  0.00%)     36.23 (-17.64%)      1.06 ( 96.55%)

For example, this is saying that with 8 threads on 3.13-rc3, the
difference between the slowest and fastest thread was 171
records/second. Note how in 3.13 there are major differences between the
performance of the individual threads once there are more threads than
CPUs. The series actually makes it worse, but then again the series does
alter what happens when IPIs get sent. In comparison, 3.4.69's spreads
are very low even when there are more threads than CPUs.

So I think there is a separate bug here that was introduced some time
after 3.4.69 that has hurt scheduler fairness. It's not necessarily a
scheduler bug but it does make a test like ebizzy noisy. Because of this
bug, I'd be wary about drawing too many conclusions about ebizzy
performance when the number of threads exceeds the number of CPUs.

-- 
Mel Gorman
SUSE Labs