On Fri, Jul 26, 2013 at 03:45:33PM -0700, Andrew Morton wrote: > On Fri, 19 Jul 2013 16:55:22 -0400 Johannes Weiner <hannes@xxxxxxxxxxx> wrote: > > > The way the page allocator interacts with kswapd creates aging > > imbalances, where the amount of time a userspace page gets in memory > > under reclaim pressure is dependent on which zone, which node the > > allocator took the page frame from. > > > > #1 fixes missed kswapd wakeups on NUMA systems, which lead to some > > nodes falling behind for a full reclaim cycle relative to the other > > nodes in the system > > > > #3 fixes an interaction where kswapd and a continuous stream of page > > allocations keep the preferred zone of a task between the high and > > low watermark (allocations succeed + kswapd does not go to sleep) > > indefinitely, completely underutilizing the lower zones and > > thrashing on the preferred zone > > > > These patches are the aging fairness part of the thrash-detection > > based file LRU balancing. Andrea recommended to submit them > > separately as they are bugfixes in their own right. > > > > The following test ran a foreground workload (memcachetest) with > > background IO of various sizes on a 4 node 8G system (similar results > > were observed with single-node 4G systems): > > > > parallelio > > BAS FAIRALLO > > BASE FAIRALLOC > > Ops memcachetest-0M 5170.00 ( 0.00%) 5283.00 ( 2.19%) > > Ops memcachetest-791M 4740.00 ( 0.00%) 5293.00 ( 11.67%) > > Ops memcachetest-2639M 2551.00 ( 0.00%) 4950.00 ( 94.04%) > > Ops memcachetest-4487M 2606.00 ( 0.00%) 3922.00 ( 50.50%) > > Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) > > Ops io-duration-791M 55.00 ( 0.00%) 18.00 ( 67.27%) > > Ops io-duration-2639M 235.00 ( 0.00%) 103.00 ( 56.17%) > > Ops io-duration-4487M 278.00 ( 0.00%) 173.00 ( 37.77%) > > Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) > > Ops swaptotal-791M 245184.00 ( 0.00%) 0.00 ( 0.00%) > > Ops swaptotal-2639M 468069.00 ( 0.00%) 108778.00 ( 76.76%) > > Ops swaptotal-4487M 452529.00 ( 0.00%) 76623.00 ( 83.07%) > > Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) > > Ops swapin-791M 108297.00 ( 0.00%) 0.00 ( 0.00%) > > Ops swapin-2639M 169537.00 ( 0.00%) 50031.00 ( 70.49%) > > Ops swapin-4487M 167435.00 ( 0.00%) 34178.00 ( 79.59%) > > Ops minorfaults-0M 1518666.00 ( 0.00%) 1503993.00 ( 0.97%) > > Ops minorfaults-791M 1676963.00 ( 0.00%) 1520115.00 ( 9.35%) > > Ops minorfaults-2639M 1606035.00 ( 0.00%) 1799717.00 (-12.06%) > > Ops minorfaults-4487M 1612118.00 ( 0.00%) 1583825.00 ( 1.76%) > > Ops majorfaults-0M 6.00 ( 0.00%) 0.00 ( 0.00%) > > Ops majorfaults-791M 13836.00 ( 0.00%) 10.00 ( 99.93%) > > Ops majorfaults-2639M 22307.00 ( 0.00%) 6490.00 ( 70.91%) > > Ops majorfaults-4487M 21631.00 ( 0.00%) 4380.00 ( 79.75%) > > A reminder whether positive numbers are good or bad would be useful ;) It depends on the datapoint, but a positive percentage number is an improvement, a negative one a regression. > > BAS FAIRALLO > > BASE FAIRALLOC > > User 287.78 460.97 > > System 2151.67 3142.51 > > Elapsed 9737.00 8879.34 > > Confused. Why would the amount of user time increase so much? > > And that's a tremendous increase in system time. Am I interpreting > this correctly? It is because each memcachetest is running for a fixed duration (only the background IO is fixed in size). The time memcachetest previously spent waiting on major faults is now spent doing actual work (more user time, more syscalls). The number of operations memcachetest could actually perform increased. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>