On Fri, 19 Jul 2013 16:55:22 -0400 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:

> The way the page allocator interacts with kswapd creates aging
> imbalances, where the amount of time a userspace page gets in memory
> under reclaim pressure is dependent on which zone, which node the
> allocator took the page frame from.
>
> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>    nodes falling behind for a full reclaim cycle relative to the other
>    nodes in the system
>
> #3 fixes an interaction where kswapd and a continuous stream of page
>    allocations keep the preferred zone of a task between the high and
>    low watermark (allocations succeed + kswapd does not go to sleep)
>    indefinitely, completely underutilizing the lower zones and
>    thrashing on the preferred zone
>
> These patches are the aging fairness part of the thrash-detection
> based file LRU balancing.  Andrea recommended to submit them
> separately as they are bugfixes in their own right.
>
> The following test ran a foreground workload (memcachetest) with
> background IO of various sizes on a 4 node 8G system (similar results
> were observed with single-node 4G systems):
>
> parallelio
>                               BASE            FAIRALLOC
> Ops memcachetest-0M        5170.00 (  0.00%)     5283.00 (  2.19%)
> Ops memcachetest-791M      4740.00 (  0.00%)     5293.00 ( 11.67%)
> Ops memcachetest-2639M     2551.00 (  0.00%)     4950.00 ( 94.04%)
> Ops memcachetest-4487M     2606.00 (  0.00%)     3922.00 ( 50.50%)
> Ops io-duration-0M            0.00 (  0.00%)        0.00 (  0.00%)
> Ops io-duration-791M         55.00 (  0.00%)       18.00 ( 67.27%)
> Ops io-duration-2639M       235.00 (  0.00%)      103.00 ( 56.17%)
> Ops io-duration-4487M       278.00 (  0.00%)      173.00 ( 37.77%)
> Ops swaptotal-0M              0.00 (  0.00%)        0.00 (  0.00%)
> Ops swaptotal-791M       245184.00 (  0.00%)        0.00 (  0.00%)
> Ops swaptotal-2639M      468069.00 (  0.00%)   108778.00 ( 76.76%)
> Ops swaptotal-4487M      452529.00 (  0.00%)    76623.00 ( 83.07%)
> Ops swapin-0M                 0.00 (  0.00%)        0.00 (  0.00%)
> Ops swapin-791M          108297.00 (  0.00%)        0.00 (  0.00%)
> Ops swapin-2639M         169537.00 (  0.00%)    50031.00 ( 70.49%)
> Ops swapin-4487M         167435.00 (  0.00%)    34178.00 ( 79.59%)
> Ops minorfaults-0M      1518666.00 (  0.00%)  1503993.00 (  0.97%)
> Ops minorfaults-791M    1676963.00 (  0.00%)  1520115.00 (  9.35%)
> Ops minorfaults-2639M   1606035.00 (  0.00%)  1799717.00 (-12.06%)
> Ops minorfaults-4487M   1612118.00 (  0.00%)  1583825.00 (  1.76%)
> Ops majorfaults-0M            6.00 (  0.00%)        0.00 (  0.00%)
> Ops majorfaults-791M      13836.00 (  0.00%)       10.00 ( 99.93%)
> Ops majorfaults-2639M     22307.00 (  0.00%)     6490.00 ( 70.91%)
> Ops majorfaults-4487M     21631.00 (  0.00%)     4380.00 ( 79.75%)

A reminder whether positive numbers are good or bad would be useful ;)

>                 BASE   FAIRALLOC
> User          287.78      460.97
> System       2151.67     3142.51
> Elapsed      9737.00     8879.34

Confused.  Why would the amount of user time increase so much?

And that's a tremendous increase in system time.  Am I interpreting
this correctly?

>                                       BASE     FAIRALLOC
> Minor Faults                      53721925      57188551
> Major Faults                        392195         15157
> Swap Ins                           2994854        112770
> Swap Outs                          4907092        134982
> Direct pages scanned                     0         41824
> Kswapd pages scanned              32975063       8128269
> Kswapd pages reclaimed             6323069       7093495
> Direct pages reclaimed                   0         41824
> Kswapd efficiency                      19%           87%
> Kswapd velocity                   3386.573       915.414
> Direct efficiency                     100%          100%
> Direct velocity                      0.000         4.710
> Percentage direct scans                 0%            0%
> Zone normal velocity              2011.338       550.661
> Zone dma32 velocity               1365.623       369.221
> Zone dma velocity                    9.612         0.242
> Page writes by reclaim        18732404.000    614807.000
> Page writes file                  13825312        479825
> Page writes anon                   4907092        134982
> Page reclaim immediate               85490          5647
> Sector Reads                      12080532        483244
> Sector Writes                     88740508      65438876
> Page rescued immediate                   0             0
> Slabs scanned                        82560         12160
> Direct inode steals                      0             0
> Kswapd inode steals                  24401         40013
> Kswapd skipped wait                      0             0
> THP fault alloc                          6             8
> THP collapse alloc                    5481          5812
> THP splits                              75            22
> THP fault fallback                       0             0
> THP collapse fail                        0             0
> Compaction stalls                        0            54
> Compaction success                       0            45
> Compaction failures                      0             9
> Page migrate success                881492         82278
> Page migrate failure                     0             0
> Compaction pages isolated                0         60334
> Compaction migrate scanned               0         53505
> Compaction free scanned                  0       1537605
> Compaction cost                        914            86
> NUMA PTE updates                  46738231      41988419
> NUMA hint faults                  31175564      24213387
> NUMA hint local faults            10427393       6411593
> NUMA pages migrated                 881492         55344
> AutoNUMA cost                       156221        121361

Some nice numbers there.

> The overall runtime was reduced, throughput for both the foreground
> workload as well as the background IO improved, major faults, swapping
> and reclaim activity shrunk significantly, reclaim efficiency more
> than quadrupled.
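
For anyone cross-checking the tables, here is a rough Python sketch that recomputes a few of the derived figures from the raw counters quoted above. The formulas are inferred from the numbers, not taken from the benchmark tooling: it assumes kswapd efficiency is pages reclaimed over pages scanned, kswapd velocity is pages scanned per second of elapsed time, and that a positive percentage in the Ops table means the FAIRALLOC kernel did better (so the sign flips for lower-is-better metrics such as io-duration and the fault counts). The helper names are made up for illustration.

```python
# Figures copied from the quoted report. The formulas below are inferred
# from the numbers and are assumptions, not the benchmark tool's own code.
BASE = {"scanned": 32975063, "reclaimed": 6323069, "elapsed": 9737.00,
        "user": 287.78, "system": 2151.67}
FAIR = {"scanned": 8128269, "reclaimed": 7093495, "elapsed": 8879.34,
        "user": 460.97, "system": 3142.51}


def kswapd_efficiency(run):
    """Assumed definition: pages reclaimed per page scanned, in percent."""
    return 100.0 * run["reclaimed"] / run["scanned"]


def kswapd_velocity(run):
    """Assumed definition: pages scanned per second of elapsed time."""
    return run["scanned"] / run["elapsed"]


def gain(base, new, higher_is_better):
    """Assumed sign convention: positive means the new kernel did better."""
    delta = new - base if higher_is_better else base - new
    return 100.0 * delta / base


for name, run in (("BASE", BASE), ("FAIRALLOC", FAIR)):
    print(f"{name:10s} efficiency {kswapd_efficiency(run):5.1f}%  "
          f"velocity {kswapd_velocity(run):9.3f}")

# Throughput: higher is better. Duration and faults: lower is better.
print(f"memcachetest-791M gain:  {gain(4740.0, 5293.0, True):.2f}%")
print(f"io-duration-791M gain:   {gain(55.0, 18.0, False):.2f}%")
print(f"minorfaults-2639M gain:  {gain(1606035.0, 1799717.0, False):.2f}%")

# CPU time ratios, FAIRALLOC relative to BASE.
print(f"user time ratio:   {FAIR['user'] / BASE['user']:.2f}x")
print(f"system time ratio: {FAIR['system'] / BASE['system']:.2f}x")
```

Under these assumptions the recomputed efficiency, velocity and percentage values line up with the quoted tables, which suggests positive percentages mean "better" regardless of the metric's direction. The last two lines only quantify the user and system time increase asked about above; they do not explain it.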