PATCH/RFC 00/10 Mem Policy: More Reference Counting/Fallback Fixes and Miscellaneous mempolicy cleanup

Against: 2.6.25-rc8-mm1

I think these could be merged into the -mm tree whenever it's convenient.
Note that this series depends on Mel Gorman's zonelist rework, currently
already in -mm.  Specifically, patch 8 depends on the removal of the
remote zonelist from the mempolicy struct.

This series of patches introduces a number of "cleanups" [in the eye of
the beholder, of course] to the mempolicy code, and reworks the mempolicy
reference counting, yet again, to reduce the need to take and release
reference counts in the allocation paths.  Results of some page fault
measurements and the change in code size are given below.  Summary: a
small net gain in performance overall, and a small net decrease in code
size -- both on x86_64.

Overview of patches -- see the individual patch descriptions for the
rationale:

1) basic renaming:
	mpol_free() => mpol_put()
	mpol_copy() => mpol_dup()
	'policy'   => 'mode' in struct mempolicy

2) correct fallback of shared/vma policies.

3) the aforementioned reference counting rework.

4) replacement of MPOL_DEFAULT as the system default policy 'mode' with
   MPOL_PREFERRED + local allocation, using an internal "mode flag" to
   indicate "preferred, local" instead of a negative preferred_node.
   Fewer cachelines, I think.

5) remove knowledge of mempolicy internals from shmem by moving parsing
   and formatting of the tmpfs mount option mempolicy to mempolicy.c.
   Also, replace the "naked" mempolicy mode and nodemask in the shmem
   superblock with a pointer to an allocated mempolicy.

Functional Testing:

In addition to various ad hoc memtoy tests, I used the numactl/libnuma
regression test to test these changes.  All pass.  Note that I found a
few glitches in the regression tests that result from changes in sysfs
in recent kernels.  I submitted patches to fix those to our shiny, new
numactl/libnuma maintainer.

Performance Testing:

I used an "enhanced" version of Christoph Lameter's "page fault test" to
measure the fault rate obtainable with and without these patches.  The
fault rate is an indication of the page allocation rate:  higher overhead
in page allocation results in a lower fault rate, and vice versa.  The
enhancements I made to the page fault test were all an attempt to measure
just the faults of interest and the cpu time attributable to those
faults.  The updated test is available at:

	http://free.linux.hp.com/~lts/Tools/pft-0.04.tar.gz

I ran the tests on an HP Proliant 585:  a 4-socket [= 2 numa node],
dual-core AMD x86_64 with 32G of memory.  I used a test region of 4GB
divided up between the number of test threads.  Note:  I only used 7
threads to reserve the 8th cpu for the master/launch thread.  I may not
need to reserve that cpu.

The following tables give the faults per cpu-second [1st and 3rd columns]
and the faults per wall-clock-second [2nd and 4th columns] on
linux-2.6.25-rc8-mm1, with and without this patch series, for a varying
number of threads.  Each line shows the average of 10 runs.  The
annotation at the top of each table gives the memory region type -- anon
vs SysV shmem -- and the memory policy -- system default vs vma/shared
policy.  In both cases, the effective policy is "preferred, local"
allocation.  [A toy sketch of how these two rates can be derived follows
the first table below.]

anon+sys-default
 N    no patches        mpol rework
 1   181041  181000    182174  182131
 2   163497  323742    163272  323820
 3   161003  475130    159777  469809
 4   155266  603399    155456  606295
 5   143072  655859    145233  670912
 6   134686  757457    137264  778470
 7   128615  865516    132672  896737

~0.6% improvement @ 1 thread; ~0.8% degradation at 2 threads; to ~1.3%
improvement @ 7 threads.
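For readers unfamiliar with the two metrics, here is a minimal,
single-threaded user-space sketch of the measurement idea.  It is a toy
stand-in -- not pft itself, nor the enhanced version linked above -- that
faults in an anonymous region once and derives the two rates from
getrusage() cpu time and gettimeofday() wall time.  In the real test, N
threads fault their shares of the region concurrently, so the wall-clock
rate scales with the thread count while the per-cpu rate exposes the
per-fault allocation overhead that these patches target.

/* pft_toy.c -- toy illustration only; not the actual page fault test */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <unistd.h>

static double tv_secs(struct timeval tv)
{
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	size_t size = 1UL << 30;	/* 1GB anonymous test region */
	long page = sysconf(_SC_PAGESIZE);
	struct rusage ru0, ru1;
	struct timeval wall0, wall1;
	double cpu, wall, faults;
	size_t off;
	char *region;

	region = mmap(NULL, size, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	getrusage(RUSAGE_SELF, &ru0);
	gettimeofday(&wall0, NULL);

	/* touch one byte per page so each page faults exactly once */
	for (off = 0; off < size; off += page)
		region[off] = 1;

	gettimeofday(&wall1, NULL);
	getrusage(RUSAGE_SELF, &ru1);

	faults = (double)(size / page);
	cpu = tv_secs(ru1.ru_utime) + tv_secs(ru1.ru_stime)
	    - tv_secs(ru0.ru_utime) - tv_secs(ru0.ru_stime);
	wall = tv_secs(wall1) - tv_secs(wall0);

	printf("faults/cpu-sec:  %.0f\n", faults / cpu);
	printf("faults/wall-sec: %.0f\n", faults / wall);
	return 0;
}

With a single thread the two rates are nearly equal, as in the N=1 rows.
With concurrent faulting threads the wall-clock rate grows with N; e.g.,
in the anon+sys-default run above, 865516 / 128615 ~= 6.7 cpu-seconds of
fault work were retired per wall-clock second across the 7 threads.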
anon+vma-policy
 N    no patches        mpol rework
 1   181610  181567    181823  181781
 2   154635  305537    162856  323839
 3   150144  440599    160255  472724
 4   145499  562344    156590  609765
 5   134843  625401    145334  669095
 6   124932  704900    138217  781865
 7   119707  806536    132963  900196

Almost no effect at 1 thread, to ~11% improvement at 7 threads.

shmem+sys-default
 N    no patches        mpol rework
 1   150218  150189    152371  152338
 2   121958  242026    128962  255850
 3   116335  345513    122205  364152
 4   105485  416377    112212  443998
 5    93032  456389    100356  490293
 6    78882  466109     87685  515296
 7    60979  423777     70195  486841

~1% improvement at 1 thread to ~20% improvement at 7 threads.  Note,
however, that the fault rate for shmem is much lower.  Some of this may
be the result of shared policy lookup via the vma get_policy op.
However, no policy has been applied for this test, so it will fall back
to the system default with no reference counting.  Some of the falloff
relative to anon memory may be the result of the radix tree management.
Something interesting to investigate.

shmem+vma-policy
 N    no patches        mpol rework
 1   146970  146936    150319  150289
 2   116237  231194    120756  239616
 3   109052  324182    113717  338037
 4    98291  387803    104346  412407
 5    88979  437758     94189  463928
 6    75370  445762     79997  472631
 7    60021  417158     63099  438584

~2% improvement at 1 thread to ~5% improvement at 7 threads.  Note that
the falloff here, relative to system default policy, is likely due to the
shared policy lookup and reference counting.  Also note that the "win"
for the reworked version falls off as the number of threads increases.
I'm guessing this is due to increased contention on the shared policy
rb-tree spin lock becoming more dominant vs the ref count cache line
effects.

For those who prefer to view this graphically, plots are available here:

	http://free.linux.hp.com/~lts/Patches/Mempolicy/

Code Sizes on x86_64:

Before series applied:

size mm/shmem.o mm/mempolicy.o mm/hugetlb.o fs/hugetlbfs/inode.o ipc/shm.o
   text    data     bss     dec     hex filename
  18017     424      24   18465    4821 mm/shmem.o
  13803      24      24   13851    361b mm/mempolicy.o
   7649     147    1892    9688    25d8 mm/hugetlb.o
   7142     432      24    7598    1dae fs/hugetlbfs/inode.o
   5412      64       0    5476    1564 ipc/shm.o
-----------
  52023

After all applied:

size mm/shmem.o mm/mempolicy.o mm/hugetlb.o fs/hugetlbfs/inode.o ipc/shm.o
   text    data     bss     dec     hex filename
  17347     424      24   17795    4583 mm/shmem.o
  14388      24      24   14436    3864 mm/mempolicy.o
   7665     147    1892    9704    25e8 mm/hugetlb.o
   7126     432      24    7582    1d9e fs/hugetlbfs/inode.o
   5388      64       0    5452    154c ipc/shm.o
-----------
  51914     [net reduction of 109 bytes of text]

Lee Schermerhorn