PATCH/RFC 00/10 Mem Policy: More Reference Counting/Fallback Fixes and Miscellaneous mempolicy cleanup

Against: 2.6.25-rc8-mm1

I think these could be merged into the -mm tree whenever it's convenient.
Note that this series depends on Mel Gorman's zonelist rework, currently
already in -mm.  Specifically, patch 8 depends on the removal of the
remote zonelist from the mempolicy struct.

This series of patches introduces a number of "cleanups" [in the eye of
the beholder, of course] to the mempolicy code, and reworks the mempolicy
reference counting, yet again, to reduce the need to take and release
reference counts in the allocation paths.  Results of some page fault
measurements and the change in code size are given below.  Summary: a
small net gain in performance overall, and a small net decrease in code
size -- both on x86_64.

Overview of patches -- see the individual patch descriptions for the
rationale:

1) basic renaming:
	mpol_free() => mpol_put()
	mpol_copy() => mpol_dup()
	'policy'   => 'mode' in struct mempolicy

2) correct fallback of shared/vma policies.

3) the aforementioned reference counting rework.

4) replacement of MPOL_DEFAULT as the system default policy 'mode' with
   MPOL_PREFERRED + local allocation, using an internal "mode flag" to
   indicate "preferred, local" instead of a negative preferred_node.
   Fewer cachelines, I think.

5) remove knowledge of mempolicy internals from shmem by moving parsing
   and formatting of the tmpfs mount option mempolicy to mempolicy.c.
   Also, replace the "naked" mempolicy mode and nodemask in the shmem
   superblock with a pointer to an allocated mempolicy.

Functional Testing:

In addition to various ad hoc memtoy tests, I used the numactl/libnuma
regression test to test these changes.  All pass.  Note that I found a
few glitches in the regression tests that result from changes in sysfs
in recent kernels.  I submitted patches to fix those to our shiny, new
numactl/libnuma maintainer.

Performance Testing:

I used an "enhanced" version of Christoph Lameter's "page fault test" to
measure the fault rate obtainable with and without these patches.  The
fault rate is an indication of the page allocation rate:  higher overhead
in page allocation results in a lower fault rate, and vice versa.  The
enhancements I made to the page fault test were all an attempt to measure
just the faults of interest and the cpu time attributable to those
faults.  The updated test is available at:

	http://free.linux.hp.com/~lts/Tools/pft-0.04.tar.gz

I ran the tests on an HP Proliant 585:  a 4-socket [= 2 numa node],
dual-core AMD x86_64 with 32G of memory.  I used a test region of 4GB
divided up between the number of test threads.  Note:  I only used 7
threads to reserve the 8th cpu for the master/launch thread.  I may not
need to reserve that cpu.

The following tables give the faults per cpu-second [1st and 3rd columns]
and the faults per wall-clock-second [2nd and 4th columns] on
linux-2.6.25-rc8-mm1, with and without this patch series, for a varying
number of threads.  Each line shows the average of 10 runs.  The
annotation at the top of each table gives the memory region type -- anon
vs SysV shmem -- and the memory policy -- system default vs vma/shared
policy.  In both cases, the effective policy is "preferred, local"
allocation.  [A toy sketch of how these two rates can be derived follows
the first table below.]

anon+sys-default
 N    no patches        mpol rework
 1   181041  181000    182174  182131
 2   163497  323742    163272  323820
 3   161003  475130    159777  469809
 4   155266  603399    155456  606295
 5   143072  655859    145233  670912
 6   134686  757457    137264  778470
 7   128615  865516    132672  896737

~0.6% improvement @ 1 thread; ~0.8% degradation at 2 threads; to ~1.3%
improvement @ 7 threads.
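For readers unfamiliar with the two metrics, here is a minimal,
single-threaded user-space sketch of the measurement idea.  It is a toy
stand-in -- not pft itself, nor the enhanced version linked above -- that
faults in an anonymous region once and derives the two rates from
getrusage() cpu time and gettimeofday() wall time.  In the real test, N
threads fault their shares of the region concurrently, so the wall-clock
rate scales with the thread count while the per-cpu rate exposes the
per-fault allocation overhead that these patches target.

/* pft_toy.c -- toy illustration only; not the actual page fault test */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <unistd.h>

static double tv_secs(struct timeval tv)
{
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	size_t size = 1UL << 30;	/* 1GB anonymous test region */
	long page = sysconf(_SC_PAGESIZE);
	struct rusage ru0, ru1;
	struct timeval wall0, wall1;
	double cpu, wall, faults;
	size_t off;
	char *region;

	region = mmap(NULL, size, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	getrusage(RUSAGE_SELF, &ru0);
	gettimeofday(&wall0, NULL);

	/* touch one byte per page so each page faults exactly once */
	for (off = 0; off < size; off += page)
		region[off] = 1;

	gettimeofday(&wall1, NULL);
	getrusage(RUSAGE_SELF, &ru1);

	faults = (double)(size / page);
	cpu = tv_secs(ru1.ru_utime) + tv_secs(ru1.ru_stime)
	    - tv_secs(ru0.ru_utime) - tv_secs(ru0.ru_stime);
	wall = tv_secs(wall1) - tv_secs(wall0);

	printf("faults/cpu-sec:  %.0f\n", faults / cpu);
	printf("faults/wall-sec: %.0f\n", faults / wall);
	return 0;
}

With a single thread the two rates are nearly equal, as in the N=1 rows.
With concurrent faulting threads the wall-clock rate grows with N; e.g.,
in the anon+sys-default run above, 865516 / 128615 ~= 6.7 cpu-seconds of
fault work were retired per wall-clock second across the 7 threads.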
anon+vma-policy
 N    no patches        mpol rework
 1   181610  181567    181823  181781
 2   154635  305537    162856  323839
 3   150144  440599    160255  472724
 4   145499  562344    156590  609765
 5   134843  625401    145334  669095
 6   124932  704900    138217  781865
 7   119707  806536    132963  900196

Almost no effect at 1 thread, to ~11% improvement at 7 threads.

shmem+sys-default
 N    no patches        mpol rework
 1   150218  150189    152371  152338
 2   121958  242026    128962  255850
 3   116335  345513    122205  364152
 4   105485  416377    112212  443998
 5    93032  456389    100356  490293
 6    78882  466109     87685  515296
 7    60979  423777     70195  486841

~1% improvement at 1 thread to ~20% improvement at 7 threads.  Note,
however, that the fault rate for shmem is much lower.  Some of this may
be the result of shared policy lookup via the vma get_policy op.
However, no policy has been applied for this test, so it will fall back
to the system default with no reference counting.  Some of the falloff
relative to anon memory may be the result of the radix tree management.
Something interesting to investigate.

shmem+vma-policy
 N    no patches        mpol rework
 1   146970  146936    150319  150289
 2   116237  231194    120756  239616
 3   109052  324182    113717  338037
 4    98291  387803    104346  412407
 5    88979  437758     94189  463928
 6    75370  445762     79997  472631
 7    60021  417158     63099  438584

~2% improvement at 1 thread to ~5% improvement at 7 threads.  Note that
the falloff here, relative to system default policy, is likely due to the
shared policy lookup and reference counting.  Also note that the "win"
for the reworked version falls off as the number of threads increases.
I'm guessing this is due to increased contention on the shared policy
rb-tree spin lock becoming more dominant vs the ref count cache line
effects.

For those who prefer to view this graphically, plots are available here:

	http://free.linux.hp.com/~lts/Patches/Mempolicy/

Code Sizes on x86_64:

Before series applied:

size mm/shmem.o mm/mempolicy.o mm/hugetlb.o fs/hugetlbfs/inode.o ipc/shm.o
   text    data     bss     dec     hex filename
  18017     424      24   18465    4821 mm/shmem.o
  13803      24      24   13851    361b mm/mempolicy.o
   7649     147    1892    9688    25d8 mm/hugetlb.o
   7142     432      24    7598    1dae fs/hugetlbfs/inode.o
   5412      64       0    5476    1564 ipc/shm.o
-----------
  52023

After all applied:

size mm/shmem.o mm/mempolicy.o mm/hugetlb.o fs/hugetlbfs/inode.o ipc/shm.o
   text    data     bss     dec     hex filename
  17347     424      24   17795    4583 mm/shmem.o
  14388      24      24   14436    3864 mm/mempolicy.o
   7665     147    1892    9704    25e8 mm/hugetlb.o
   7126     432      24    7582    1d9e fs/hugetlbfs/inode.o
   5388      64       0    5452    154c ipc/shm.o
-----------
  51914     [net reduction of 109 bytes of text]

Lee Schermerhorn