On Mon, 2010-11-15 at 15:33 +0100, Andrea Arcangeli wrote:
> Hi everyone,
>
> On Mon, Nov 15, 2010 at 08:13:14AM -0600, Christoph Lameter wrote:
> > On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> >
> > > Nice!
> >
> > Lets not get overenthused. There has been no conclusive proof that the
> > overhead introduced by automatic migration schemes is consistently less
> > than the benefit obtained by moving the data. Quite to the contrary. We
> > have over a decades worth of research and attempts on this issue and
> > there was no general improvement to be had that way.
> >
> > The reason that the manual placement interfaces exist is because there
> > was no generally beneficial migration scheme available. The manual
> > interfaces allow the writing of various automatic migrations schemes in
> > user space.
> >
> > If we can come up with something that is an improvement then lets go
> > this way but I am skeptical.
>
> I generally find the patchset very interesting but I think like
> Christoph.

Christoph is correct that we have no concrete data on modern processors
for these patch sets. I did present some results back in '07 from a
4-node, 16-processor ia64 server. The slides from that presentation are
here:

http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/197.pdf

Slide 18 shows the effect on stream benchmark execution [per pass] of
restoring locality after a transient job [a parallel kernel build]
causes a flurry of load balancing. The stream jobs return to 'best
case' performance after the perturbation even tho' they started in a
less than optimal locality configuration.

Slide 29 shows the effects of the patches on a stand-alone parallel
kernel build--disabled and enabled, with and without auto-migration of
page cache pages. Not much change in real or user time. Auto-migrating
page cache pages chewed up a lot of system time for the stand-alone
kernel build because the shared pages of the tool chain executables and
libraries were seriously thrashing. With auto-migration of anon pages
only, we see a slight [OK, tiny!] but repeatable improvement in real
time for a ~2% increase in system time.

Slide 30 shows, IMO, a more interesting result. On a heavily loaded
system with the stream benchmark running on all nodes, the interconnect
bandwidth becomes a precious resource, so locality matters more.
Comparing a parallel kernel build in this environment with
automigration [anon pages only] enabled vs disabled, I observed:

	~18% improvement in real time
	 ~4% improvement in user time
	~21% improvement in system time

Slide 27 gives you an idea of what was happening during a parallel
kernel build. "Swap faults" on that slide are faults on anon pages that
have been moved to the migration cache by automigration. Those stats
were taken with ad hoc instrumentation; I've added some vmstats since
then.

So, if an objective is to pack more jobs [guest VMs] onto a single
system, one might suppose that we'd have a more heavily loaded system,
perhaps spending a lot of time in the kernel handling various faults.
Something like this approach might help, even on current generation
NUMA platforms, altho' I'd expect more benefit on larger socket count
systems. Should be testable.

>
> It's good to give the patchset more visibility as it's quite unique in
> this area, but when talking with Lee I also thought the synchronous
> migrate on fault was probably too aggressive and I like an algorithm
> where memory follows cpus and cpus follow memory in a total dynamic
> way.
>
> I suggested Lee during our chat (and also to others during KS+Plumbers)
> that we need a more dynamic algorithm that works in the background
> asynchronously. Specifically I want the cpu to follow memory closely
> whenever idle status allows it (change cpu in context switch is cheap,
> I don't like pinning or "single" home node concept) and then memory
> slowly also in tandem follow cpu in the background with kernel
> thread. So that both having cpu follow memory fast, and memory follow
> cpu slow, eventually things over time should converge in a optimal
> behavior. I like the migration done from a kthread like
> khugepaged/ksmd, not synchronously adding latency to page fault (or
> having to take down ptes to trigger the migrate on fault, migrate
> never need to require the app to exit kernel and take a fault just to
> migrate, it happens transparently as far as userland is concerned,
> well of course unless it trips on the migration pte just at the wrong
> time :).

I don't know about the background migration thread. Christoph mentioned
the decades of research and attempts to address this issue. IMO, most
of these stumbled on the cost of collecting sufficient data to know
what pages to migrate where. And, if you don't know which pages to
migrate, you can end up doing a lot of work for little gain. I recall
Christoph or someone at SGI calling it "just too late" migration.

With lazy migration, we KNOW what pages the task is referencing in the
fault path, so we move only the pages actually needed right now--because
at any time the scheduler could decide to move the task to yet another
node. I did add a "migration interval" control to experiment with
different length delays between inter-node migrations, to give a task
time to amortize the automigration overhead. Needs more
"experimentation".
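To make the "we KNOW what pages the task is referencing" point a bit
more concrete, here is a rough userspace analogy using the existing
move_pages(2) interface. This is not the fault-path code in these
patches, just an illustration of the "move only the page actually being
used" idea; the single-page buffer and the choice of the task's current
node as the target are illustrative assumptions. Build with -lnuma.

/*
 * Userspace analogy of lazy migration: instead of migrating a whole
 * address range, query where one page we are about to use lives and
 * move just that page to the node this task is currently running on.
 */
#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *buf = NULL;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this system\n");
		return 1;
	}

	/* one page of anon memory, faulted in wherever it lands first */
	if (posix_memalign(&buf, page_size, page_size)) {
		perror("posix_memalign");
		return 1;
	}
	memset(buf, 0, page_size);

	/* the node this task is running on right now */
	int target_node = numa_node_of_cpu(sched_getcpu());

	void *pages[1] = { buf };
	int nodes[1] = { target_node };
	int status[1];

	/* with a NULL node array, move_pages() just reports the page's node */
	if (move_pages(0, 1, pages, NULL, status, 0) == 0)
		printf("page on node %d, task on node %d\n",
		       status[0], target_node);

	/* migrate only this page to the node the task is actually using */
	if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) == 0)
		printf("page now on node %d\n", status[0]);
	else
		perror("move_pages");

	free(buf);
	return 0;
}

The kernel-side lazy path gets the equivalent information for free at
fault time, when both the referencing task and its current node are
already known.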
> So the patchset looks very interesting, and it may actually be optimal
> for some slower hardware, but I've the perception these days the
> memory being remote isn't as a big deal as not keeping all two memory
> controllers in action simultaneously (using just one controller is
> worse than using both simultaneously from the wrong end, locality not
> as important as not stepping in each other toes). So in general
> synchronous migrate on fault seems a bit too aggressive to me and not
> ideal for newer hardware. Still this is one of the most interesting
> patchsets at this time in this area I've seen so far.

As I mentioned above, my results were on older hardware, so it will be
interesting to see the results on modern hardware with lots of guest
VMs as the workload. Maybe I'll get to this eventually. And I believe
you're correct that on modern systems being remote is not a big deal if
the interconnect and target node are lightly loaded. But, in my
experience with recent 4 and 8-node servers, locality still matters
very much as load increases.

> The homenode logic ironically may be optimal with the most important
> bench because the way that bench is setup all vm are fairly small and
> there are plenty of them so it'll never happen that a vm has more
> memory than what can fit in the ram of a single node, but I like
> dynamic approach that works best in all environments, even if it's not
> clearly as simple and maybe not as optimal in the one relevant
> benchmark we care about. I'm unsure what the homenode is supposed to
> decide when the task has two three four times the ram that fits in a
> single node (and that may not be a so uncommon scenario after all).
> I admit not having read enough on this homenode logic, but I never got
> any attraction to it personally as there should never be any single
> "home" to any task in my view.

Well, as we discussed, we now have an implicit "home node" anyway: the
node where a task's kernel data structures are first allocated. A task
that spends much time in the kernel will always run faster on the node
where its task struct and thread info/stack live. So, until we can
migrate these [this was easier in unix-based kernels], we'll always
have an implicit home node.

I'm attaching some statistics I collected while running a stress load
on the patches before posting them. The 'vmstress-stats' file includes
a description of the statistics. I've also attached a simple script to
watch the automigration stats if you decide to try out these patches.

Heads up: as I mentioned to Kosaki-san in another mail, lazy migration
[migrate-on-fault] incurs a null pointer deref in swap_cgroup_record()
in the most recent mmotm [09nov on 37-rc1]. The patches seem quite
robust [modulo a migration remove/duplicate race, I think, under heavy
load :(] on the mmotm version referenced in the patches: 03nov on
2.6.36. You may be able to find a copy from Andrew, but I've placed a
copy here:

http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.36-mmotm-101103-1217/mmotm-101103-1217.tar.gz

Regards,
Lee
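The attached automig_stats watcher is a shell script and is not
reproduced inline here. For readers without the attachment, a minimal C
stand-in that polls /proc/vmstat for migration-related counters might
look like the sketch below; the "migrate" name substring is only a
guess, since the exact vmstat counter names added by the patches are
not listed in this mail.

/*
 * Minimal stand-in for a stats watcher: every few seconds, scan
 * /proc/vmstat and print each counter whose name contains the given
 * substring, together with the delta since the previous sample.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_COUNTERS 64

int main(int argc, char **argv)
{
	const char *match = argc > 1 ? argv[1] : "migrate"; /* assumed fragment */
	unsigned long long prev[MAX_COUNTERS] = { 0 };

	for (;;) {
		FILE *fp = fopen("/proc/vmstat", "r");
		char name[128];
		unsigned long long val;
		int i = 0;

		if (!fp) {
			perror("/proc/vmstat");
			return 1;
		}
		/* /proc/vmstat is a sequence of "name value" pairs */
		while (fscanf(fp, "%127s %llu", name, &val) == 2) {
			if (!strstr(name, match) || i >= MAX_COUNTERS)
				continue;
			printf("%-32s %15llu  (+%llu)\n",
			       name, val, val - prev[i]);
			prev[i] = val;
			i++;
		}
		fclose(fp);
		printf("\n");
		sleep(5);	/* sample interval in seconds */
	}
}

Run it as "./watcher" or "./watcher <substring>" while the stress load
is running; the table that follows is this kind of periodic snapshot of
cumulative counters.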
Final stats for 58-minute usex vm stress workload.

 pgs loc     pages     pages  |   tasks      pages      pages  pages  |  ----------------- mig cache ------------------
 checked  misplacd  migrated  | migrated   scanned    selected failed |  pgs added  pgs removd  duplicates  refs freed
46496887  44500155  63046843  |   965933  348568709  159804782  595   |  151431996  151416378  187409069   338825351
46503015  44505977  63052665  |   967033  349001693  159996085  595   |  151616958  151602326  187615344   339217482
46508962  44511591  63123815  |   968090  349431720  160191652  595   |  151806705  151796445  187825762   339622018
46514565  44516837  63129061  |   969031  349818983  160368637  595   |  151984095  151972326  188023077   339994848
46520744  44522719  63200479  |   970066  350273208  160586754  595   |  152191950  152172881  188256406   340429177
46526660  44528377  63206137  |   971096  351226149  161308899  595   |  152427014  152369709  188512466   340882087
46533120  44534533  63212293  |   972074  351664143  161522889  595   |  152662157  152583427  188776833   341360178
46538985  44540194  63217953  |   973051  352104161  161738022  595   |  152891544  152785119  189029203   341814220
46808222  44809022  63748925  |   975330  352692128  161930205  596   |  153503983  153174539  189746717   342921179
47056161  45056479  63996383  |   978160  353338853  162087525  597   |  153632047  153543820  190009418   343553163
47375470  45375057  64380497  |   981041  354513459  162767074  598   |  154280022  154043105  190781347   344824378
47620387  45619074  64624514  |   983463  355193001  162993768  598   |  154468417  154452153  191055045   345507110
47626207  45624603  64630043  |   984479  355619380  163192799  598   |  154656878  154644741  191271746   345915993
47632616  45630721  64636160  |   985631  356053915  163378618  598   |  154844351  154830927  191478638   346309235
47638285  45635994  64641433  |   986591  356529658  163639764  598   |  155030094  155021407  191689679   346710556
47643792  45641209  64646649  |   987526  356924666  163822582  598   |  155219308  155193598  191908717   347092074
47649221  45646312  64651752  |   988476  357318754  164004331  598   |  155411758  155379515  192129976   347489210
47654343  45651155  64656595  |   989351  357724761  164207764  598   |  155636719  155567538  192410941   347926759
47660396  45656848  64924432  |   990406  358188870  164430730  598   |  155869117  155795886  192680512   348447071
47661993  45658349  64925933  |   990663  358289854  164475908  598   |  155921468  155921168  192736716   348657884

Migrate on Fault stats:

pgs loc checked -- pages found in swap/migration cache by do_swap_page()
	with zero page_mapcount() and otherwise "stable".

pages misplacd  -- of the "loc checked" pages, the number that were found
	to be misplaced relative to the mempolicy in effect--vma, task or
	system default.

pages migrated  -- all pages migrated: migrate-on-fault, mbind(), ...
	Exceeds the misplaced pages in the stats above because the test
	load included programs that moved memory regions about using
	mbind() with the MPOL_MF_MOVE flag.

Auto-migration Stats:

tasks migrated  -- number of internode task migrations. Each of these
	migrations resulted in the task walking its address space looking
	for anon pages in vmas with local allocation policy. Includes
	kicks via /proc/<pid>/migrate.

pages scanned   -- total number of pages examined as candidates for
	auto-migration in mm/mempolicy.c:check_range() as a result of
	internode task migration or /proc/<pid>/migrate scans.

pages selected  -- anon pages selected for auto-migration. If lazy
	auto-migration is enabled [the default], these pages will be
	unmapped to allow migrate-on-fault to migrate them if and when a
	task faults the page. If lazy auto-migration is disabled, these
	pages will be directly migrated [pulled] to the destination node.

pages failed    -- the number of selected pages that the kernel failed
	to unmap for lazy migration or failed to migrate directly.

Migration Cache Statistics:

pgs added   -- the number of pages added to the migration cache. This
	occurs when pages are unmapped for lazy migration.

pgs removd  -- the number of pages removed from the migration cache.
	This occurs when the last pte referencing the cache entry is
	replaced with a present page pte. The number of pages added less
	the number removed is the number of pages still in the cache.

duplicates  -- count of migration_duplicate() calls, usually via
	swap_duplicate(), to add a reference to a migration cache entry.
	This occurs when a page in the migration cache is unmapped in
	try_to_unmap_one() and when a task with anon pages in the
	migration cache forks and all of its anon pages become COW shared
	with the child in copy_one_pte().

refs freed  -- count of migration cache entry references freed, usually
	via one of the swap cache free functions. When the reference count
	on a migration cache entry drops to zero, the entry is removed
	from the cache. Thus, the number of pages added plus the number of
	duplicates should equal the number of refs freed plus the number
	of pages still in the cache [adds - removes].
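That last sentence is an identity that can be checked directly against
the final row of the stats table above: adds + duplicates should equal
refs freed + (adds - removes), which reduces to duplicates + removes ==
refs freed. A trivial check with those values plugged in:

/*
 * Check the migration cache accounting identity described above, using
 * the values from the final row of the stats table:
 *   pgs added + duplicates == refs freed + pages still in cache
 * where pages still in cache = pgs added - pgs removd.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long adds       = 155921468ULL;	/* pgs added  */
	unsigned long long removes    = 155921168ULL;	/* pgs removd */
	unsigned long long duplicates = 192736716ULL;	/* duplicates */
	unsigned long long refs_freed = 348657884ULL;	/* refs freed */

	unsigned long long in_cache = adds - removes;

	printf("pages still in migration cache: %llu\n", in_cache);
	printf("adds + duplicates   = %llu\n", adds + duplicates);
	printf("refs freed + cached = %llu\n", refs_freed + in_cache);
	printf("identity %s\n",
	       adds + duplicates == refs_freed + in_cache ?
	       "holds" : "violated");
	return 0;
}

For that final sample the identity holds exactly, with 300 pages still
sitting in the migration cache.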
Attachment: automig_stats [Description: application/shellscript]