On Mon, 2010-11-15 at 15:33 +0100, Andrea Arcangeli wrote:
> Hi everyone,
>
> On Mon, Nov 15, 2010 at 08:13:14AM -0600, Christoph Lameter wrote:
> > On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> >
> > > Nice!
> >
> > Lets not get overenthused. There has been no conclusive proof that the
> > overhead introduced by automatic migration schemes is consistently less
> > than the benefit obtained by moving the data. Quite to the contrary. We
> > have over a decades worth of research and attempts on this issue and
> > there was no general improvement to be had that way.
> >
> > The reason that the manual placement interfaces exist is because there
> > was no generally beneficial migration scheme available. The manual
> > interfaces allow the writing of various automatic migrations schemes in
> > user space.
> >
> > If we can come up with something that is an improvement then lets go
> > this way but I am skeptical.
>
> I generally find the patchset very interesting but I think like
> Christoph.

Christoph is correct that we have no concrete data on modern processors
for these patch sets. I did present some results back in '07 from a
4-node, 16-processor ia64 server. The slides from that presentation are
here:

http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/197.pdf

Slide 18 shows the effect on stream benchmark execution [per pass] of
restoring locality after a transient job [a parallel kernel build]
causes a flurry of load balancing. The stream jobs return to 'best
case' performance after the perturbation even tho' they started in a
less than optimal locality configuration.

Slide 29 shows the effects of the patches on a stand-alone parallel
kernel build--disabled and enabled, with and without auto-migration of
page cache pages. Not much change in real or user time. Auto-migrating
page cache pages chewed up a lot of system time for the stand-alone
kernel build because the shared pages of the tool chain executables and
libraries were seriously thrashing. With auto-migration of anon pages
only, we see a slight [OK, tiny!] but repeatable improvement in real
time for a ~2% increase in system time.

Slide 30 shows, IMO, a more interesting result. On a heavily loaded
system with the stream benchmark running on all nodes, the interconnect
bandwidth becomes a precious resource, so locality matters more.
Comparing a parallel kernel build in this environment with
automigration [anon pages only] enabled vs disabled, I observed:

	~18% improvement in real time
	 ~4% improvement in user time
	~21% improvement in system time

Slide 27 gives you an idea of what was happening during a parallel
kernel build. "Swap faults" on that slide are faults on anon pages that
have been moved to the migration cache by automigration. Those stats
were taken with ad hoc instrumentation; I've added some vmstats since
then.

So, if an objective is to pack more jobs [guest VMs] onto a single
system, one might suppose that we'd have a more heavily loaded system,
perhaps spending a lot of time in the kernel handling various faults.
Something like this approach might help, even on current generation
NUMA platforms, altho' I'd expect more benefit on larger socket count
systems. Should be testable.

>
> It's good to give the patchset more visibility as it's quite unique in
> this area, but when talking with Lee I also thought the synchronous
> migrate on fault was probably too aggressive and I like an algorithm
> where memory follows cpus and cpus follow memory in a total dynamic
> way.
>
> I suggested Lee during our chat (and also to others during KS+Plumbers)
> that we need a more dynamic algorithm that works in the background
> asynchronously. Specifically I want the cpu to follow memory closely
> whenever idle status allows it (change cpu in context switch is cheap,
> I don't like pinning or "single" home node concept) and then memory
> slowly also in tandem follow cpu in the background with kernel
> thread. So that both having cpu follow memory fast, and memory follow
> cpu slow, eventually things over time should converge in a optimal
> behavior. I like the migration done from a kthread like
> khugepaged/ksmd, not synchronously adding latency to page fault (or
> having to take down ptes to trigger the migrate on fault, migrate
> never need to require the app to exit kernel and take a fault just to
> migrate, it happens transparently as far as userland is concerned,
> well of course unless it trips on the migration pte just at the wrong
> time :).

I don't know about the background migration thread. Christoph mentioned
the decades of research and attempts to address this issue. IMO, most
of these stumbled on the cost of collecting sufficient data to know
what pages to migrate where. And, if you don't know which pages to
migrate, you can end up doing a lot of work for little gain. I recall
Christoph or someone at SGI calling it "just too late" migration.

With lazy migration, we KNOW what pages the task is referencing in the
fault path, so we move only the pages actually needed right now--because
at any time the scheduler could decide to move the task to yet another
node. I did add a "migration interval" control to experiment with
different length delays between inter-node migrations, to give a task
time to amortize the automigration overhead. Needs more
"experimentation".
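To make the "we KNOW what pages the task is referencing" point a bit
more concrete, here is a rough userspace analogy using the existing
move_pages(2) interface. This is not the fault-path code in these
patches, just an illustration of the "move only the page actually being
used" idea; the single-page buffer and the choice of the task's current
node as the target are illustrative assumptions. Build with -lnuma.

/*
 * Userspace analogy of lazy migration: instead of migrating a whole
 * address range, query where one page we are about to use lives and
 * move just that page to the node this task is currently running on.
 */
#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *buf = NULL;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this system\n");
		return 1;
	}

	/* one page of anon memory, faulted in wherever it lands first */
	if (posix_memalign(&buf, page_size, page_size)) {
		perror("posix_memalign");
		return 1;
	}
	memset(buf, 0, page_size);

	/* the node this task is running on right now */
	int target_node = numa_node_of_cpu(sched_getcpu());

	void *pages[1] = { buf };
	int nodes[1] = { target_node };
	int status[1];

	/* with a NULL node array, move_pages() just reports the page's node */
	if (move_pages(0, 1, pages, NULL, status, 0) == 0)
		printf("page on node %d, task on node %d\n",
		       status[0], target_node);

	/* migrate only this page to the node the task is actually using */
	if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) == 0)
		printf("page now on node %d\n", status[0]);
	else
		perror("move_pages");

	free(buf);
	return 0;
}

The kernel-side lazy path gets the equivalent information for free at
fault time, when both the referencing task and its current node are
already known.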
> So the patchset looks very interesting, and it may actually be optimal
> for some slower hardware, but I've the perception these days the
> memory being remote isn't as a big deal as not keeping all two memory
> controllers in action simultaneously (using just one controller is
> worse than using both simultaneously from the wrong end, locality not
> as important as not stepping in each other toes). So in general
> synchronous migrate on fault seems a bit too aggressive to me and not
> ideal for newer hardware. Still this is one of the most interesting
> patchsets at this time in this area I've seen so far.

As I mentioned above, my results were on older hardware, so it will be
interesting to see the results on modern hardware with lots of guest
VMs as the workload. Maybe I'll get to this eventually. And I believe
you're correct that on modern systems being remote is not a big deal if
the interconnect and target node are lightly loaded. But, in my
experience with recent 4 and 8-node servers, locality still matters
very much as load increases.

> The homenode logic ironically may be optimal with the most important
> bench because the way that bench is setup all vm are fairly small and
> there are plenty of them so it'll never happen that a vm has more
> memory than what can fit in the ram of a single node, but I like
> dynamic approach that works best in all environments, even if it's not
> clearly as simple and maybe not as optimal in the one relevant
> benchmark we care about. I'm unsure what the homenode is supposed to
> decide when the task has two three four times the ram that fits in a
> single node (and that may not be a so uncommon scenario after all).
> I admit not having read enough on this homenode logic, but I never got
> any attraction to it personally as there should never be any single
> "home" to any task in my view.

Well, as we discussed, we now have an implicit "home node" anyway: the
node where a task's kernel data structures are first allocated. A task
that spends much time in the kernel will always run faster on the node
where its task struct and thread info/stack live. So, until we can
migrate these [this was easier in unix-based kernels], we'll always
have an implicit home node.

I'm attaching some statistics I collected while running a stress load
on the patches before posting them. The 'vmstress-stats' file includes
a description of the statistics. I've also attached a simple script to
watch the automigration stats if you decide to try out these patches.

Heads up: as I mentioned to Kosaki-san in another mail, lazy migration
[migrate-on-fault] incurs a null pointer deref in swap_cgroup_record()
in the most recent mmotm [09nov on 37-rc1]. The patches seem quite
robust [modulo a migration remove/duplicate race, I think, under heavy
load :(] on the mmotm version referenced in the patches: 03nov on
2.6.36. You may be able to find a copy from Andrew, but I've placed a
copy here:

http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.36-mmotm-101103-1217/mmotm-101103-1217.tar.gz

Regards,
Lee
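The attached automig_stats watcher is a shell script and is not
reproduced inline here. For readers without the attachment, a minimal C
stand-in that polls /proc/vmstat for migration-related counters might
look like the sketch below; the "migrate" name substring is only a
guess, since the exact vmstat counter names added by the patches are
not listed in this mail.

/*
 * Minimal stand-in for a stats watcher: every few seconds, scan
 * /proc/vmstat and print each counter whose name contains the given
 * substring, together with the delta since the previous sample.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_COUNTERS 64

int main(int argc, char **argv)
{
	const char *match = argc > 1 ? argv[1] : "migrate"; /* assumed fragment */
	unsigned long long prev[MAX_COUNTERS] = { 0 };

	for (;;) {
		FILE *fp = fopen("/proc/vmstat", "r");
		char name[128];
		unsigned long long val;
		int i = 0;

		if (!fp) {
			perror("/proc/vmstat");
			return 1;
		}
		/* /proc/vmstat is a sequence of "name value" pairs */
		while (fscanf(fp, "%127s %llu", name, &val) == 2) {
			if (!strstr(name, match) || i >= MAX_COUNTERS)
				continue;
			printf("%-32s %15llu  (+%llu)\n",
			       name, val, val - prev[i]);
			prev[i] = val;
			i++;
		}
		fclose(fp);
		printf("\n");
		sleep(5);	/* sample interval in seconds */
	}
}

Run it as "./watcher" or "./watcher <substring>" while the stress load
is running; the table that follows is this kind of periodic snapshot of
cumulative counters.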
Final stats for 58-minute usex vm stress workload.

 pgs loc     pages     pages  |   tasks      pages      pages  pages  |  ----------------- mig cache ------------------
 checked  misplacd  migrated  | migrated   scanned    selected failed |  pgs added  pgs removd  duplicates  refs freed
46496887  44500155  63046843  |   965933  348568709  159804782  595   |  151431996  151416378  187409069   338825351
46503015  44505977  63052665  |   967033  349001693  159996085  595   |  151616958  151602326  187615344   339217482
46508962  44511591  63123815  |   968090  349431720  160191652  595   |  151806705  151796445  187825762   339622018
46514565  44516837  63129061  |   969031  349818983  160368637  595   |  151984095  151972326  188023077   339994848
46520744  44522719  63200479  |   970066  350273208  160586754  595   |  152191950  152172881  188256406   340429177
46526660  44528377  63206137  |   971096  351226149  161308899  595   |  152427014  152369709  188512466   340882087
46533120  44534533  63212293  |   972074  351664143  161522889  595   |  152662157  152583427  188776833   341360178
46538985  44540194  63217953  |   973051  352104161  161738022  595   |  152891544  152785119  189029203   341814220
46808222  44809022  63748925  |   975330  352692128  161930205  596   |  153503983  153174539  189746717   342921179
47056161  45056479  63996383  |   978160  353338853  162087525  597   |  153632047  153543820  190009418   343553163
47375470  45375057  64380497  |   981041  354513459  162767074  598   |  154280022  154043105  190781347   344824378
47620387  45619074  64624514  |   983463  355193001  162993768  598   |  154468417  154452153  191055045   345507110
47626207  45624603  64630043  |   984479  355619380  163192799  598   |  154656878  154644741  191271746   345915993
47632616  45630721  64636160  |   985631  356053915  163378618  598   |  154844351  154830927  191478638   346309235
47638285  45635994  64641433  |   986591  356529658  163639764  598   |  155030094  155021407  191689679   346710556
47643792  45641209  64646649  |   987526  356924666  163822582  598   |  155219308  155193598  191908717   347092074
47649221  45646312  64651752  |   988476  357318754  164004331  598   |  155411758  155379515  192129976   347489210
47654343  45651155  64656595  |   989351  357724761  164207764  598   |  155636719  155567538  192410941   347926759
47660396  45656848  64924432  |   990406  358188870  164430730  598   |  155869117  155795886  192680512   348447071
47661993  45658349  64925933  |   990663  358289854  164475908  598   |  155921468  155921168  192736716   348657884

Migrate on Fault stats:

pgs loc checked -- pages found in swap/migration cache by do_swap_page()
	with zero page_mapcount() and otherwise "stable".

pages misplacd  -- of the "loc checked" pages, the number that were found
	to be misplaced relative to the mempolicy in effect--vma, task or
	system default.

pages migrated  -- all pages migrated: migrate-on-fault, mbind(), ...
	Exceeds the misplaced pages in the stats above because the test
	load included programs that moved memory regions about using
	mbind() with the MPOL_MF_MOVE flag.

Auto-migration Stats:

tasks migrated  -- number of internode task migrations. Each of these
	migrations resulted in the task walking its address space looking
	for anon pages in vmas with local allocation policy. Includes
	kicks via /proc/<pid>/migrate.

pages scanned   -- total number of pages examined as candidates for
	auto-migration in mm/mempolicy.c:check_range() as a result of
	internode task migration or /proc/<pid>/migrate scans.

pages selected  -- anon pages selected for auto-migration. If lazy
	auto-migration is enabled [the default], these pages will be
	unmapped to allow migrate-on-fault to migrate them if and when a
	task faults the page. If lazy auto-migration is disabled, these
	pages will be directly migrated [pulled] to the destination node.

pages failed    -- the number of selected pages that the kernel failed
	to unmap for lazy migration or failed to migrate directly.

Migration Cache Statistics:

pgs added   -- the number of pages added to the migration cache. This
	occurs when pages are unmapped for lazy migration.

pgs removd  -- the number of pages removed from the migration cache.
	This occurs when the last pte referencing the cache entry is
	replaced with a present page pte. The number of pages added less
	the number removed is the number of pages still in the cache.

duplicates  -- count of migration_duplicate() calls, usually via
	swap_duplicate(), to add a reference to a migration cache entry.
	This occurs when a page in the migration cache is unmapped in
	try_to_unmap_one() and when a task with anon pages in the
	migration cache forks and all of its anon pages become COW shared
	with the child in copy_one_pte().

refs freed  -- count of migration cache entry references freed, usually
	via one of the swap cache free functions. When the reference count
	on a migration cache entry drops to zero, the entry is removed
	from the cache. Thus, the number of pages added plus the number of
	duplicates should equal the number of refs freed plus the number
	of pages still in the cache [adds - removes].
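That last sentence is an identity that can be checked directly against
the final row of the stats table above: adds + duplicates should equal
refs freed + (adds - removes), which reduces to duplicates + removes ==
refs freed. A trivial check with those values plugged in:

/*
 * Check the migration cache accounting identity described above, using
 * the values from the final row of the stats table:
 *   pgs added + duplicates == refs freed + pages still in cache
 * where pages still in cache = pgs added - pgs removd.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long adds       = 155921468ULL;	/* pgs added  */
	unsigned long long removes    = 155921168ULL;	/* pgs removd */
	unsigned long long duplicates = 192736716ULL;	/* duplicates */
	unsigned long long refs_freed = 348657884ULL;	/* refs freed */

	unsigned long long in_cache = adds - removes;

	printf("pages still in migration cache: %llu\n", in_cache);
	printf("adds + duplicates   = %llu\n", adds + duplicates);
	printf("refs freed + cached = %llu\n", refs_freed + in_cache);
	printf("identity %s\n",
	       adds + duplicates == refs_freed + in_cache ?
	       "holds" : "violated");
	return 0;
}

For that final sample the identity holds exactly, with 300 pages still
sitting in the migration cache.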
Attachment: automig_stats [Description: application/shellscript]