On Wed, Nov 17, 2010 at 12:03:56PM -0500, Lee Schermerhorn wrote:
> Slide 30 shows, IMO, a more interesting result. On a heavily loaded
> system with the stream benchmark running on all nodes, the
> interconnect bandwidth becomes a precious resource, so locality
> matters more. Comparing a parallel kernel build in this environment
> with automigration [anon pages only] enabled vs disabled, I observed:
>
> ~18% improvement in real time
> ~4% improvement in user time
> ~21% improvement in system time

Nice results indeed.

> With lazy migration, we KNOW what pages the task is referencing in
> the fault path, so we move only the pages actually needed right now.
> Because any time now the scheduler could decide to move the task to
> a different node. I did add a "migration interval" control to
> experiment with different length delays in inter-node migration to
> give a task time to amortize the automigration overhead. Needs more
> "experimentation".

Ok, my idea is that even if we don't collect any "stat", we can just
clear the young bits instead of creating swap entries. That's quite
fast, especially on hugepages, as there are very few hugepmds. Then
later, from the kernel thread, we scan the hugepmds again to check if
any young bit is set, and we validate the placement of the
PageTransHuge page against the actual cpu the thread is running on
(or was last running on). That should lead to the exact same results,
but with lower overhead and no page fault at all if the placement was
already correct (and no page fault unless userland accesses the page
exactly during the migration copy for the migration entry). See the
sketch further below.

The scan of the hugepmds to see which ones have the young bit set is
equivalent to your unmapping of the pages as far as the "timing
issue" is concerned. Before you unmap the pages, things aren't
guaranteed to be running on the right node with your current
migrate-on-fault either. So I don't see why it should behave any
differently from a heuristic perspective. I just don't like unmapping
the pages and I'd prefer to just clear the young bit instead.

Now, there are architectures that may not have any young bit at all,
and those would have to approximate it by emulating the young bit in
software, so ideally the same logic should work for those archs too
(and maybe those archs are obsolete or don't need numa).

> As I mentioned above, my results were on older hardware, so it will
> be interesting to see the results on modern hardware with lots of
> guest vms as the workload. Maybe I'll get to this eventually. And I
> believe

Agreed.

> Well, as we discussed, now we have an implicit "home node" anyway:
> the node where a task's kernel data structures are first allocated.
> A task that spends much time in the kernel will always run faster on
> the node where its task struct and thread info/stack live. So, until
> we can migrate these [was easier in unix based kernels], we'll
> always have an implicit home node.

Yes, I remember your point. But there is no guarantee that the kernel
stack and the other kernel data structures were allocated on the node
local to where fork() was executed, so in effect there is no home
node. Also, by the time execve runs a couple of times in a row, the
concept of a home node as far as the kernel is concerned may well be
lost forever, and be equivalent to having allocated the task struct
and kernel stack on the wrong node during fork... So as of now there
is no real home node; we just try to do the obvious thing when fork
runs, but then it goes random over time, exactly like it happens for
userland.
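Going back to the young bit idea, here is a rough sketch of what a
single hugepmd check could look like. It's not against any particular
tree: check_thp_placement() and queue_thp_migration() are made-up
names, the pmd walk calling it is omitted, and the exact young bit
helper may differ per arch. Only the control flow matters: age the
pmd on each pass, and only if the young bit came back set and the THP
sits on the wrong node do we queue it for migration, with no unmap
and no page fault.

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/huge_mm.h>
#include <linux/topology.h>

/*
 * Sketch only: called for each hugepmd by some scanner (not shown).
 * Each pass clears the young bit; a set bit found on the next pass
 * means the THP was referenced in between, so its placement is worth
 * validating against the cpu the task last ran on.
 */
static void check_thp_placement(struct task_struct *tsk,
				struct vm_area_struct *vma,
				unsigned long addr, pmd_t *pmd)
{
	struct page *page;
	int target_nid;

	if (!pmd_trans_huge(*pmd))
		return;

	/* reports the old young bit and re-arms it for the next pass */
	if (!pmdp_test_and_clear_young(vma, addr, pmd))
		return;		/* not referenced since the last scan */

	page = pmd_page(*pmd);
	/* node of the cpu the task is running on (or last ran on) */
	target_nid = cpu_to_node(task_cpu(tsk));

	if (page_to_nid(page) == target_nid)
		return;		/* placement already correct, nothing to do */

	/* referenced and misplaced: hand it to the migration code */
	queue_thp_migration(page, target_nid);
}

Using pmdp_test_and_clear_young() here means each scan both reads and
re-arms the young bit, so a THP is only considered if it was actually
referenced between two scans.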
You're fixing userland for the long lived case; the kernel is the
same issue, but I guess we're not going to address the kernel side
any time soon (no migrate for slab, unless we dramatically slow down
all kernel allocations ;). migrate, OTOH, is a nice abstraction that
we can take advantage of for plenty of things, like memory
compaction, and here to fix up the memory locality of the tasks.

About the pagecache, the problem in slide 26 is that with make -j,
99% of the mmapped pagecache (which would be migrated-on-fault) is
shared across all tasks, so it thrashes badly and for no good reason.
The exact same regression and thrashing would happen for shared
anonymous memory I guess... (fork + reads in a loop). So it'd be
better to start migrating everything (pagecache included!) only if
page_mapcount == 1; that should be a safe start... Migrating shared
entities is way more troublesome unless we're able to verify that all
users sit on the same node (only in that case would it be safe, and
only then would it have a chance to reduce the thrashing).

So my suggestion would be to create a patch with a kernel thread that
scans a thousand ptes or hugepmds per second. For each pte or hugepmd
encountered we check that the mapcount is 1; if it is, we run
test_and_clear_young_notify() (which takes care of the shadow ptes
too and won't actually trigger any linux page fault, nor a gup_fast
for a secondary page fault, if the page is in the right place). If it
returns true we validate that the page is on the right node and, if
it isn't, we migrate it right away... That applies to the pagecache
too (only if mapcount == 1). A sketch of the scanner loop I have in
mind follows at the end of this mail.

I would remove any hook from do_swap_page, and the migrate-on-fault
concept as a whole, and I'd move the whole thing over to
test_and_clear_young_notify() instead. There's surely stuff to keep
from this patchset (all the validation of the page, the migration
invocation, etc.).

Also, currently migration won't be able to migrate a THP as such:
it'd work, but it'd split it, so it's not so nice that khugepaged is
then required to get the THP performance back after migration.
Unfortunately hugetlbfs has grown even further to mimic the core VM
code, and now migrate has special hooks to migrate hugetlbfs pages,
which will make it more difficult to teach migrate to differentiate
between a hugetlbfs migrate and a THP migrate so that both can work
simultaneously... I'm not sure why hugetlbfs has to grow so much
(especially outside of the vm_flags & VM_HUGETLB checks that in the
past kept it more separate and less intrusive into the core VM).
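And here is the scanner sketch, under heavy assumptions: struct
scan_slot, scan_next_pte() and the thread name knuma_scand are
made-up placeholders for whatever walker hands back the pte/page/task
being scanned, migrate_misplaced_page() stands in for a small wrapper
around the migrate_pages() machinery, and test_and_clear_young_notify()
is the mmu-notifier-aware young bit test mentioned above (whatever
its exact in-tree spelling). Only the control flow matters.

#include <linux/kthread.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/topology.h>

/* placeholder describing one scanned pte; the walker filling it in
 * (scan_next_pte()) is not shown and the names are made up */
struct scan_slot {
	struct vm_area_struct *vma;
	unsigned long addr;
	pte_t *ptep;
	struct page *page;
	struct task_struct *task;	/* the (single) task mapping it */
};

static int knuma_scand(void *dummy)
{
	struct scan_slot slot;

	while (!kthread_should_stop()) {
		int i;

		/* roughly a thousand ptes/hugepmds per second */
		for (i = 0; i < 1000 && scan_next_pte(&slot); i++) {
			struct page *page = slot.page;
			int target_nid;

			/*
			 * Only touch pages with a single mapper, anon
			 * and pagecache alike; shared pages are left
			 * alone to avoid the make -j style thrashing.
			 */
			if (page_mapcount(page) != 1)
				continue;

			/*
			 * Ages the pte and the shadow ptes too; no
			 * linux page fault and no gup_fast for a
			 * secondary page fault if the page is already
			 * in the right place.
			 */
			if (!test_and_clear_young_notify(slot.vma,
							 slot.addr,
							 slot.ptep))
				continue;

			/* referenced since the last pass: validate node */
			target_nid = cpu_to_node(task_cpu(slot.task));
			if (page_to_nid(page) == target_nid)
				continue;

			/* misplaced: migrate it right away */
			migrate_misplaced_page(page, target_nid);
		}
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}

Starting it would just be a kthread_run(knuma_scand, NULL,
"knuma_scand") at init time.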