On Wed, Nov 17, 2010 at 12:03:56PM -0500, Lee Schermerhorn wrote:
> Slide 30 shows, IMO, a more interesting result. On a heavily loaded
> system with the stream benchmark running on all nodes, the
> interconnect bandwidth becomes a precious resource, so locality
> matters more. Comparing a parallel kernel build in this environment
> with automigration [anon pages only] enabled vs disabled, I observed:
>
> ~18% improvement in real time
> ~4% improvement in user time
> ~21% improvement in system time

Nice results indeed.

> With lazy migration, we KNOW what pages the task is referencing in
> the fault path, so we move only the pages actually needed right now.
> Because any time now the scheduler could decide to move the task to
> a different node. I did add a "migration interval" control to
> experiment with different length delays in inter-node migration to
> give a task time to amortize the automigration overhead. Needs more
> "experimentation".

Ok, my idea is that even if we don't collect any "stat", we can just
clear the young bits instead of creating swap entries. That's quite
fast, especially on hugepages, as there are very few hugepmds. Then
later, from the kernel thread, we scan the hugepmds again to check if
any young bit is set, and we validate the placement of the
PageTransHuge page against the actual cpu the thread is running on
(or was last running on). That should lead to the exact same results,
but with lower overhead and no page fault at all if the placement was
already correct (and no page fault unless userland accesses the page
exactly during the migration copy for the migration entry). See the
sketch further below.

The scan of the hugepmds to see which ones have the young bit set is
equivalent to your unmapping of the pages as far as the "timing
issue" is concerned. Before you unmap the pages, things aren't
guaranteed to be running on the right node with your current
migrate-on-fault either. So I don't see why it should behave any
differently from a heuristic perspective. I just don't like unmapping
the pages and I'd prefer to just clear the young bit instead.

Now, there are architectures that may not have any young bit at all,
and those would have to approximate it by emulating the young bit in
software, so ideally the same logic should work for those archs too
(and maybe those archs are obsolete or don't need numa).

> As I mentioned above, my results were on older hardware, so it will
> be interesting to see the results on modern hardware with lots of
> guest vms as the workload. Maybe I'll get to this eventually. And I
> believe

Agreed.

> Well, as we discussed, now we have an implicit "home node" anyway:
> the node where a task's kernel data structures are first allocated.
> A task that spends much time in the kernel will always run faster on
> the node where its task struct and thread info/stack live. So, until
> we can migrate these [was easier in unix based kernels], we'll
> always have an implicit home node.

Yes, I remember your point. But there is no guarantee that the kernel
stack and the other kernel data structures were allocated on the node
local to where fork() was executed, so in effect there is no home
node. Also, by the time execve runs a couple of times in a row, the
concept of a home node as far as the kernel is concerned may well be
lost forever, and be equivalent to having allocated the task struct
and kernel stack on the wrong node during fork... So as of now there
is no real home node; we just try to do the obvious thing when fork
runs, but then it goes random over time, exactly like it happens for
userland.
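Going back to the young bit idea, here is a rough sketch of what a
single hugepmd check could look like. It's not against any particular
tree: check_thp_placement() and queue_thp_migration() are made-up
names, the pmd walk calling it is omitted, and the exact young bit
helper may differ per arch. Only the control flow matters: age the
pmd on each pass, and only if the young bit came back set and the THP
sits on the wrong node do we queue it for migration, with no unmap
and no page fault.

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/huge_mm.h>
#include <linux/topology.h>

/*
 * Sketch only: called for each hugepmd by some scanner (not shown).
 * Each pass clears the young bit; a set bit found on the next pass
 * means the THP was referenced in between, so its placement is worth
 * validating against the cpu the task last ran on.
 */
static void check_thp_placement(struct task_struct *tsk,
				struct vm_area_struct *vma,
				unsigned long addr, pmd_t *pmd)
{
	struct page *page;
	int target_nid;

	if (!pmd_trans_huge(*pmd))
		return;

	/* reports the old young bit and re-arms it for the next pass */
	if (!pmdp_test_and_clear_young(vma, addr, pmd))
		return;		/* not referenced since the last scan */

	page = pmd_page(*pmd);
	/* node of the cpu the task is running on (or last ran on) */
	target_nid = cpu_to_node(task_cpu(tsk));

	if (page_to_nid(page) == target_nid)
		return;		/* placement already correct, nothing to do */

	/* referenced and misplaced: hand it to the migration code */
	queue_thp_migration(page, target_nid);
}

Using pmdp_test_and_clear_young() here means each scan both reads and
re-arms the young bit, so a THP is only considered if it was actually
referenced between two scans.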
You're fixing userland for the long lived case; the kernel is the
same issue, but I guess we're not going to address the kernel side
any time soon (no migrate for slab, unless we dramatically slow down
all kernel allocations ;). migrate, OTOH, is a nice abstraction that
we can take advantage of for plenty of things, like memory
compaction, and here to fix up the memory locality of the tasks.

About the pagecache, the problem in slide 26 is that with make -j,
99% of the mmapped pagecache (which would be migrated-on-fault) is
shared across all tasks, so it thrashes badly and for no good reason.
The exact same regression and thrashing would happen for shared
anonymous memory I guess... (fork + reads in a loop). So it'd be
better to start migrating everything (pagecache included!) only if
page_mapcount == 1; that should be a safe start... Migrating shared
entities is way more troublesome unless we're able to verify that all
users sit on the same node (only in that case would it be safe, and
only then would it have a chance to reduce the thrashing).

So my suggestion would be to create a patch with a kernel thread that
scans a thousand ptes or hugepmds per second. For each pte or hugepmd
encountered we check that the mapcount is 1; if it is, we run
test_and_clear_young_notify() (which takes care of the shadow ptes
too and won't actually trigger any linux page fault, nor a gup_fast
for a secondary page fault, if the page is in the right place). If it
returns true we validate that the page is on the right node and, if
it isn't, we migrate it right away... That applies to the pagecache
too (only if mapcount == 1). A sketch of the scanner loop I have in
mind follows at the end of this mail.

I would remove any hook from do_swap_page, and the migrate-on-fault
concept as a whole, and I'd move the whole thing over to
test_and_clear_young_notify() instead. There's surely stuff to keep
from this patchset (all the validation of the page, the migration
invocation, etc.).

Also, currently migration won't be able to migrate a THP as such:
it'd work, but it'd split it, so it's not so nice that khugepaged is
then required to get the THP performance back after migration.
Unfortunately hugetlbfs has grown even further to mimic the core VM
code, and now migrate has special hooks to migrate hugetlbfs pages,
which will make it more difficult to teach migrate to differentiate
between a hugetlbfs migrate and a THP migrate so that both can work
simultaneously... I'm not sure why hugetlbfs has to grow so much
(especially outside of the vm_flags & VM_HUGETLB checks that in the
past kept it more separate and less intrusive into the core VM).
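And here is the scanner sketch, under heavy assumptions: struct
scan_slot, scan_next_pte() and the thread name knuma_scand are
made-up placeholders for whatever walker hands back the pte/page/task
being scanned, migrate_misplaced_page() stands in for a small wrapper
around the migrate_pages() machinery, and test_and_clear_young_notify()
is the mmu-notifier-aware young bit test mentioned above (whatever
its exact in-tree spelling). Only the control flow matters.

#include <linux/kthread.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/topology.h>

/* placeholder describing one scanned pte; the walker filling it in
 * (scan_next_pte()) is not shown and the names are made up */
struct scan_slot {
	struct vm_area_struct *vma;
	unsigned long addr;
	pte_t *ptep;
	struct page *page;
	struct task_struct *task;	/* the (single) task mapping it */
};

static int knuma_scand(void *dummy)
{
	struct scan_slot slot;

	while (!kthread_should_stop()) {
		int i;

		/* roughly a thousand ptes/hugepmds per second */
		for (i = 0; i < 1000 && scan_next_pte(&slot); i++) {
			struct page *page = slot.page;
			int target_nid;

			/*
			 * Only touch pages with a single mapper, anon
			 * and pagecache alike; shared pages are left
			 * alone to avoid the make -j style thrashing.
			 */
			if (page_mapcount(page) != 1)
				continue;

			/*
			 * Ages the pte and the shadow ptes too; no
			 * linux page fault and no gup_fast for a
			 * secondary page fault if the page is already
			 * in the right place.
			 */
			if (!test_and_clear_young_notify(slot.vma,
							 slot.addr,
							 slot.ptep))
				continue;

			/* referenced since the last pass: validate node */
			target_nid = cpu_to_node(task_cpu(slot.task));
			if (page_to_nid(page) == target_nid)
				continue;

			/* misplaced: migrate it right away */
			migrate_misplaced_page(page, target_nid);
		}
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}

Starting it would just be a kthread_run(knuma_scand, NULL,
"knuma_scand") at init time.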