Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

Gregory Price <gregory.price@xxxxxxxxxxxx> · Tue, 19 Sep 2023 15:59:57 -0400

On Tue, Sep 19, 2023 at 11:52:27AM -0700, Andy Lutomirski wrote:
> 
> 
> On Tue, Sep 19, 2023, at 11:20 AM, Gregory Price wrote:
> > On Tue, Sep 19, 2023 at 10:59:33AM -0700, Andy Lutomirski wrote:
> >> 
> >> I'm not complaining about the name.  I'm objecting about the semantics.
> >> 
> >> Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map do (without crawling /proc or /sys, anyway).  But that means you have no idea when the logical contents of those pages *changes*.  So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.
> >
> > How does this change if I use virtual address based migration?
> >
> > I could do sampling based on virtual address (page faults, IBS/PEBs,
> > whatever), and by the time I make a decision, the kernel could have
> > migrated the data or even my task from Node A to Node B.  The sample I
> > took is now stale, and I could make a poor migration decision.
> 
> The window is a lot narrower. If you’re sampling by VA, you collect stats and associate them with the logical page (the tuple (mapping, VA), for example).  The kernel can do this without races from page faults handlers.  If you sample based on PA, you fundamentally race against anything that does migration.
> 

Race conditions are race conditions, narrow or otherwise.  The narrow-ness
of the condition is either irrelevant, or you can accept some level of race
because the goal allows for slop within a certain window.

In fact, the entire purpose of this interface is to very explicity
reduce the time it takes to go from sampling to migration decision.

Certainly a user could simply use a heatmap with cgroups.numa_stat
to determine what processes they should interrogate - but for systems
with many tennants/processes/tasks, sometimes just acting on the overall
view of memory would be better served if *major* changes have to be made
to the overall distribution.

Similarly, collection of data can likewise we made more performant.

Faults introduce runtime overhead.  Pushing heatmap collection to
devices alleviates the need for introducing that overhead.  Use of
IBS/PEBS is (by nature of sampling and a variety of quirks having to do
with the prefetcher) quite inaccurate and costly to compute after being
collected.

Ultimately if your goal is extremely high precision for the performance
of a single process, I agree with your analysis.  However, I never claimed
the goal is precision on the level a single process.  On the contrary,
the goal is system-wide tiering based on cheap-to-acquire information.

> >
> > If I do move_pages(pid, some_virt_addr, some_node) and it migrates the
> > page from NodeA to NodeB, then the device-side collection is likewise
> > no longer valid.  This problem doesn't change because I used virtual
> > address compared to physical address.
> 
> Sure it does, as long as you collect those samples when you migrate. And I think the kernel migrating to or from device memory (or more generally allocating and freeing device memory and possibly even regular memory) *should* be aware of whatever hotness statistics are in use.
>

I'm not keen on "the kernel should..." implying "user land should not".

I don't disagree that the kernel could/should/would use this same
information.  My proposal here is that there is value in enabling
userland to do the same thing (so long as security is not compromised).

> >
> > But if i have a 512GB memory device, and i can see a wide swath of that
> > 512GB is hot, while a good chunk of my local DRAM is not - then I
> > probably don't care *what* gets migrated up to DRAM, i just care that a
> > vast majority of that hot data does.
> >
> > The goal here isn't 100% precision, you will never get there. The goal
> > here is broad-scope performance enhancements of the overall system
> > while minimizing the cost to compute the migration actions to be taken.
> >
> > I don't think the contents of the page are always relevant.  The entire
> > concept here is to enable migration without caring about what programs
> > are using the memory for - just so long as the memcg's and zoning is
> > respected.
> >
> 
> At the very least I think you need to be aware of page *size*.  And if you want to avoid excessive fragmentation, you probably also want to be aware of the boundaries of a logical allocation.
>

Re page size: this was brought up in a prior email, and I agree,
but that also doesn't really change for move_pages.  The man page for
move_pages already says this:

"""
pages is an array of pointers to the pages that should be moved.
These are pointers that should be aligned to page boundaries.
"""

But to an extent I agree with you.

A device collecting data as I describe won't necessarily know the
page size (or there may even be multiple sized pages hosted on the
device).  Operating on that data blind does have some risks.

Some work probably could be done to ensure the data being used doesn't,
for example, cause a hugepage to be migrated multiple times.  In that
regard, if the address doesn't match the base of the page the migration
should fail.

I think there are simple ways to avoid these footguns.

> 
> I think that doing this entire process by PA, blind, from userspace will end up stuck in a not-so-good solution, and the ABI will be set in stone, and it will not be a great situation for long term maintainability or performance.

"entire process" and "blind" are a little bit of an straw man here. The
goal is to marry many pieces of data to make decisions about how and
what to move, with as many (flexible) tools available to enact changes
quickly to reduce the staleness of information issue.

Sometimes it would be "more blind" than others, yes.  In other cases,
such a device-provided heatmap would simply be informational for overall
system health.

To me, unless I'm misunderstanding, it sounds like you think system-wide
tiering decisions should not be a function userland is empowered to do
directly, but instead should be implemented entirely in the kernel?

~Gregory