Re: [PATCH 0/3] memory tiering: hot page selection

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Fri, 8 Apr 2022 21:07:09 -0700

On Fri,  8 Apr 2022 15:12:19 +0800 Huang Ying <ying.huang@xxxxxxxxx> wrote:

> To optimize page placement in a memory tiering system with NUMA
> balancing, the hot pages in the slow memory node need to be
> identified.  Essentially, the original NUMA balancing implementation
> selects and promote the mostly recently accessed (MRU) pages.  But the
> recently accessed pages may be cold.

Wait.  What.  How on earth could the most recently accessed page be
considered "cold"?

Changelog needs work, please.

> So in this patchset, we
> implement a new hot page identification algorithm based on the latency
> between NUMA balancing page table scanning and hint page fault.

That sounds reasonable.

> And the hot page promotion can incur some overhead in the system.  To
> control the overhead a simple promotion rate limit mechanism is
> implemented.

That sounds like a hack to fix a problem you just added?  Maybe a
reasonable one..

> The hot threshold used to identify the hot pages is workload dependent
> usually.  So we also implemented a hot threshold automatic adjustment
> algorithm.  The basic idea is to increase/decrease the hot threshold
> to make the number of pages that pass the hot threshold (promote
> candidate) near the rate limit.

hot threshold is tweakable via debugfs.  So it's an unstable interface,
undocumented, may be withdrawn at any time, etc.

Please justify this.  Is it intended that this interface be removed? 
If yes, please describe the plan for this in the changelog.  If no then
this interface should be in sysfs, it should be fully documented in the
kernel tree and it should be fully changelogged (preferably with usage
examples) so that reviewers can best understand what is a permanent
extension to the kernel API which we must maintain for all time.

Does this code build correctly if !LAST_CPUPID_NOT_IN_PAGE_FLAGS?

> TODO: Add ABI document for new sysctl knob.

um, yes.

check_cpupid() is a poor function name.  Check for what?  Liver damage?
A better identifier would be cpupid_valid(), perhaps.  And perhaps it
should be implemented in mm.h.  And even documented.

Oddly, the proposed changes do a decent job of documenting
intra-function changes but a poor job of documenting newly added
functions.  Please review every new function and decide whether it is
so simple and obvious that it can be added to Linux without any inline
description at all.  

pgdat_free_space_enough() and  numa_hint_fault_latency().

numa_migration_check_rate_limit() particularly.  There's that "check"
word again.  Check for what?

The numa_balancing_rate_limit_mbps sysctl.  I assume "b" means "bytes".
That's better than "pages", but still.  Machines vary a lot in how
many bytes they have and in the speed at which they process those
bytes.  Is there some metric which we can use here which at least
partially insulates us from this variability?  So that sysadmins don't
need to set all their machines differently, based upon their size and
speed?

numa_threshold_ts is apparently in units of jiffies.  This was not
documented at the definition site, and it should be.  But jiffies can
vary between machines and configs.  Why add this inter-machine and
inter-config uncertainty/variability when we have include/linux/ktime.h?