Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On 6/6/22 1:23 PM, Ying Huang wrote:
On Mon, 2022-06-06 at 11:57 +0530, Aneesh Kumar K.V wrote:
Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> writes:

On 6/6/22 11:03 AM, Ying Huang wrote:
On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
On 6/6/22 8:19 AM, Ying Huang wrote:
On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
From: Jagdish Gediya <jvgediya@xxxxxxxxxxxxx>

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during kernel initialization and updated when a NUMA node is
hot-added or hot-removed.  The current implementation puts all
nodes with CPUs into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

The current memory tier kernel interface needs to be improved for
several important use cases:

The current tier initialization code always initializes
each memory-only NUMA node into a lower tier.  But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top
tier.  But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and DRAM nodes with CPUs are better placed in the
next lower tier.

With the current kernel, a higher-tier node can only be demoted to
selected nodes in the next lower tier, as defined by the demotion
path, not to any other node in any lower tier.  This strict,
hard-coded demotion order does not work in all use cases (e.g. some
use cases may want to allow cross-socket demotion to another node in
the same demotion tier as a fallback when the preferred demotion
node is out of space).  This demotion order is also inconsistent
with the page allocation fallback order when all the nodes in a
higher tier are out of space: the page allocation can fall back to
any node in any lower tier, whereas the demotion order doesn't
allow that.

The current kernel also doesn't provide any interface for
userspace to learn about the memory tier hierarchy in order to
optimize its memory allocations.

This patch series addresses the above by defining memory tiers explicitly.

This patch adds the sysfs interface below, which is read-only and
can be used to read the nodes available in a specific tier:

/sys/devices/system/memtier/memtierN/nodelist

Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
lowest tier.  The absolute value of a tier id number has no specific
meaning; what matters is the relative order of the tier id numbers.

All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
The default number of memory tiers is MAX_MEMORY_TIERS (3).  All
nodes are assigned to DEFAULT_MEMORY_TIER (1) by default.

The default memory tier can be read from:
/sys/devices/system/memtier/default_tier

The maximum memory tier can be read from:
/sys/devices/system/memtier/max_tiers
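
For illustration, here is a minimal userspace sketch that reads the
nodelist of tier 1 through this interface. It is a hypothetical usage
example, not part of the patch; it assumes the sysfs layout above is
present and that memtier1 exists on the running kernel.

#include <stdio.h>

int main(void)
{
	char buf[128];
	FILE *f = fopen("/sys/devices/system/memtier/memtier1/nodelist", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* nodelist is a human-readable node mask, e.g. "0-1\n" */
	if (fgets(buf, sizeof(buf), f))
		printf("nodes in tier 1: %s", buf);
	fclose(f);
	return 0;
}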

This patch implements the RFC spec sent by Wei Xu <weixugc@xxxxxxxxxx> at [1].

[1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@xxxxxxxxxxxxxx/

Signed-off-by: Jagdish Gediya <jvgediya@xxxxxxxxxxxxx>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>

IMHO, we should change the kernel internal implementation first, then
implement the kernel/user space interface.  That is, make memory tiers
explicit inside the kernel, then expose them to user space.

Why ignore this comment for v5?  If you don't agree, please respond to me.


I am not sure what benefit such a rearrangement would bring. Right now I
am writing the series from the point of view of introducing all the
plumbing and then switching the existing demotion logic to use the new
infrastructure. Redoing the code to hide all the userspace sysfs bits until
we switch the demotion logic to use the new infrastructure doesn't really
bring any additional clarity to patch review and would require me to
redo the series with a lot of conflicts across the patches in the patchset.

IMHO, we shouldn't introduce a regression even in the middle of a
patchset.  Each step should only rely on previous patches in the series
to work correctly.  In your current way of organizing it, after patch
[1/7], on a system with 2 memory tiers, the user space interface will
output wrong information (only 1 memory tier).  So I think the correct
way is to make it right inside the kernel first, then expose the right
information to user space.


The patchset doesn't add an additional tier until "mm/demotion/dax/kmem:
Set node's memory tier to MEMORY_TIER_PMEM", i.e. no additional
tiers are created until all the demotion logic is in place. So even if the
system has dax/kmem, the support for adding dax/kmem as a memory tier
comes later in the patch series.

Let me clarify this a bit more. This patchset doesn't change the
existing kernel behavior until "mm/demotion: Build demotion targets
based on explicit memory tiers". So there is no regression until then.
It adds a parallel framework (memory tiers) alongside the existing
demotion logic.

I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
with two memory tiers (DRAM and pmem) the demotion continues to work
as expected after patch 3 ("mm/demotion: Build demotion targets based on
explicit memory tiers"). With that, there will not be any regression
within the patch series.


Thanks!  Please do that.  And I think you can add the sysfs interface
after that patch too.  That is, in [1/7] use:


I am not sure why you insist on moving the sysfs interfaces later. They are introduced along with the helpers they build on. It makes patch review easier to look at both the helpers and the users of those helpers together in a patch.

+struct memory_tier {
+	nodemask_t nodelist;
+};

And struct device can be added after the kernel has switched to the
implementation based on explicit memory tiers:

+struct memory_tier {
+	struct device dev;
+	nodemask_t nodelist;
+};



Can you elaborate on this?  Or possibly review the v5 series, indicating what change you are suggesting here?


But I don't think it's a good idea to have "struct device" embedded in
"struct memory_tier".  We don't have "struct device" embedded in "struct
pglist_data"...


I avoided creating an array for memory_tier (memory_tier[]) so that we can keep it dynamic. Keeping dev embedded in struct memory_tier simplifies the life cycle management of that dynamic list. We free the struct memory_tier allocation via the device release function (memtier->dev.release = memory_tier_device_release).
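
For illustration, a minimal sketch of that lifecycle pattern, assuming
the embedded-device layout quoted earlier. memory_tier_device_release
matches the name mentioned above, while register_memory_tier and the
device naming are hypothetical; the real series hangs the devices off a
memtier subsystem, which is omitted here for brevity.

#include <linux/device.h>
#include <linux/nodemask.h>
#include <linux/slab.h>

struct memory_tier {
	struct device dev;
	nodemask_t nodelist;
};

/* Called when the last reference to tier->dev is dropped. */
static void memory_tier_device_release(struct device *dev)
{
	struct memory_tier *tier = container_of(dev, struct memory_tier, dev);

	kfree(tier);
}

/* Hypothetical helper: allocate and register one dynamic tier. */
static struct memory_tier *register_memory_tier(unsigned int tier_id)
{
	struct memory_tier *tier = kzalloc(sizeof(*tier), GFP_KERNEL);

	if (!tier)
		return NULL;
	device_initialize(&tier->dev);
	tier->dev.release = memory_tier_device_release;
	if (dev_set_name(&tier->dev, "memtier%u", tier_id) ||
	    device_add(&tier->dev)) {
		put_device(&tier->dev);	/* drops the ref; ->release frees tier */
		return NULL;
	}
	return tier;
}

Teardown is then just device_unregister(&tier->dev): it removes the
device and drops the last reference, which invokes ->release and frees
the struct memory_tier, so no separate free path is needed.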

Why do you think it is not a good idea?

-aneesh




