RE: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers

Aneesh Kumar K.V wrote:
> In the current kernel, memory tiers are defined implicitly via a demotion path
> relationship between NUMA nodes, which is created during the kernel
> initialization and updated when a NUMA node is hot-added or hot-removed. The
> current implementation puts all nodes with CPU into the highest tier, and builds
> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
> based on the distances between nodes.
> 
> This current memory tier kernel implementation needs to be improved for several
> important use cases:
> 
> The current tier initialization code always initializes each memory-only NUMA
> node into a lower tier. But a memory-only NUMA node may have a high performance
> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
> should be put into a higher tier.
> 
> The current tier hierarchy always puts CPU nodes into the top tier. But on a
> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
> should be in the top tier, and DRAM nodes with CPUs are better placed in
> the next lower tier.
> 
> With the current kernel, a higher tier node can only be demoted to the nodes
> with the shortest distance on the next lower tier as defined by the demotion
> path, not to any other node from any lower tier. This strict demotion order
> does not work in all use cases (e.g. some use cases may want to allow
> cross-socket demotion to another node in the same demotion tier as a fallback
> when the preferred demotion node is out of space). This demotion order is also
> inconsistent with the page allocation fallback order when all the nodes in a
> higher tier are out of space: the page allocation can fall back to any node
> from any lower tier, whereas the demotion order doesn't allow that.
> 
> This patch series addresses the above by defining memory tiers explicitly.
> 
> The Linux kernel presents memory devices as NUMA nodes and each memory device
> is of a specific type. The memory type of a device is represented by its
> abstract distance. A memory tier corresponds to a range of abstract distances.
> This allows for classifying memory devices with a specific performance range
> into a memory tier.
> 
> This patch configures the range/chunk size to be 128. The default DRAM
> abstract distance is 512. We can have 4 memory tiers below the default DRAM
> abstract distance which cover the ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
> Slower memory devices like persistent memory will have abstract distance below
> the default DRAM level and hence will be placed in these 4 lower tiers.
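
The chunk arithmetic described above can be sketched as follows. This is only an illustration: the macros are copied from the quoted patch, but the helper functions are hypothetical names that do not appear in it, and the sketch assumes a tier is selected simply by which 128-wide chunk an abstract distance falls into.

```c
#include <assert.h>

/* Values copied from the quoted patch. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)

/*
 * Hypothetical helper (not in the patch): index of the chunk that
 * covers a given abstract distance. Assumes chunk == tier.
 */
static int adistance_to_chunk(int adistance)
{
	assert(adistance >= 0);
	return adistance >> MEMTIER_CHUNK_BITS;
}

/* Hypothetical helper: first abstract distance covered by a chunk. */
static int chunk_first(int chunk)
{
	return chunk << MEMTIER_CHUNK_BITS;
}

/* Hypothetical helper: last abstract distance covered by a chunk. */
static int chunk_last(int chunk)
{
	return chunk_first(chunk) + MEMTIER_CHUNK_SIZE - 1;
}
```

Under that assumption, chunks 0 through 3 cover 0 - 127, 128 - 255, 256 - 383 and 384 - 511, MEMTIER_ADISTANCE_PMEM (128) falls in chunk 1, and MEMTIER_ADISTANCE_DRAM (512) falls in chunk 4.
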
> 
> A kernel parameter is provided to override the default memory tier.
> 
> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@xxxxxxxxxxxxxx
> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@xxxxxxxxxxxxx
> 
> Signed-off-by: Jagdish Gediya <jvgediya@xxxxxxxxxxxxx>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>
> ---
>  include/linux/memory-tiers.h |  17 ++++++
>  mm/Makefile                  |   1 +
>  mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
>  3 files changed, 120 insertions(+)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
> 
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..8d7884b7a3f0
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +/*
> + * Each tier covers an abstract distance chunk size of 128
> + */
> +#define MEMTIER_CHUNK_BITS	7
> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
> +/*
> + * For now let's have 4 memory tiers below the default DRAM tier.
> + */
> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
> +/* leave one tier below this slow pmem */
> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)

Why is memory type encoded in these values? There is no reason to
believe that PMEM is of a lower performance tier than DRAM. Consider
high performance energy-backed DRAM, which that backing makes "PMEM", or
consider CXL attached DRAM behind a switch topology and constrained links
that make it a lower performance tier than locally attached DRAM. The
names should be associated with tiers that indicate their usage, something
like HOT, GENERAL, and COLD, where, for example, HOT is low capacity and
high performance compared to the general purpose pool, and COLD is high
capacity and low performance, intended to offload the general purpose tier.

It does not need to be exactly that ontology, but please try to not
encode policy meaning behind memory types. There has been explicit
effort to avoid that to date because types are fraught for declaring
relative performance characteristics, and the relative performance
changes based on what memory types are assembled in a given system.
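
To make the suggestion concrete, here is a rough sketch of what usage-named tiers could look like, assuming a node is classified by its performance relative to the general purpose pool of the system it is in, not by its memory type. Every name below (the enum, the struct, the function) is hypothetical, and read latency is used as a stand-in for "performance" only to keep the example short.

```c
/*
 * Hypothetical usage-based tier names, following the HOT / GENERAL /
 * COLD naming above. None of this is from the patch.
 */
enum mem_usage_tier {
	MEM_TIER_HOT,		/* faster than the general purpose pool */
	MEM_TIER_GENERAL,	/* the general purpose pool itself */
	MEM_TIER_COLD,		/* slower, offloads the general purpose tier */
};

/* Hypothetical per-node performance summary. */
struct node_perf {
	unsigned int read_latency_ns;
};

/*
 * Classify a node relative to the general purpose pool. The memory
 * type (DRAM, PMEM, CXL, ...) is deliberately not an input. The strict
 * comparison is a crude assumption; a real policy would likely want a
 * tolerance band and would also look at bandwidth and capacity.
 */
static enum mem_usage_tier classify_node(const struct node_perf *node,
					 const struct node_perf *general)
{
	if (node->read_latency_ns < general->read_latency_ns)
		return MEM_TIER_HOT;
	if (node->read_latency_ns > general->read_latency_ns)
		return MEM_TIER_COLD;
	return MEM_TIER_GENERAL;
}
```

With this shape, energy-backed DRAM exposed as "PMEM" that performs like local DRAM lands in the general tier, and CXL attached DRAM behind a constrained link lands in the cold tier, without either result being baked into a per-type constant.
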



