Re: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers

Dan Williams <dan.j.williams@xxxxxxxxx> writes:

> Huang, Ying wrote:
>> Dan Williams <dan.j.williams@xxxxxxxxx> writes:
>> 
>> > Aneesh Kumar K.V wrote:
>> >> In the current kernel, memory tiers are defined implicitly via a demotion path
>> >> relationship between NUMA nodes, which is created during the kernel
>> >> initialization and updated when a NUMA node is hot-added or hot-removed. The
>> >> current implementation puts all nodes with CPU into the highest tier, and builds
>> >> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
>> >> based on the distances between nodes.
>> >> 
>> >> The current memory tier implementation in the kernel needs to be improved for
>> >> several important use cases:
>> >> 
>> >> The current tier initialization code always initializes each memory-only NUMA
>> >> node into a lower tier. But a memory-only NUMA node may have a high performance
>> >> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
>> >> should be put into a higher tier.
>> >> 
>> >> The current tier hierarchy always puts CPU nodes into the top tier. But on a
>> >> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
>> >> should be in the top tier, and DRAM nodes with CPUs are better to be placed into
>> >> the next lower tier.
>> >> 
>> >> With the current kernel, a higher-tier node can only be demoted to the nodes with
>> >> the shortest distance on the next lower tier, as defined by the demotion path,
>> >> not to any other node from any lower tier. This strict demotion order does not
>> >> work in all use cases (e.g. some use cases may want to allow cross-socket
>> >> demotion to another node in the same demotion tier as a fallback when the
>> >> preferred demotion node is out of space). This demotion order is also
>> >> inconsistent with the page allocation fallback order when all the nodes in a
>> >> higher tier are out of space: the page allocation can fall back to any node from
>> >> any lower tier, whereas the demotion order doesn't allow that.
>> >> 
>> >> This patch series addresses the above by defining memory tiers explicitly.
>> >> 
>> >> The Linux kernel presents memory devices as NUMA nodes, and each memory device is
>> >> of a specific type. The memory type of a device is represented by its abstract
>> >> distance. A memory tier corresponds to a range of abstract distances. This allows
>> >> for classifying memory devices with a specific performance range into a memory
>> >> tier.
>> >> 
>> >> This patch configures the range/chunk size to be 128. The default DRAM
>> >> abstract distance is 512. We can have 4 memory tiers below the default DRAM
>> >> abstract distance, which cover the ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
>> >> Slower memory devices like persistent memory will have an abstract distance below
>> >> the default DRAM level and hence will be placed in these 4 lower tiers.
>> >> 
>> >> A kernel parameter is provided to override the default memory tier.
>> >> 
>> >> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@xxxxxxxxxxxxxx
>> >> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@xxxxxxxxxxxxx
>> >> 
>> >> Signed-off-by: Jagdish Gediya <jvgediya@xxxxxxxxxxxxx>
>> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>
>> >> ---
>> >>  include/linux/memory-tiers.h |  17 ++++++
>> >>  mm/Makefile                  |   1 +
>> >>  mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
>> >>  3 files changed, 120 insertions(+)
>> >>  create mode 100644 include/linux/memory-tiers.h
>> >>  create mode 100644 mm/memory-tiers.c
>> >> 
>> >> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> >> new file mode 100644
>> >> index 000000000000..8d7884b7a3f0
>> >> --- /dev/null
>> >> +++ b/include/linux/memory-tiers.h
>> >> @@ -0,0 +1,17 @@
>> >> +/* SPDX-License-Identifier: GPL-2.0 */
>> >> +#ifndef _LINUX_MEMORY_TIERS_H
>> >> +#define _LINUX_MEMORY_TIERS_H
>> >> +
>> >> +/*
>> >> + * Each tier covers an abstract distance chunk size of 128
>> >> + */
>> >> +#define MEMTIER_CHUNK_BITS	7
>> >> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
>> >> +/*
>> >> + * For now let's have 4 memory tiers below the default DRAM tier.
>> >> + */
>> >> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>> >> +/* leave one tier below this slow pmem */
>> >> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
>> >
>> > Why is memory type encoded in these values? There is no reason to
>> > believe that PMEM is of a lower performance tier than DRAM. Consider
>> > high performance energy backed DRAM that makes it "PMEM", consider CXL
>> > attached DRAM over a switch topology and constrained links that makes it
>> > a lower performance tier than locally attached DRAM. The names should be
>> > associated with tiers that indicate their usage. Something like HOT,
>> > GENERAL, and COLD. Where, for example, HOT is low capacity high
>> > performance compared to the general purpose pool, and COLD is high
>> > capacity low performance intended to offload the general purpose tier.
>> >
>> > It does not need to be exactly that ontology, but please try to not
>> > encode policy meaning behind memory types. There has been explicit
>> > effort to avoid that to date because types are fraught for declaring
>> > relative performance characteristics, and the relative performance
>> > changes based on what memory types are assembled in a given system.
>> 
>> Yes.  MEMTIER_ADISTANCE_PMEM is oversimplified.  It is only
>> used in this very first version to make it as simple as possible.
>
> I am failing to see the simplicity of using names that convey a
> performance contract that is invalid depending on the system.
>
>> I think we can come up with something better in the later version.
>> For example, identify the abstract distance of a PMEM device based on
>> HMAT, etc. 
>
> Memory tiering has nothing to do with persistence. Why is PMEM in the
> name at all?
>
>>  And even in this first version, we should put MEMTIER_ADISTANCE_PMEM
>>  in dax/kmem.c.  Because it's just for that specific type of memory
>>  used now, not for all PMEM.
>
> dax/kmem.c also handles HBM and "soft reserved" memory in general. There
> is also nothing PMEM specific about the device-dax subsystem.

Ah... I see the issue here.  On the systems we have at hand, dax/kmem.c is
used to online PMEM only.  Even the "soft reserved" memory is used for
PMEM or for simulating PMEM.  So, to make the code as simple as possible,
we treat all memory devices onlined by dax/kmem as PMEM in the first
version, and plan to support more memory types in future versions.

But from what you said above, our assumption is wrong here.  dax/kmem.c
can already online HBM and other memory devices.  If so, how do we
distinguish between them, and how do we get the performance
characteristics of these devices?  Can we start with SLIT?
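
For reference, the arithmetic behind the constants in the quoted header can
be checked with a small user-space sketch.  The four macros are copied from
the quoted hunk.  The helpers tier_index(), tier_start() and
memtier_selfcheck() are hypothetical names used only for illustration; they
are not part of the patch, and the sketch assumes that a tier is selected
simply by dropping the low MEMTIER_CHUNK_BITS bits of the abstract distance.

```c
#include <assert.h>

/* Values copied from the quoted include/linux/memory-tiers.h hunk. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)

/*
 * Hypothetical helper (assumption, not from the patch): index of the
 * 128-wide chunk that an abstract distance falls into.
 */
static inline int tier_index(int adistance)
{
	return adistance >> MEMTIER_CHUNK_BITS;
}

/*
 * Hypothetical helper (assumption, not from the patch): first abstract
 * distance of the chunk that contains adistance.
 */
static inline int tier_start(int adistance)
{
	return adistance & ~(MEMTIER_CHUNK_SIZE - 1);
}

/* Checks the ranges stated in the quoted commit message. */
static inline void memtier_selfcheck(void)
{
	assert(MEMTIER_CHUNK_SIZE == 128);
	assert(MEMTIER_ADISTANCE_DRAM == 512);
	assert(MEMTIER_ADISTANCE_PMEM == 128);

	/* Four chunks lie below the default DRAM distance: indices 0..3. */
	assert(tier_index(MEMTIER_ADISTANCE_DRAM) == 4);
	assert(tier_index(511) == 3);

	/* PMEM sits in chunk 1, which leaves chunk 0 (0 - 127) below it. */
	assert(tier_index(MEMTIER_ADISTANCE_PMEM) == 1);
	assert(tier_index(127) == 0);
}
```

Under that assumption, calling memtier_selfcheck() from a test program
confirms the ranges 0 - 127, 128 - 255, 256 - 383 and 384 - 511 below the
default DRAM distance.  It says nothing about how the abstract distance of a
given device should be chosen, which is the open question here.
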

>> In the current design, memory type is used to report the performance of
>> the hardware, in terms of abstract distance, per Johannes' suggestion.
>
> That sounds fine, just pick an abstract name, not an explicit memory
> type.
>
>> Which is an abstraction of memory latency and bandwidth.  Policy is
>> described via memory tiers.  Several memory types may be put in one
>> memory tier.  The abstract distance chunk size of the memory tier may
>> be adjusted according to policy.
>
> That part all sounds good. That said, I do not see the benefit of
> waiting to run away from these inadequate names.

Good!

Best Regards,
Huang, Ying



