Re: [PATCH v2 0/5] Add movablecore_map boot option

Jiang Liu <jiang.liu@xxxxxxxxxx> · Wed, 28 Nov 2012 16:47:42 +0800

Hi all,
	This thread seems like a good chance to discuss the memory hotplug
feature, so I will try to give some high-level thoughts about memory hotplug
on x86/IA64. Any comments are welcome!
	First of all, I think usability really matters. Ideally, the memory hotplug
feature should just work out of the box, and we shouldn't expect administrators
to add several extra platform-dependent parameters to enable it.
But how do we enable memory (or CPU/node) hotplug out of the box? I think the key
point is to cooperate with the BIOS/ACPI/firmware/device management teams.
	I still position memory hotplug as an advanced feature for high-end
servers, and those systems may/should provide some management interfaces to
configure CPU/memory/node hotplug features. The configuration UI may be provided
by the BIOS, the BMC or a centralized system management suite. Once the
administrator enables the hotplug feature through that management UI, the OS
should support system device hotplug out of the box. For example, the HP
SuperDome2 management suite provides an interface to configure a node as a
floating (hot-removable) node. And OpenSolaris supports CPU/memory hotplug out
of the box without any extra configuration. So we should shape the interfaces
between firmware and OS to better support system device hotplug.
	On the other hand, I think there are no commercially available x86/IA64
platforms with system device hotplug capabilities in the field yet, or at most
a limited quantity. So backward compatibility is not a big issue for us now,
and I think it's doable to rely on firmware to provide better support for system
device hotplug.
	Then what should be enhanced to better support system device hotplug?

1) The ACPI specification should be enhanced to provide a static table describing
components with hotplug features, so the OS could reserve special resources for
hotplug at early boot stages, for example enough CPU ids for CPU
hot-add. Currently we guess the maximum number of CPUs supported by the platform
by counting CPU entries in the APIC table, which is not reliable.

2) The BIOS should implement the SRAT, MPST and PMTT tables to better support
memory hotplug. SRAT associates memory ranges with proximity domains and carries
an extra "hotpluggable" flag. PMTT provides memory device topology information,
such as "socket->memory controller->DIMM". MPST is used for memory power management
and provides a way to associate memory ranges with memory devices in PMTT.
With all the information from SRAT, MPST and PMTT, the OS could figure out hotplug
memory ranges automatically, so no extra kernel parameters are needed.

3) Enhance ACPICA to provide a method to scan static ACPI tables before the
memory subsystem has been initialized, because the OS needs to access SRAT,
MPST and PMTT when initializing the memory subsystem.

4) The last and most important issue is how to minimize the performance
drop caused by memory hotplug. As proposed by this patchset, once we
configure all memory of a NUMA node as movable, it essentially disables
NUMA optimization of kernel memory allocation from that node. In our
experience, that will cause a huge performance drop. We have observed a
10-30% performance drop with memory hotplug enabled. And on another
OS the average performance drop caused by memory hotplug is about 10%.
If we can't resolve the performance drop, memory hotplug is just a feature
for demo :( With help from hardware, we do have some chances to reduce
the performance penalty caused by memory hotplug.
	As we know, Linux can migrate movable pages, but can't migrate
non-movable pages used by the kernel, DMA etc. The hardest part is how
to deal with those unmovable pages when hot-removing a memory device.
Now hardware has given us a hand with a technology named memory migration,
which can transparently migrate memory between memory devices. There are
no OS-visible changes except the NUMA topology before and after hardware
memory migration.
	And if there are multiple memory devices within a NUMA node,
we could configure some memory devices to host unmovable memory and the
others to host movable memory. With this configuration, there won't be
a big performance drop because we have preserved all NUMA optimizations.
We could also achieve memory hot-remove by:
1) Use the existing page migration mechanism to reclaim movable pages.
2) For memory devices hosting unmovable pages, we need to:
2.1) find a movable memory device on another node with enough capacity
and reclaim it.
2.2) use hardware migration technology to migrate unmovable memory to
the just-reclaimed memory device on the other node.
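
	The hot-remove steps above can be sketched as a small planner. This is
only an illustration, not kernel code: MemDevice, plan_node_removal and the
step names "os_migrate_pages" and "hw_migrate" are hypothetical, and the model
assumes each memory device hosts either only movable or only unmovable pages.
It does not model where the migrated movable pages end up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemDevice:
    """Hypothetical model of one memory device (e.g. the DIMMs behind one controller)."""
    name: str
    node: int
    size_gib: int
    movable: bool  # True if the device hosts only movable pages


def plan_node_removal(devices, node):
    """Return an ordered list of (action, source, target) steps that empty `node`."""
    steps = []
    on_node = [d for d in devices if d.node == node]

    # Step 1: devices holding only movable pages are emptied by OS page migration.
    for dev in on_node:
        if dev.movable:
            steps.append(("os_migrate_pages", dev.name, None))

    # Step 2: devices holding unmovable pages need a reclaimed target elsewhere.
    used_targets = set()
    for dev in on_node:
        if dev.movable:
            continue
        # Step 2.1: find a movable device on another node with enough capacity.
        candidates = [d for d in devices
                      if d.node != node and d.movable
                      and d.size_gib >= dev.size_gib
                      and d.name not in used_targets]
        if not candidates:
            raise RuntimeError(f"no reclaimable target device for {dev.name}")
        target = min(candidates, key=lambda d: d.size_gib)  # smallest fit
        used_targets.add(target.name)
        steps.append(("os_migrate_pages", target.name, None))  # reclaim the target
        # Step 2.2: hardware memory migration copies the unmovable contents over.
        steps.append(("hw_migrate", dev.name, target.name))
    return steps


devices = [
    MemDevice("n0d0", 0, 16, False),
    MemDevice("n0d1", 0, 16, True),
    MemDevice("n1d0", 1, 16, False),
    MemDevice("n1d1", 1, 16, True),
]
for step in plan_node_removal(devices, 1):
    print(step)
```

If no movable device of sufficient size exists on another node, the planner
raises an error, which corresponds to the case where the node cannot be removed.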

	I hope we can expect users to adopt memory hotplug technology
once all of this is implemented.

	Back to this patch, we could rely on the mechanism it provides
to automatically mark memory ranges as movable with information
from the ACPI SRAT/MPST/PMTT tables, so we don't need administrators to
manually configure kernel parameters to enable memory hotplug.
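
	As a rough sketch of that idea, the snippet below derives movable ranges
from SRAT-like entries using only the "enabled" and "hotpluggable" flags.
MemAffinity is a simplified stand-in for the SRAT memory affinity structure,
not its real binary layout, and movable_ranges is a hypothetical helper.
MPST and PMTT are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemAffinity:
    """Simplified stand-in for one SRAT memory affinity entry (not the real layout)."""
    proximity_domain: int
    base: int            # physical base address in bytes
    length: int          # length in bytes
    enabled: bool
    hot_pluggable: bool


def movable_ranges(entries):
    """Return merged, sorted [start, end) ranges that firmware marks hot-pluggable."""
    spans = sorted((e.base, e.base + e.length)
                   for e in entries
                   if e.enabled and e.hot_pluggable and e.length > 0)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Adjacent or overlapping with the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


GIB = 1 << 30
srat = [
    MemAffinity(0, 0 * GIB, 4 * GIB, True, False),   # not hot-pluggable
    MemAffinity(1, 4 * GIB, 2 * GIB, True, True),
    MemAffinity(1, 6 * GIB, 2 * GIB, True, True),
    MemAffinity(2, 8 * GIB, 4 * GIB, False, True),   # disabled entry, ignored
]
for start, end in movable_ranges(srat):
    print(f"movable: {start:#x}-{end:#x}")
```

Each resulting range could then be handed to the same mechanism that
movablecore_map uses, instead of being typed in by the administrator.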

	Again, any comments are welcome!

Regards!
Gerry

On 2012-11-23 18:44, Tang Chen wrote:
> [What we are doing]
> This patchset provides a boot option for users to specify the ZONE_MOVABLE
> memory map for each node in the system.
> 
> movablecore_map=nn[KMG]@ss[KMG]
> 
> This option makes sure the memory range from ss to ss+nn is movable memory.
> 
> 
> [Why we do this]
> If we hot remove a memory device, the memory cannot contain kernel memory,
> because Linux cannot migrate kernel memory currently. Therefore,
> we have to guarantee that the hot removed memory has only movable
> memory.
> 
> Linux has two boot options, kernelcore= and movablecore=, for
> creating movable memory. These boot options can specify the amount
> of memory to use as kernel or movable memory. Using them, we can
> create a ZONE_MOVABLE which has only movable memory.
> 
> But they do not fulfill a requirement of memory hot remove, because
> even if we specify the boot options, movable memory is distributed
> evenly across the nodes. So when we want to hot remove memory whose
> range is 0x80000000-0xc0000000, we have no way to specify
> that memory as movable memory.
> 
> So we propose a new feature which specifies the memory range to use as
> movable memory.
> 
> 
> [Ways to do this]
> There may be 2 ways to specify movable memory.
>  1. use firmware information
>  2. use boot option
> 
> 1. use firmware information
>   According to ACPI spec 5.0, the SRAT table has a memory affinity structure
>   and the structure has a Hot Pluggable field. See "5.2.16.2 Memory
>   Affinity Structure". If we use this information, we might be able to
>   specify movable memory by firmware. For example, if the Hot Pluggable
>   field is set, Linux sets the memory as movable memory.
> 
> 2. use boot option
>   This is our proposal. New boot option can specify memory range to use
>   as movable memory.
> 
> 
> [How we do this]
> We chose the second way, because with the first way users cannot easily
> change the memory range to use as movable memory. We think that creating
> movable memory may cause a NUMA performance regression. In that case,
> the user can easily turn off the feature if we provide the boot option.
> And with the boot option, the user can easily select which memory
> to use as movable memory.
> 
> 
> [How to use]
> Specify the following boot option:
> movablecore_map=nn[KMG]@ss[KMG]
> 
> That means physical address range from ss to ss+nn will be allocated as
> ZONE_MOVABLE.
> 
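
The nn[KMG]@ss[KMG] syntax can be illustrated with a small parser. This is
a sketch under assumptions, not the patch's implementation: the function
name parse_movablecore_map is hypothetical, and accepting hex values and
lowercase suffixes is assumed to mirror how other size-style kernel
parameters are usually written.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_SIZE = r"(0[xX][0-9a-fA-F]+|[0-9]+)([KMGkmg]?)"
_PATTERN = re.compile(_SIZE + "@" + _SIZE)


def _to_bytes(number, suffix):
    """Convert a decimal or 0x-prefixed number plus optional K/M/G suffix to bytes."""
    base = 16 if number.lower().startswith("0x") else 10
    return int(number, base) * _UNITS[suffix.upper()]


def parse_movablecore_map(value):
    """Parse 'nn[KMG]@ss[KMG]' and return the physical range (ss, ss + nn)."""
    m = _PATTERN.fullmatch(value)
    if m is None:
        raise ValueError(f"malformed movablecore_map value: {value!r}")
    nn = _to_bytes(m.group(1), m.group(2))
    ss = _to_bytes(m.group(3), m.group(4))
    return ss, ss + nn


start, end = parse_movablecore_map("1G@0x80000000")
print(f"{start:#x}-{end:#x}")
```

With this reading, movablecore_map=1G@0x80000000 describes the
0x80000000-0xc0000000 range used as the example earlier in the cover letter.
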
> And the following points should be considered.
> 
> 1) If the range lies within a single node, then from ss to the end of
>    the node will be ZONE_MOVABLE.
> 2) If the range covers two or more nodes, then from ss to the end of
>    the node will be ZONE_MOVABLE, and all the other nodes will only
>    have ZONE_MOVABLE.
> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>    unless kernelcore or movablecore is specified.
> 4) This option could be specified at most MAX_NUMNODES times.
> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>    higher priority to be satisfied.
> 6) This option has no conflict with memmap option.
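
Rules 1) to 3) can be sketched as follows. This is one reading of the
rules, not the patch's code: zone_movable_start is a hypothetical helper,
ranges are treated as half-open [start, end), and "all the other nodes" in
rule 2) is taken to mean the other nodes that the range covers.

```python
def zone_movable_start(nodes, ranges):
    """Return {nid: start address of ZONE_MOVABLE, or None if the node has none}.

    nodes  -- {nid: (node_start, node_end)}
    ranges -- list of (ss, ss + nn) taken from movablecore_map options
    """
    result = {}
    for nid, (n_start, n_end) in nodes.items():
        # Lowest address in this node that any specified range touches.
        starts = [max(ss, n_start) for ss, ee in ranges
                  if ss < n_end and ee > n_start]
        # ZONE_MOVABLE then runs from that address to the end of the node.
        result[nid] = min(starts) if starts else None
    return result


GIB = 1 << 30
nodes = {0: (0, 4 * GIB), 1: (4 * GIB, 8 * GIB),
         2: (8 * GIB, 12 * GIB), 3: (12 * GIB, 16 * GIB)}
ranges = [(2 * GIB, 3 * GIB), (6 * GIB, 10 * GIB)]
for nid, start in zone_movable_start(nodes, ranges).items():
    print(nid, None if start is None else hex(start))
```

In this example node 0 is movable from 2G to its end (rule 1), node 1 from
6G to its end and node 2 entirely (rule 2), and node 3 has no ZONE_MOVABLE
(rule 3, assuming kernelcore and movablecore are not given).
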
> 
> 
> 
> Tang Chen (4):
>   page_alloc: add movable_memmap kernel parameter
>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>     nodes
>   page_alloc: Make movablecore_map has higher priority
>   page_alloc: Bootmem limit with movablecore_map
> 
> Yasuaki Ishimatsu (1):
>   x86: get pg_data_t's memory from other node
> 
>  Documentation/kernel-parameters.txt |   17 +++
>  arch/x86/mm/numa.c                  |   11 ++-
>  include/linux/memblock.h            |    1 +
>  include/linux/mm.h                  |   11 ++
>  mm/memblock.c                       |   15 +++-
>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>  6 files changed, 263 insertions(+), 8 deletions(-)
> 
> 
> .
> 
