On Tue, May 25, 2021 at 12:26:04PM +0200, David Hildenbrand wrote:
> The memory hot(un)plug documentation is outdated and incomplete. Most of
> the content dates back to 2007, so it's time for a major overhaul.
>
> Let's rewrite, reorganize and update most parts of the documentation. In
> addition to memory hot(un)plug, also add some details regarding
> ZONE_MOVABLE, with memory hotunplug being one of its main consumers.
>
> The style of the document is also properly fixed that e.g., "restview"
> renders it cleanly now.
>
> In the future, we might add some more details about virt users like
> virtio-mem, the XEN balloon, the Hyper-V balloon and ppc64 dlpar.
>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Oscar Salvador <osalvador@xxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Cc: Mike Rapoport <rppt@xxxxxxxxxx>
> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Anshuman Khandual <anshuman.khandual@xxxxxxx>
> Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> Cc: Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>
> Cc: Jonathan Corbet <corbet@xxxxxxx>
> Cc: Stephen Rothwell <sfr@xxxxxxxxxxxxxxxx>
> Cc: linux-doc@xxxxxxxxxxxxxxx
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> ---
>
> Based on linux-next, which includes hugetlb vmemmap changes to the doc
> that are not upstream yet.
>
> ---
>  .../admin-guide/mm/memory-hotplug.rst         | 738 +++++++++++-------
>  1 file changed, 440 insertions(+), 298 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
> index c6bae2d77160..c95f5c2b30dd 100644
> --- a/Documentation/admin-guide/mm/memory-hotplug.rst
> +++ b/Documentation/admin-guide/mm/memory-hotplug.rst

...

> +ZONE_MOVABLE
> +============
> +
> +ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
> +Further, having system RAM managed by ZONE_MOVABLE instead of one of the
> +kernel zones can increase the number of possible transparent huge pages and
> +dynamically allocated huge pages.
> +

I'd move the first two paragraphs from "Zone Imbalances" here to provide
some context on what is a movable and what is an unmovable allocation.

> +Only movable allocations are served from ZONE_MOVABLE, resulting in
> +unmovable allocations being limited to the kernel zones. Without ZONE_MOVABLE,
> +there is absolutely no guarantee whether a memory block can be offlined
> +successfully.
> +
> +Zone Imbalances
> +---------------
> +
> +Most kernel allocations are unmovable. Important examples include the memmap
> +(usually 1/64 of memory), page tables, and kmalloc(). Such allocations
> +can only be served from the kernel zones.
> +
> +Most user space pages, such as anonymous memory, and page cache pages
> +are movable. Such allocations can be served from ZONE_MOVABLE and the kernel
> +zones.
> +
> +Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
> +which can harm the system or degrade performance. As one example, the kernel
> +might crash because it runs out of free memory for unmovable allocations,
> +although there is still plenty of free memory left in ZONE_MOVABLE.
> +
> +Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
> +are definitely impossible due to the memmap overhead.
> +
> +Actual safe zone ratios depend on the workload. Extreme cases, like excessive
> +long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.
>
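Not something from this patch, but while we are at zone ratios: maybe hint
that the current split can be inspected on a running system. The per-zone
"managed" counters in /proc/zoneinfo show how many pages each kernel zone
and ZONE_MOVABLE actually manage, e.g.:

  % grep -E "^Node|managed" /proc/zoneinfo

which prints each zone header followed by its "managed" page count, enough
to eyeball the MOVABLE:KERNEL ratio discussed above. Just a suggestion.
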
>  .. note::
> -  Techniques that rely on long-term pinnings of memory (especially, RDMA and
> -  vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
> -  hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
> -  memory can still get hot removed - be aware that pinning can fail even if
> -  there is plenty of free memory in ZONE_MOVABLE. In addition, using
> -  ZONE_MOVABLE might make page pinning more expensive, because pages have to be
> -  migrated off that zone first.
>
> -.. _memory_hotplug_how_to_offline_memory:
> +   CMA memory part of a kernel zone essentially behaves like memory in
> +   ZONE_MOVABLE and similar considerations apply, especially when combining
> +   CMA with ZONE_MOVABLE.
>
> -How to offline memory
> ----------------------
> +Considerations

"ZONE_MOVABLE Sizing Considerations"?

I'd also move the contents of "Boot Memory and ZONE_MOVABLE" here (with
some adjustments):

  By default, all the memory configured at boot time is managed by the
  kernel zones and ZONE_MOVABLE is not used.

  To enable ZONE_MOVABLE to include the memory present at boot and to
  control the ratio between movable and kernel zones there are two
  command line options: ``kernelcore=`` and ``movablecore=``. See
  Documentation/admin-guide/kernel-parameters.rst for their description.

> +--------------
>
> -You can offline a memory block by using the same sysfs interface that was used
> -in memory onlining::
> +We usually expect that a large portion of available system RAM will actually
> +be consumed by user space, either directly or indirectly via the page cache. In
> +the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.
>
> -	% echo offline > /sys/devices/system/memory/memoryXXX/state
> +With that in mind, it makes sense that we can have a big portion of system RAM
> +managed by ZONE_MOVABLE. However, there are some things to consider when
> +using ZONE_MOVABLE, especially when fine-tuning zone ratios:
>
> -If offline succeeds, the state of the memory block is changed to be "offline".
> -If it fails, some error core (like -EBUSY) will be returned by the kernel.
> -Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
> -it. If it doesn't contain 'unmovable' memory, you'll get success.
> +- Having a lot of offline memory blocks. Even offline memory blocks consume
> +  memory for metadata and page tables in the direct map; having a lot of
> +  offline memory blocks is not a typical case, though.
> +
> +- Memory ballooning. Some memory ballooning implementations, such as
> +  the Hyper-V balloon, the XEN balloon, the vbox balloon and the VMWare

So, everyone except virtio-mem? ;-)
I'd drop the names, because if some of them implement balloon compaction
later, updating the docs will surely be forgotten.

> +  balloon with huge pages don't support balloon compaction and, thereby
> +  ZONE_MOVABLE.
> +
> +  Further, CONFIG_BALLOON_COMPACTION might be disabled. In that case, balloon
> +  inflation will only perform unmovable allocations and silently create a
> +  zone imbalance, usually triggered by inflation requests from the
> +  hypervisor.
> +
> +- Gigantic pages are unmovable, resulting in user space consuming a
> +  lot of unmovable memory.
> +
> +- Huge pages are unmovable when an architectures does not support huge
> +  page migration, resulting in a similar issue as with gigantic pages.
> +
> +- Page tables are unmovable. Excessive swapping, mapping extremely large
> +  files or ZONE_DEVICE memory can be problematic, although only
> +  really relevant in corner cases. When we manage a lot of user space memory
> +  that has been swapped out or is served from a file/pmem/... we still need
                                                        ^ persistent memory
> +  a lot of page tables to manage that memory once user space accessed that
> +  memory once.
> +
> +- DAX: when we have a lot of ZONE_DEVICE memory added to the system as DAX
> +  and we are not using an altmap to allocate the memmap from device memory
> +  directly, we will have to allocate the memmap for this memory from the
> +  kernel zones.

I'm not sure an admin-guide reader will know when we use an altmap and when
we don't. Maybe:

  DAX: in certain DAX configurations the memory map for the device
  memory will be allocated from the kernel zones.

> -A memory block under ZONE_MOVABLE is considered to be able to be offlined
> -easily. But under some busy state, it may return -EBUSY. Even if a memory
> -block cannot be offlined due to -EBUSY, you can retry offlining it and may be
> -able to offline it (or not). (For example, a page is referred to by some kernel
> -internal call and released soon.)
> +- Long-term pinning of pages. Techniques that rely on long-term pinnings
> +  (especially, RDMA and vfio/mdev) are fundamentally problematic with
> +  ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
> +  on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
> +  have to be migrated off that zone while pinning. Pinning a page can fail
> +  even if there is plenty of free memory in ZONE_MOVABLE.
>
> -Consideration:
> -  Memory hotplug's design direction is to make the possibility of memory
> -  offlining higher and to guarantee unplugging memory under any situation. But
> -  it needs more work. Returning -EBUSY under some situation may be good because
> -  the user can decide to retry more or not by himself. Currently, memory
> -  offlining code does some amount of retry with 120 seconds timeout.
> +  In addition, using ZONE_MOVABLE might make page pinning more expensive,
> +  because of the page migration overhead.
>
> -Physical memory remove
> -======================
> +Boot Memory and ZONE_MOVABLE
> +----------------------------
>
> -Need more implementation yet....
> - - Notification completion of remove works by OS to firmware.
> - - Guard from remove if not yet.
> +Without further configuration, all boot memory will be managed by kernel zones
> +when booting up in most configurations. ZONE_MOVABLE is not used as default.
>
> +However, there is a mechanism to configure that behavior during boot via the
> +cmdline: ``kernelcore=`` and ``movablecore=``. See
> +Documentation/admin-guide/kernel-parameters.rst for details.
> +
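While at it, a tiny example under the cmdline paragraph might help readers
who never used these options. Something along these lines (the sizes are
purely illustrative; the exact syntax is in kernel-parameters.rst):

  For example, on a machine with 32 GiB of RAM, booting with

    movablecore=24G

  requests roughly 24 GiB to be managed by ZONE_MOVABLE, leaving about
  8 GiB for the kernel zones, i.e. close to the 3:1 MOVABLE:KERNEL ratio
  mentioned above.
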
> +Memory Offlining and ZONE_MOVABLE
> +---------------------------------
> +
> +Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
> +block might fail:
> +
> +- Memory blocks with memory holes; this applies to memory blocks present during
> +  boot and can apply to memory blocks hotplugged via the XEN balloon and the
> +  Hyper-V balloon.
> +
> +- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
> +  offlining; this applies to memory blocks present during boot only.
> +
> +- Special memory blocks prevented by the system from getting offlined. Examples
> +  include any memory available during boot on arm64 or memory blocks spanning
> +  the crashkernel area on s390x; this usually applies to memory blocks present
> +  during boot only.
> +
> +- Memory blocks overlapping with CMA areas cannot be offlined, this applies to
> +  memory blocks present during boot only.
> +
> +- Concurrent activity that operates on the same physical memory area, such as
> +  allocating gigantic pages, can result in temporary offlining failures.
> +
> +- Out of memory when dissolving huge pages, especially when freeing unused
> +  vmemmap pages associated with each hugetlb page is enabled.
> +
> +  Offlining code may be able to migrate huge page contents, but may not be able
> +  to dissolve the source huge page because it fails allocating (unmovable) pages
> +  for the vmemmap, because the system might not have free memory in the kernel
> +  zones left.
> +
> +  Users that depend on memory hotplug to succeed for movable zones should
> +  carefully consider whether the memory savings gained from this feature are
> +  worth the risk of possibly not being able to offline memory in certain
> +  situations.
> +
> +Further, when running into out of memory situations while migrating pages, or
> +when still encountering permanently unmovable pages within ZONE_MOVABLE
> +(-> BUG), memory offlining will keep retrying until it eventually succeeds.
>
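Perhaps a short pointer to how this looks from user space would round the
section off. An illustrative sequence (memory block 32 is a made-up example,
and the ``valid_zones`` output of course depends on the system):

  % cat /sys/devices/system/memory/memory32/valid_zones
  Movable
  % echo offline > /sys/devices/system/memory/memory32/state

For an online block, ``valid_zones`` shows the zone the memory is currently
managed by, so it is an easy way to check whether a block is backed by
ZONE_MOVABLE before trying to offline it. If the write fails with -EBUSY due
to one of the temporary conditions above, retrying later may succeed.
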
>  Locking Internals
>  =================
> @@ -440,8 +594,8 @@ As the device is visible to user space before taking the device_lock(), this
>  can result in a lock inversion.
>
>  onlining/offlining of memory should be done via device_online()/
> -device_offline() - to make sure it is properly synchronized to actions
> -via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
> +device_offline() -- to make sure it is properly synchronized to actions
> +via sysfs -- while holding the device_hotplug_lock.
>
>  When adding/removing/onlining/offlining memory or adding/removing
>  heterogeneous/device memory, we should always hold the mem_hotplug_lock in
> @@ -452,15 +606,3 @@ In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
>  mode allows for a quite efficient get_online_mems/put_online_mems
>  implementation, so code accessing memory can protect from that memory
>  vanishing.
> -
> -
> -Future Work
> -===========
> -
> - - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
> -   sysctl or new control file.
> - - showing memory block and physical device relationship.
> - - test and make it better memory offlining.
> - - support HugeTLB page migration and offlining.
> - - memmap removing at memory offline.
> - - physical remove memory.
> --
> 2.31.1
>

--
Sincerely yours,
Mike.