Re: [PATCH 0/7] riscv: Memory Hot(Un)Plug support

On 12.05.23 16:57, Björn Töpel wrote:
From: Björn Töpel <bjorn@xxxxxxxxxxxx>

Memory Hot(Un)Plug support for the RISC-V port
==============================================

Introduction
------------

To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime."

This series attempts to add memory hot(un)plug support for the RISC-V
Linux port.

I'm sending the series as a v1, but it's borderline RFC. It definitely
needs more testing time, but it would be nice with some early input.

Implementation
--------------

From an arch perspective, a couple of callbacks need to be
implemented to support hot plugging:

arch_add_memory()
This callback is responsible for updating the linear/direct map, and
for calling into the generic memory hot plugging code via __add_pages().

arch_remove_memory()
In this callback the linear/direct map is torn down.

vmemmap_free()
The function tears down the vmemmap mappings (if
CONFIG_SPARSEMEM_VMEMMAP is in use), and also deallocates the backing
vmemmap pages. Note that for persistent memory, an alternative
allocator for the backing pages can be used -- the vmem_altmap. This
means that when the backing pages are cleared, extra care is needed so
that the correct deallocation method is used. Note that RISC-V
populates the vmemmap using vmemmap_populate_basepages(), so currently
no hugepages are used for the backing store.

The page table unmap/teardown functions are heavily based on (copied
from!) the x86 tree. The same remove_pgd_mapping() is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.
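
The difference between the two callers can be illustrated with a toy
userspace model. This is only a sketch: the single-level table, the
toy_* names and the table size are assumptions made for illustration,
and none of it is code from the series or from x86.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_NR_ENTRIES 8	/* hypothetical table size, not a RISC-V value */

/* One toy "leaf" table: NULL means not mapped, otherwise the entry
 * points at the page backing the mapping. */
struct toy_table {
	void *entry[TOY_NR_ENTRIES];
};

/*
 * Clear the entries in [start, end). With free_backing set, the backing
 * pages are released as well (the vmemmap_free()-like case). Without it,
 * only the mapping is dropped (the arch_remove_memory()-like case) and
 * the caller keeps ownership of the pages.
 * Returns the number of backing pages that were freed.
 */
int toy_remove_mapping(struct toy_table *t, int start, int end,
		       bool free_backing)
{
	int freed = 0;

	assert(start >= 0 && start <= end && end <= TOY_NR_ENTRIES);
	for (int i = start; i < end; i++) {
		if (!t->entry[i])
			continue;	/* already unmapped */
		if (free_backing) {
			free(t->entry[i]);
			freed++;
		}
		t->entry[i] = NULL;	/* the mapping goes away in both cases */
	}
	return freed;
}

/* Count the entries that are still mapped. */
int toy_count_mapped(const struct toy_table *t)
{
	int n = 0;

	for (int i = 0; i < TOY_NR_ENTRIES; i++)
		n += t->entry[i] != NULL;
	return n;
}
```

In the model, one walker serves both paths and a single flag decides
whether the backing pages go away together with the mapping.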

On RISC-V, the PGD level kernel mappings need to be synchronized with
all page-tables (e.g. via sync_kernel_mappings()). Synchronization
involves special care, like locking. Instead, this patch series takes
a different approach (introduced by Jörg Rödel in the x86-tree):
pre-allocate the PGD-leaves (P4D, PUD, or PMD depending on the paging
setup) at mem_init(), for vmemmap and the direct map.
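
Why pre-allocation avoids explicit syncing can be shown with a toy
two-level model in userspace C. It is a sketch under stated
assumptions: the toy_* names, the table sizes and the copy-at-creation
step are illustrative and are not taken from the series.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PGD_ENTRIES  4	/* hypothetical sizes, chosen for readability */
#define TOY_LEAF_ENTRIES 4

struct toy_leaf { unsigned long pte[TOY_LEAF_ENTRIES]; };
struct toy_pgd  { struct toy_leaf *leaf[TOY_PGD_ENTRIES]; };

/* Boot-time step: give every kernel PGD slot a leaf table up front. */
int toy_prealloc_leaves(struct toy_pgd *kernel_pgd)
{
	for (int i = 0; i < TOY_PGD_ENTRIES; i++) {
		kernel_pgd->leaf[i] = calloc(1, sizeof(struct toy_leaf));
		if (!kernel_pgd->leaf[i])
			return -1;
	}
	return 0;
}

/* A new address space copies the kernel PGD entries once, at creation. */
void toy_copy_kernel_pgd(struct toy_pgd *task_pgd,
			 const struct toy_pgd *kernel_pgd)
{
	memcpy(task_pgd, kernel_pgd, sizeof(*task_pgd));
}

/* Hot add: only a leaf table is written, no PGD entry changes. */
void toy_map(struct toy_pgd *pgd, int pgd_idx, int leaf_idx,
	     unsigned long val)
{
	assert(pgd->leaf[pgd_idx] != NULL);
	pgd->leaf[pgd_idx]->pte[leaf_idx] = val;
}

/* Look up a mapping through any PGD; 0 means "not mapped". */
unsigned long toy_lookup(const struct toy_pgd *pgd, int pgd_idx,
			 int leaf_idx)
{
	return pgd->leaf[pgd_idx] ? pgd->leaf[pgd_idx]->pte[leaf_idx] : 0;
}
```

Because every copy of the top level already points at the same leaf
tables, a mapping installed later through the kernel PGD is visible
through the copies without touching them.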

Pre-allocating the PGD-leaves wastes some memory, but is only enabled
for CONFIG_MEMORY_HOTPLUG. Roughly 128 pages of 4K each are allocated,
and they are potentially unused.
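
For a sense of scale, the cost can be worked out with a few lines of
C. This is a minimal sketch that assumes the 4K page size quoted
above; the helper names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>

#define TOY_PAGE_SIZE 4096UL	/* assumes the 4K pages quoted above */

/* Bytes consumed by nr_pages pre-allocated page-table pages. */
unsigned long prealloc_bytes(unsigned long nr_pages)
{
	assert(nr_pages <= ULONG_MAX / TOY_PAGE_SIZE);	/* no overflow */
	return nr_pages * TOY_PAGE_SIZE;
}

/* The same figure in KiB, for readability. */
unsigned long prealloc_kib(unsigned long nr_pages)
{
	return prealloc_bytes(nr_pages) / 1024UL;
}
```

prealloc_kib(128) evaluates to 512, so the worst case is about half a
megabyte.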

Patch 1: Preparation for hotplugging support, by pre-allocating the
          PGD leaves.

Patch 2: Changes the __init attribute to __meminit, so that the
          functions are not discarded after init. __meminit keeps the
          functions after init, if memory hotplugging is enabled for
          the build.

Patch 3: Refactor the direct map setup, so it can be used for hot add.

Patch 4: The actual add/remove code. Mostly a page-table-walk
          exercise.

Patch 5: Turn on the arch support in Kconfig.

Patch 6: Now that memory hotplugging is enabled, make virtio-mem
          usable for RISC-V.

Patch 7: Pre-allocate vmalloc PGD-leaves as well, which removes the
          need for vmalloc faulting.

RFC
---

  * TLB flushes. The current series uses Big Hammer flush-it-all.
  * Pre-allocation vs explicit syncs

Testing
-------

ACPI support is still in the making for RISC-V, so tests that involve
CXL and similar fanciness are currently not possible. Virtio-mem,
however, works without proper ACPI support. In order to try this out
in Qemu, some additional patches for Qemu are needed:

  * Enable virtio-mem for RISC-V
  * Add proper hotplug support for virtio-mem

The patch for Qemu is commit 5d90a7ef1bc0
("hw/riscv/virt: Support for virtio-mem-pci"), and can be found here:

   https://github.com/bjoto/qemu/tree/riscv-virtio-mem

I will try to upstream that work in parallel with this.

Thanks to David Hildenbrand for valuable input on the Qemu side of
things.

The series is based on the RISC-V fixes tree:
   https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=fixes


Cool stuff! I'm fairly busy right now, so some high-level questions upfront:

What is the memory section size (which implies the memory block size)? This implies the minimum DIMM granularity and the high-level granularity in which virtio-mem adds memory.

What is the pageblock size, implying the minimum granularity that virtio-mem can operate on?

On x86-64 and arm64 we currently use the ACPI SRAT to expose the maximum physical address where we can see memory getting hotplugged. [1] From that, we can derive the "max_possible_pfn" and prepare the kernel virtual memory layout (especially the direct map).

Is something similar required on RISC-V? On s390x, I'm planning on adding a paravirtualized mechanism to detect where memory devices might be located. (I had a running RFC, but was distracted by all kinds of other stuff.)
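
To illustrate the derivation of an upper PFN bound from the end of a hotpluggable physical range, here is a rough userspace sketch. It assumes 4 KiB base pages, and the toy_* helpers are hypothetical, not the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4 KiB base pages */

/* Round a byte address up to the next page frame number. */
uint64_t toy_pfn_up(uint64_t addr)
{
	uint64_t page_size = UINT64_C(1) << TOY_PAGE_SHIFT;

	assert(addr <= UINT64_MAX - (page_size - 1));	/* no overflow */
	return (addr + page_size - 1) >> TOY_PAGE_SHIFT;
}

/*
 * Grow the current upper PFN bound so that it covers a hotpluggable
 * range ending (exclusively) at physical address `end`.
 */
uint64_t toy_update_max_possible_pfn(uint64_t cur_max_pfn, uint64_t end)
{
	uint64_t pfn = toy_pfn_up(end);

	return pfn > cur_max_pfn ? pfn : cur_max_pfn;
}
```

For example, a range ending at 4 GiB (0x100000000) yields PFN 0x100000, which raises a smaller existing bound and leaves a larger one unchanged.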


[1] https://virtio-mem.gitlab.io/developer-guide.html

--
Thanks,

David / dhildenb

_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization



