David Hildenbrand <david@xxxxxxxxxx> writes:

> On 13.07.23 21:12, Jeff Moyer wrote:
>> David Hildenbrand <david@xxxxxxxxxx> writes:
>>
>>> On 16.06.23 00:00, Vishal Verma wrote:
>>>> The dax/kmem driver can potentially hot-add large amounts of memory
>>>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>>>> memories'. There is a chance there isn't enough regular system memory
>>>> available to fit the memmap for this new memory. It's therefore
>>>> desirable, if all other conditions are met, for the kmem managed memory
>>>> to place its memmap on the newly added memory itself.
>>>>
>>>> Arrange for this by first allowing for a module parameter override for
>>>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>>>> only other caller of this interface in drivers/acpi/acpi_memhotplug.c,
>>>> exporting the symbol so it can be called by kmem.c, and finally changing
>>>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>>>
>>> 1) Why is the override a requirement here? Just let the admin
>>> configure it, then add conditional support for kmem.
>>>
>>> 2) I recall that there are cases where we don't want the memmap to
>>> land on slow memory (which online_movable would achieve). Just imagine
>>> the slow PMEM case. So this might need another configuration knob on
>>> the kmem side.
>>
>> From my memory, the case where you don't want the memmap to land on
>> *persistent memory* is when the device is small (such as NVDIMM-N), and
>> you want to reserve as much space as possible for the application data.
>> This has nothing to do with the speed of access.
>
> Now that you mention it, I also do remember the origin of the altmap --
> to achieve exactly that: place the memmap on the device.
>
> commit 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3
> Author: Dan Williams <dan.j.williams@xxxxxxxxx>
> Date:   Fri Jan 15 16:56:22 2016 -0800
>
>     x86, mm: introduce vmem_altmap to augment vmemmap_populate()
>
>     In support of providing struct page for large persistent memory
>     capacities, use struct vmem_altmap to change the default policy for
>     allocating memory for the memmap array.  The default vmemmap_populate()
>     allocates page table storage area from the page allocator.  Given
>     persistent memory capacities relative to DRAM it may not be feasible to
>     store the memmap in 'System Memory'.  Instead vmem_altmap represents
>     pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
>     requests.
>
> In PFN_MODE_PMEM (and only then), we use the altmap (don't see a way to
> configure it).

Configuration is done at pmem namespace creation time.  The metadata
for the namespace indicates where the memmap resides.  See the
ndctl-create-namespace man page:

       -M, --map=
           A pmem namespace in "fsdax" or "devdax" mode requires allocation
           of per-page metadata.  The allocation can be drawn from either:

           ·   "mem": typical system memory

           ·   "dev": persistent memory reserved from the namespace

           Given relative capacities of "Persistent Memory" to "System RAM"
           the allocation defaults to reserving space out of the namespace
           directly ("--map=dev").  The overhead is 64-bytes per 4K (16GB
           per 1TB) on x86.

> BUT that case is completely different from the "System RAM" mode.  The
> memmap of an NVDIMM in pmem mode is barely used by core-mm (i.e., not
> the buddy).

Right.  (btw, I don't think system ram mode existed back then.)

> In comparison, if the buddy and everybody else works on the memmap in
> "System RAM", it's much more significant if that resides on slow memory.

Agreed.
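
(As an aside, just so we're picturing the same mechanism: the chunked
hot-add described in the cover letter amounts to something like the
sketch below.  This is only an illustration, not Vishal's actual patch --
the helper name kmem_add_in_blocks() and its nid/start/len plumbing are
invented, it uses plain add_memory() as the cover letter text says even
though the driver really goes through add_memory_driver_managed(), and
mhp_supports_memmap_on_memory() is only callable from a module once the
series exports it.)

    #include <linux/memory.h>          /* memory_block_size_bytes() */
    #include <linux/memory_hotplug.h>  /* add_memory(), mhp_t, MHP_* flags */

    /*
     * Illustration only: hot-add a dax/kmem range one memory block at a
     * time, asking for each block's memmap to be placed on the block
     * itself when the core-mm says a range of that size supports it.
     */
    static int kmem_add_in_blocks(int nid, u64 start, u64 len)
    {
            u64 block_size = memory_block_size_bytes();
            u64 offset;
            int rc;

            for (offset = 0; offset < len; offset += block_size) {
                    mhp_t flags = MHP_NONE;

                    /* Assumes the series exports this symbol to modules. */
                    if (mhp_supports_memmap_on_memory(block_size))
                            flags |= MHP_MEMMAP_ON_MEMORY;

                    rc = add_memory(nid, start + offset, block_size, flags);
                    if (rc) {
                            /* A real driver would unwind the blocks added
                             * so far before bailing out. */
                            return rc;
                    }
            }

            return 0;
    }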
> Looking at
>
> commit 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375
> Author: Michal Hocko <mhocko@xxxxxxxx>
> Date:   Tue Oct 3 16:16:19 2017 -0700
>
>     mm, page_alloc: add scheduling point to memmap_init_zone
>
>     memmap_init_zone gets a pfn range to initialize and it can be really
>     large resulting in a soft lockup on non-preemptible kernels
>
>       NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [kworker/u642:5:1720]
>       [...]
>       task: ffff88ecd7e902c0 ti: ffff88eca4e50000 task.ti: ffff88eca4e50000
>       RIP: move_pfn_range_to_zone+0x185/0x1d0
>       [...]
>       Call Trace:
>        devm_memremap_pages+0x2c7/0x430
>        pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
>        nvdimm_bus_probe+0x64/0x110 [libnvdimm]
>
> It's hard to tell if that was only required due to the memmap for these
> devices being that large, or also partially because access to the memmap
> is slow enough that it makes a real difference.

I believe the main driver was the size.  At the time, Intel was
advertising 3TiB/socket for pmem.  I can't remember the exact DRAM
configuration sizes from the time.

> I recall that we're also often using ZONE_MOVABLE on such slow memory
> to not end up placing other kernel data structures on there: especially,
> user space page tables as I've been told.

Part of the issue was preserving the media.  The page structure gets
lots of updates, and that could cause premature wear.

> @Dan, any insight on the performance aspects when placing the memmap on
> (slow) memory and having that memory be consumed by the buddy, where we
> frequently operate on the memmap?

I'm glad you're asking these questions.  We definitely want to make
sure we don't conflate requirements based on some particular
technology/implementation.  Also, I wouldn't make any assumptions
about the performance of CXL devices.  As I understand it, there could
be a broad spectrum of performance profiles.

And now Dan can correct anything I got wrong.  ;-)

Cheers,
Jeff
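
P.S.  Back-of-the-envelope numbers behind "the main driver was the
size", using the 64 bytes of struct page per 4KiB page figure from the
man page above (my arithmetic, not something from the thread):

    64B / 4KiB            = 1/64 of capacity  (matches "16GB per 1TB")
    3TiB * 64B / 4KiB     = 48GiB of memmap for a 3TiB pmem socket

That's a lot of DRAM to hand over to metadata, regardless of how fast
the device itself is.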