On Wed, Jun 05, 2019 at 02:58:42PM -0700, Dan Williams wrote: > The libnvdimm sub-system has suffered a series of hacks and broken > workarounds for the memory-hotplug implementation's awkward > section-aligned (128MB) granularity. For example the following backtrace > is emitted when attempting arch_add_memory() with physical address > ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM) > within a given section: > > WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0 > devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200] > [..] > Call Trace: > dump_stack+0x86/0xc3 > __warn+0xcb/0xf0 > warn_slowpath_fmt+0x5f/0x80 > devm_memremap_pages+0x3b5/0x4c0 > __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap] > pmem_attach_disk+0x19a/0x440 [nd_pmem] > > Recently it was discovered that the problem goes beyond RAM vs PMEM > collisions as some platform produce PMEM vs PMEM collisions within a > given section. The libnvdimm workaround for that case revealed that the > libnvdimm section-alignment-padding implementation has been broken for a > long while. A fix for that long-standing breakage introduces as many > problems as it solves as it would require a backward-incompatible change > to the namespace metadata interpretation. Instead of that dubious route > [1], address the root problem in the memory-hotplug implementation. > > [1]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx > Cc: Michal Hocko <mhocko@xxxxxxxx> > Cc: Vlastimil Babka <vbabka@xxxxxxx> > Cc: Logan Gunthorpe <logang@xxxxxxxxxxxx> > Cc: Oscar Salvador <osalvador@xxxxxxx> > Cc: Pavel Tatashin <pasha.tatashin@xxxxxxxxxx> > Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx> > --- > include/linux/memory_hotplug.h | 2 > mm/memory_hotplug.c | 7 - > mm/page_alloc.c | 2 > mm/sparse.c | 225 +++++++++++++++++++++++++++------------- > 4 files changed, 155 insertions(+), 81 deletions(-) > [...] > @@ -325,6 +332,15 @@ static void __meminit sparse_init_one_section(struct mem_section *ms, > unsigned long pnum, struct page *mem_map, > struct mem_section_usage *usage) > { > + /* > + * Given that SPARSEMEM_VMEMMAP=y supports sub-section hotplug, > + * ->section_mem_map can not be guaranteed to point to a full > + * section's worth of memory. The field is only valid / used > + * in the SPARSEMEM_VMEMMAP=n case. > + */ > + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) > + mem_map = NULL; Will this be a problem when reading mem_map with the crash-tool? I do not expect it to be, but I am not sure if crash internally tries to read ms->section_mem_map and do some sort of translation. And since ms->section_mem_map SECTION_HAS_MEM_MAP, it might be that it expects a valid mem_map? > +static void section_deactivate(unsigned long pfn, unsigned long nr_pages, > + struct vmem_altmap *altmap) > +{ > + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; > + DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 }; > + struct mem_section *ms = __pfn_to_section(pfn); > + bool early_section = is_early_section(ms); > + struct page *memmap = NULL; > + unsigned long *subsection_map = ms->usage > + ? &ms->usage->subsection_map[0] : NULL; > + > + subsection_mask_set(map, pfn, nr_pages); > + if (subsection_map) > + bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION); > + > + if (WARN(!subsection_map || !bitmap_equal(tmp, map, SUBSECTIONS_PER_SECTION), > + "section already deactivated (%#lx + %ld)\n", > + pfn, nr_pages)) > + return; > + > + /* > + * There are 3 cases to handle across two configurations > + * (SPARSEMEM_VMEMMAP={y,n}): > + * > + * 1/ deactivation of a partial hot-added section (only possible > + * in the SPARSEMEM_VMEMMAP=y case). > + * a/ section was present at memory init > + * b/ section was hot-added post memory init > + * 2/ deactivation of a complete hot-added section > + * 3/ deactivation of a complete section from memory init > + * > + * For 1/, when subsection_map does not empty we will not be > + * freeing the usage map, but still need to free the vmemmap > + * range. > + * > + * For 2/ and 3/ the SPARSEMEM_VMEMMAP={y,n} cases are unified > + */ > + bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION); > + if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) { > + unsigned long section_nr = pfn_to_section_nr(pfn); > + > + if (!early_section) { > + kfree(ms->usage); > + ms->usage = NULL; > + } > + memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); > + ms->section_mem_map = sparse_encode_mem_map(NULL, section_nr); > + } > + > + if (early_section && memmap) > + free_map_bootmem(memmap); > + else > + depopulate_section_memmap(pfn, nr_pages, altmap); > +} > + > +static struct page * __meminit section_activate(int nid, unsigned long pfn, > + unsigned long nr_pages, struct vmem_altmap *altmap) > +{ > + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; > + struct mem_section *ms = __pfn_to_section(pfn); > + struct mem_section_usage *usage = NULL; > + unsigned long *subsection_map; > + struct page *memmap; > + int rc = 0; > + > + subsection_mask_set(map, pfn, nr_pages); > + > + if (!ms->usage) { > + usage = kzalloc(mem_section_usage_size(), GFP_KERNEL); > + if (!usage) > + return ERR_PTR(-ENOMEM); > + ms->usage = usage; > + } > + subsection_map = &ms->usage->subsection_map[0]; > + > + if (bitmap_empty(map, SUBSECTIONS_PER_SECTION)) > + rc = -EINVAL; > + else if (bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION)) > + rc = -EEXIST; > + else > + bitmap_or(subsection_map, map, subsection_map, > + SUBSECTIONS_PER_SECTION); > + > + if (rc) { > + if (usage) > + ms->usage = NULL; > + kfree(usage); > + return ERR_PTR(rc); > + } We should not be really looking at subsection_map stuff when running on !CONFIG_SPARSE_VMEMMAP, right? Would it make sense to hide the bitmap dance behind if(IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) ? Sorry for nagging here > /** > - * sparse_add_one_section - add a memory section > + * sparse_add_section - add a memory section, or populate an existing one > * @nid: The node to add section on > * @start_pfn: start pfn of the memory range > + * @nr_pages: number of pfns to add in the section > * @altmap: device page map > * > * This is only intended for hotplug. Below this, the return codes are specified: --- * Return: * * 0 - On success. * * -EEXIST - Section has been present. * * -ENOMEM - Out of memory. */ --- We can get rid of -EEXIST since we do not return that anymore. -- Oscar Salvador SUSE L3