On Mon, Jan 25, 2021 at 11:39:55AM +0100, Oscar Salvador wrote:
> > Interesting, so we automatically support differing sizeof(struct
> > page). I guess it will be problematic in case of sizeof(struct page) !=
> > 64, because then we might not have multiples of 2MB for the memmap of a
> > memory block.
> >
> > IIRC, it could happen that if we add two consecutive memory blocks,
> > the second one might reuse parts of the vmemmap residing on the first
> > memory block. If you remove the first one, you might be in trouble.
> >
> > E.g., on x86-64
> > vmemmap_populate()->vmemmap_populate_hugepages()->vmemmap_alloc_block_buf():
> > - Populate a huge page
> >
> > vmemmap_free()->remove_pagetable()...->remove_pmd_table():
> > - memchr_inv() will leave the hugepage populated.
> >
> > Do we want to fence that off, maybe in mhp_supports_memmap_on_memory()?
> > Or do we somehow want to fix that? We should never populate partial huge
> > pages from an altmap ...
> >
> > But maybe I am missing something.
>
> No, you are not missing anything.
>
> I think that remove_pmd_table() should be fixed.
> vmemmap_populate_hugepages() allocates a PMD_SIZE chunk and marks the PMD as
> PAGE_KERNEL_LARGE, so when remove_pmd_table() sees that 1) the PMD is large
> and 2) the chunk is not aligned, the memset() should be writing PAGE_INUSE
> for a PMD chunk.
>
> I do not really see a reason why this should not happen.
> Actually, we will be leaving pagetables behind, as we never get to free
> the PMD chunk when sizeof(struct page) is not a multiple of 2MB.
>
> I plan to fix remove_pmd_table(), but for now let us not allow this feature
> to be used if the size of a struct page is not 64.
> Let us keep it simple for the time being, shall we?

My first intention was:

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b5a3fa4033d3..0c9756a2eb24 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1044,32 +1044,14 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 			continue;
 
 		if (pmd_large(*pmd)) {
-			if (IS_ALIGNED(addr, PMD_SIZE) &&
-			    IS_ALIGNED(next, PMD_SIZE)) {
-				if (!direct)
-					free_hugepage_table(pmd_page(*pmd),
-							    altmap);
-
-				spin_lock(&init_mm.page_table_lock);
-				pmd_clear(pmd);
-				spin_unlock(&init_mm.page_table_lock);
-				pages++;
-			} else {
-				/* If here, we are freeing vmemmap pages. */
-				memset((void *)addr, PAGE_INUSE, next - addr);
-
-				page_addr = page_address(pmd_page(*pmd));
-				if (!memchr_inv(page_addr, PAGE_INUSE,
-						PMD_SIZE)) {
-					free_hugepage_table(pmd_page(*pmd),
-							    altmap);
-
-					spin_lock(&init_mm.page_table_lock);
-					pmd_clear(pmd);
-					spin_unlock(&init_mm.page_table_lock);
-				}
-			}
+			if (!direct)
+				free_hugepage_table(pmd_page(*pmd),
+						    altmap);
+			spin_lock(&init_mm.page_table_lock);
+			pmd_clear(pmd);
+			spin_unlock(&init_mm.page_table_lock);
+			pages++;
 			continue;
 		}

I did not try it out yet and this might explode badly, but AFAICS a PMD-sized
chunk is always allocated, even when the section's memmap does not span 2MB.

E.g., with sizeof(struct page) = 56:

PAGES_PER_SECTION * 64 = 2097152
PAGES_PER_SECTION * 56 = 1835008

Even in the latter case, vmemmap_populate_hugepages() will allocate a PMD-sized
chunk to satisfy the unaligned range.
So, unless I am missing something, it should not be a problem to free that 2MB
in remove_pmd_table() when 1) the PMD is large and 2) the range is not aligned.

--
Oscar Salvador
SUSE L3
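
For context, a minimal sketch (not part of the mail above) of the kind of guard
the reply describes for mhp_supports_memmap_on_memory(): refuse memmap_on_memory
when the memory block's memmap would not cover whole 2MB vmemmap pages. The
helper name mhp_memmap_is_pmd_aligned() and its placement are assumptions made
for illustration only, not the upstream implementation.

/*
 * Sketch only: helper name and placement are assumptions. The idea is that
 * a memory block's memmap must be a multiple of PMD_SIZE, so that
 * vmemmap_populate_hugepages()/remove_pmd_table() only ever deal with
 * fully backed 2MB vmemmap pages.
 */
#include <linux/kernel.h>
#include <linux/mm.h>

static bool mhp_memmap_is_pmd_aligned(unsigned long block_size)
{
	unsigned long nr_pages = block_size >> PAGE_SHIFT;
	unsigned long vmemmap_size = nr_pages * sizeof(struct page);

	/* A partially used vmemmap huge page could not be freed cleanly. */
	return IS_ALIGNED(vmemmap_size, PMD_SIZE);
}

With 4KB pages and 128MB memory blocks this matches the numbers quoted in the
mail: 32768 * 64 = 2097152 (exactly 2MB, aligned), while 32768 * 56 = 1835008
(not PMD-aligned), which is the sizeof(struct page) == 64 restriction.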