On 23.04.19 09:45, Anshuman Khandual wrote:
> On 04/23/2019 01:07 PM, David Hildenbrand wrote:
>> On 23.04.19 09:31, Anshuman Khandual wrote:
>>> On 04/18/2019 10:58 AM, Anshuman Khandual wrote:
>>>> On 04/17/2019 11:09 PM, Mark Rutland wrote:
>>>>> On Wed, Apr 17, 2019 at 10:15:35PM +0530, Anshuman Khandual wrote:
>>>>>> On 04/17/2019 07:51 PM, Mark Rutland wrote:
>>>>>>> On Wed, Apr 17, 2019 at 03:28:18PM +0530, Anshuman Khandual wrote:
>>>>>>>> On 04/15/2019 07:18 PM, Mark Rutland wrote:
>>>>>>>>> On Sun, Apr 14, 2019 at 11:29:13AM +0530, Anshuman Khandual wrote:
>>>>>>>>>> +	spin_unlock(&init_mm.page_table_lock);
>>>>>>>>>
>>>>>>>>> What precisely is the page_table_lock intended to protect?
>>>>>>>>
>>>>>>>> Concurrent modification of the kernel page table (init_mm) while
>>>>>>>> clearing entries.
>>>>>>>
>>>>>>> Concurrent modification by what code?
>>>>>>>
>>>>>>> If something else can *modify* the portion of the table that we're
>>>>>>> manipulating, then I don't see how we can safely walk the table up to
>>>>>>> this point without holding the lock, nor how we can safely add memory.
>>>>>>>
>>>>>>> Even if this is to protect something else which *reads* the tables,
>>>>>>> other code in arm64 which modifies the kernel page tables doesn't take
>>>>>>> the lock.
>>>>>>>
>>>>>>> Usually, if you can do a lockless walk you have to verify that things
>>>>>>> didn't change once you've taken the lock, but we don't follow that
>>>>>>> pattern here.
>>>>>>>
>>>>>>> As things stand it's not clear to me whether this is necessary or
>>>>>>> sufficient.
>>>>>>
>>>>>> Hence let's take a more conservative approach and wrap the entire
>>>>>> process of remove_pagetable() under init_mm.page_table_lock. That looks
>>>>>> safe unless, in the worst case, free_pages() gets stuck for some
>>>>>> reason, in which case we have a bigger memory problem to deal with than
>>>>>> a soft lockup.
>>>>>
>>>>> Sorry, but I'm not happy with _any_ solution until we understand where
>>>>> and why we need to take the init_mm ptl, and have made some effort to
>>>>> ensure that the kernel correctly does so elsewhere. It is not sufficient
>>>>> to consider this code in isolation.
>>>>
>>>> We will have to take the kernel page table lock to avoid assumptions
>>>> about the present or any future kernel VA space layout. Wrapping the
>>>> entire remove_pagetable() will be coarse-grained, but I don't see why it
>>>> should not be sufficient, at least for this particular teardown
>>>> operation, regardless of how it might affect other kernel pgtable
>>>> walkers.
>>>>
>>>> IIUC your concern is about other parts of kernel code (arm64/generic)
>>>> which assume that the kernel page table won't be changing and hence
>>>> normally walk the table without holding the pgtable lock. Those existing
>>>> pgtable walkers will be affected by this change.
>>>>
>>>>> IIUC, before this patch we never cleared non-leaf entries in the kernel
>>>>> page tables, so readers don't presently need to take the ptl in order to
>>>>> safely walk down to a leaf entry.
>>>>
>>>> Got it. Will look into this.
>>>>
>>>>> For example, the arm64 ptdump code never takes the ptl, and as of this
>>>>> patch it will blow up if it races with a hot-remove, regardless of
>>>>> whether the hot-remove code itself holds the ptl.
>>>>
>>>> Got it. Are there more such examples where this can be problematic? I
>>>> will be happy to investigate all such places and change/add the locking
>>>> scheme there to make them work with memory hot remove.
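To make the coarse-grained proposal above concrete, here is a minimal
sketch of the locking pattern under discussion; remove_pagetable() below
is an assumed stand-in for the arch-specific teardown, not the actual
patch, and the signature is illustrative:

#include <linux/mm.h>
#include <linux/spinlock.h>

/* Assumed stand-in for the arch-specific teardown being discussed. */
static void remove_pagetable(unsigned long start, unsigned long end);

static void hot_remove_teardown(unsigned long start, unsigned long end)
{
	/*
	 * Coarse-grained: hold init_mm.page_table_lock across the whole
	 * teardown, so init_mm modifiers that also take this lock (e.g.
	 * pte_alloc_kernel()) cannot observe a half-removed table.
	 * Lockless walkers such as ptdump remain unprotected, which is
	 * exactly the objection raised above.
	 */
	spin_lock(&init_mm.page_table_lock);
	remove_pagetable(start, end);
	spin_unlock(&init_mm.page_table_lock);
}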
>>>>> Note that the same applies to the x86 ptdump code; we cannot assume
>>>>> that just because x86 does something that it happens to be correct.
>>>>
>>>> I understand. Will look into how other non-x86 platforms are dealing
>>>> with this as well.
>>>>
>>>>> I strongly suspect there are other cases that would fall afoul of this,
>>>>> in both arm64 and generic code.
>>>
>>> On x86
>>>
>>> - kernel_physical_mapping_init() takes the lock for pgtable page
>>>   allocations as well as for all leaf-level entries, including large
>>>   mappings.
>>>
>>> On Power
>>>
>>> - remove_pagetable() takes an overall high-level init_mm.page_table_lock,
>>>   as I had suggested before. __map_kernel_page() calls
>>>   [pud|pmd|pte]_alloc_[kernel](); the page table page allocations are
>>>   protected with init_mm.page_table_lock, but the actual setting of the
>>>   leaf entries is not (unlike x86):
>>>
>>>   arch_add_memory()
>>>       create_section_mapping()
>>>           radix__create_section_mapping()
>>>               create_physical_mapping()
>>>                   __map_kernel_page()
>>>
>>> On arm64
>>>
>>> Both the kernel page table dump and the linear mapping
>>> (__create_pgd_mapping() on init_mm) will require init_mm.page_table_lock
>>> to be really safe against this new memory hot remove code. I will make
>>> the necessary changes as part of this series next time around. IIUC
>>> there is no generic equivalent of ARM64_PTDUMP_CORE/DEBUGFS.
>>>
>>>> Will start looking into all such possible cases both on arm64 and in
>>>> generic code. Meanwhile, more such pointers would be really helpful.
>>>
>>> Generic usage of init_mm.page_table_lock
>>>
>>> Unless I have missed something, these are the generic init_mm kernel page
>>> table modifiers at runtime (at least those which use
>>> init_mm.page_table_lock):
>>>
>>> 1. ioremap_page_range()  /* Mapped I/O memory area */
>>> 2. apply_to_page_range() /* Change existing kernel linear map */
>>> 3. vmap_page_range()     /* Vmalloc area */
>>>
>>> A. IOREMAP
>>>
>>> ioremap_page_range()
>>>     ioremap_p4d_range()
>>>         p4d_alloc()
>>>         ioremap_try_huge_p4d() -> p4d_set_huge() -> set_p4d()
>>>         ioremap_pud_range()
>>>             pud_alloc()
>>>             ioremap_try_huge_pud() -> pud_set_huge() -> set_pud()
>>>             ioremap_pmd_range()
>>>                 pmd_alloc()
>>>                 ioremap_try_huge_pmd() -> pmd_set_huge() -> set_pmd()
>>>                 ioremap_pte_range()
>>>                     pte_alloc_kernel()
>>>                     set_pte_at() -> set_pte()
>>>
>>> B. APPLY_TO_PAGE_RANGE
>>>
>>> apply_to_page_range()
>>>     apply_to_p4d_range()
>>>         p4d_alloc()
>>>         apply_to_pud_range()
>>>             pud_alloc()
>>>             apply_to_pmd_range()
>>>                 pmd_alloc()
>>>                 apply_to_pte_range()
>>>                     pte_alloc_kernel()
>>>
>>> C. VMAP_PAGE_RANGE
>>>
>>> vmap_page_range()
>>>     vmap_page_range_noflush()
>>>         vmap_p4d_range()
>>>             p4d_alloc()
>>>             vmap_pud_range()
>>>                 pud_alloc()
>>>                 vmap_pmd_range()
>>>                     pmd_alloc()
>>>                     vmap_pte_range()
>>>                         pte_alloc_kernel()
>>>                         set_pte_at()
>>>
>>> In all of the above:
>>>
>>> - Page table page allocations ([p4d|pud|pmd|pte]_alloc_[kernel]()) are
>>>   protected with init_mm.page_table_lock.
>>> - Should they not also take init_mm.page_table_lock for all leaf-level
>>>   (PUD|PMD|PTE) modifications?
>>> - Should they not take init_mm.page_table_lock for the page table walk
>>>   itself?
>>>
>>> Not taking an overall lock in these three operations will potentially
>>> race with an ongoing memory hot remove operation, which takes an overall
>>> lock as proposed. Wondering whether this has been safe till now?
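The asymmetry listed above is easiest to see in the vmalloc path. Below
is a simplified sketch modeled on vmap_pte_range() in mm/vmalloc.c
(error checks elided), showing where init_mm.page_table_lock is and is
not held:

/*
 * Simplified sketch modeled on vmap_pte_range() (mm/vmalloc.c), error
 * checks elided. Only the table *allocation* is serialized: nothing in
 * this path holds init_mm.page_table_lock during the walk or while
 * writing the leaf entries.
 */
static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
{
	pte_t *pte;

	/*
	 * Takes and drops init_mm.page_table_lock internally if a new
	 * pte table has to be installed.
	 */
	pte = pte_alloc_kernel(pmd, addr);
	if (!pte)
		return -ENOMEM;
	do {
		struct page *page = pages[(*nr)++];

		/* Leaf update: init_mm.page_table_lock is NOT held here. */
		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
	} while (pte++, addr += PAGE_SIZE, addr != end);
	return 0;
}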
>>
>> All memory add/remove operations are currently guarded by
>> mem_hotplug_lock as far as I know.
>
> Right, but it seems like it guards against concurrent memory hot add or
> remove operations with respect to memory blocks, sections, sysfs etc.
> But does it also cover other init_mm modifiers or accessors?

Don't think so, it's purely for memory add/remove via
add_memory/remove_memory/devm_memremap_pages, not anything beyond that.
Whoever uses get_online_mems/put_online_mems is safe with respect to
that - mostly slab/slub.
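For reference, a minimal sketch of what such a guarded user looks like;
the walk body is a hypothetical placeholder:

#include <linux/memory_hotplug.h>

/*
 * get_online_mems() takes mem_hotplug_lock for reading, so
 * add_memory()/remove_memory() cannot run concurrently with the
 * bracketed section. The iteration itself is a hypothetical
 * placeholder, not an existing kernel function.
 */
static void inspect_memory_sections(void)
{
	get_online_mems();
	/* ... safely iterate memory sections/blocks here ... */
	put_online_mems();
}

--
Thanks,

David / dhildenb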