On 23.04.19 09:31, Anshuman Khandual wrote:
>
>
> On 04/18/2019 10:58 AM, Anshuman Khandual wrote:
>> On 04/17/2019 11:09 PM, Mark Rutland wrote:
>>> On Wed, Apr 17, 2019 at 10:15:35PM +0530, Anshuman Khandual wrote:
>>>> On 04/17/2019 07:51 PM, Mark Rutland wrote:
>>>>> On Wed, Apr 17, 2019 at 03:28:18PM +0530, Anshuman Khandual wrote:
>>>>>> On 04/15/2019 07:18 PM, Mark Rutland wrote:
>>>>>>> On Sun, Apr 14, 2019 at 11:29:13AM +0530, Anshuman Khandual wrote:
>>>
>>>>>>>> +	spin_unlock(&init_mm.page_table_lock);
>>>>>>>
>>>>>>> What precisely is the page_table_lock intended to protect?
>>>>>>
>>>>>> Concurrent modification to the kernel page table (init_mm) while clearing entries.
>>>>>
>>>>> Concurrent modification by what code?
>>>>>
>>>>> If something else can *modify* the portion of the table that we're
>>>>> manipulating, then I don't see how we can safely walk the table up to
>>>>> this point without holding the lock, nor how we can safely add memory.
>>>>>
>>>>> Even if this is to protect something else which *reads* the tables,
>>>>> other code in arm64 which modifies the kernel page tables doesn't take
>>>>> the lock.
>>>>>
>>>>> Usually, if you can do a lockless walk you have to verify that things
>>>>> didn't change once you've taken the lock, but we don't follow that
>>>>> pattern here.
>>>>>
>>>>> As things stand it's not clear to me whether this is necessary or
>>>>> sufficient.
>>>>
>>>> Hence let's take a more conservative approach and wrap the entire process of
>>>> remove_pagetable() under init_mm.page_table_lock, which looks safe unless,
>>>> in the worst case, free_pages() gets stuck for some reason, in which case
>>>> we have a bigger memory problem to deal with than a soft lockup.
>>>
>>> Sorry, but I'm not happy with _any_ solution until we understand where
>>> and why we need to take the init_mm ptl, and have made some effort to
>>> ensure that the kernel correctly does so elsewhere. It is not sufficient
>>> to consider this code in isolation.
>>
>> We will have to take the kernel page table lock to avoid relying on
>> assumptions about the present or any future possible kernel VA space layout.
>> Wrapping the entire remove_pagetable() will be coarse-grained, but I don't
>> see why it should not be sufficient, at least for this particular tear-down
>> operation, regardless of how this might affect other kernel pgtable walkers.
>>
>> IIUC your concern is regarding other parts of the kernel code (arm64/generic)
>> which assume that the kernel page table won't be changing and hence normally
>> walk the table without holding the pgtable lock. Those current pgtable
>> walkers will be affected after this change.
>>
>>>
>>> IIUC, before this patch we never clear non-leaf entries in the kernel
>>> page tables, so readers don't presently need to take the ptl in order to
>>> safely walk down to a leaf entry.
>>
>> Got it. Will look into this.
>>
>>>
>>> For example, the arm64 ptdump code never takes the ptl, and as of this
>>> patch it will blow up if it races with a hot-remove, regardless of
>>> whether the hot-remove code itself holds the ptl.
>>
>> Got it. Are there more such examples where this can be problematic? I
>> will be happy to investigate all such places and change/add the locking
>> scheme there to make them work with memory hot remove.
>>
>>>
>>> Note that the same applies to the x86 ptdump code; we cannot assume that
>>> just because x86 does something that it happens to be correct.
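(For anyone following along: a ptdump-style walk boils down to roughly the
sketch below. This is a simplified illustration only, not the actual arm64
ptdump code; huge/section mappings and error handling are elided and the
function name is made up. Nothing in it takes the ptl, which is why such a
walk is only safe as long as non-leaf entries are never cleared and the
page-table pages they point to are never freed underneath it.)

	/*
	 * Sketch of a lockless kernel page-table walk (illustration only,
	 * not the actual ptdump implementation). No step below holds
	 * init_mm.page_table_lock.
	 */
	static void walk_kernel_leaf_entry(unsigned long addr)
	{
		pgd_t *pgdp, pgd;
		p4d_t *p4dp, p4d;
		pud_t *pudp, pud;
		pmd_t *pmdp, pmd;
		pte_t *ptep;

		pgdp = pgd_offset_k(addr);
		pgd = READ_ONCE(*pgdp);
		if (pgd_none(pgd) || pgd_bad(pgd))
			return;

		p4dp = p4d_offset(pgdp, addr);
		p4d = READ_ONCE(*p4dp);
		if (p4d_none(p4d) || p4d_bad(p4d))
			return;

		/*
		 * If a hot-remove cleared the non-leaf entries and freed the
		 * table pages under us, the dereferences below hit pages
		 * that may already have been reused.
		 */
		pudp = pud_offset(p4dp, addr);
		pud = READ_ONCE(*pudp);
		if (pud_none(pud) || pud_bad(pud))
			return;

		pmdp = pmd_offset(pudp, addr);
		pmd = READ_ONCE(*pmdp);
		if (pmd_none(pmd) || pmd_bad(pmd))
			return;

		ptep = pte_offset_kernel(pmdp, addr);
		pr_info("pte for 0x%lx: 0x%llx\n", addr,
			(unsigned long long)pte_val(READ_ONCE(*ptep)));
	}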
>>
>> I understand. Will also look into how other non-x86 platforms are dealing
>> with this.
>>
>>>
>>> I strongly suspect there are other cases that would fall afoul of this,
>>> in both arm64 and generic code.
>
> On X86
>
> - kernel_physical_mapping_init() takes the lock for pgtable page allocations
>   as well as for all leaf-level entries, including large mappings.
>
> On Power
>
> - remove_pagetable() takes an overall high-level init_mm.page_table_lock as I
>   had suggested before. __map_kernel_page() calls [pud|pmd|pte]_alloc_[kernel];
>   the page table page allocations are protected with init_mm.page_table_lock,
>   but the actual setting of the leaf entries is not (unlike x86).
>
> arch_add_memory()
>     create_section_mapping()
>         radix__create_section_mapping()
>             create_physical_mapping()
>                 __map_kernel_page()
>
> On arm64
>
> Both the kernel page table dump and the linear mapping (__create_pgd_mapping
> on init_mm) will require init_mm.page_table_lock to be really safe against
> this new memory hot remove code. I will do the necessary changes as part of
> this series next time around. IIUC there is no equivalent generic feature for
> ARM64_PTDUMP_CORE/DEBUGFS.
>
>
>> Will start looking into all such possible cases both on arm64 and generic.
>> Meanwhile, more such pointers would be really helpful.
>
> Generic usage of init_mm.page_table_lock
>
> Unless I have missed something, these are the generic init_mm kernel page
> table modifiers at runtime (at least the ones which use
> init_mm.page_table_lock):
>
> 1. ioremap_page_range()    /* Mapped I/O memory area */
> 2. apply_to_page_range()   /* Change existing kernel linear map */
> 3. vmap_page_range()       /* Vmalloc area */
>
> A. IOREMAP
>
> ioremap_page_range()
>     ioremap_p4d_range()
>         p4d_alloc()
>         ioremap_try_huge_p4d() -> p4d_set_huge() -> set_p4d()
>         ioremap_pud_range()
>             pud_alloc()
>             ioremap_try_huge_pud() -> pud_set_huge() -> set_pud()
>             ioremap_pmd_range()
>                 pmd_alloc()
>                 ioremap_try_huge_pmd() -> pmd_set_huge() -> set_pmd()
>                 ioremap_pte_range()
>                     pte_alloc_kernel()
>                     set_pte_at() -> set_pte()
>
> B. APPLY_TO_PAGE_RANGE
>
> apply_to_page_range()
>     apply_to_p4d_range()
>         p4d_alloc()
>         apply_to_pud_range()
>             pud_alloc()
>             apply_to_pmd_range()
>                 pmd_alloc()
>                 apply_to_pte_range()
>                     pte_alloc_kernel()
>
> C. VMAP_PAGE_RANGE
>
> vmap_page_range()
>     vmap_page_range_noflush()
>         vmap_p4d_range()
>             p4d_alloc()
>             vmap_pud_range()
>                 pud_alloc()
>                 vmap_pmd_range()
>                     pmd_alloc()
>                     vmap_pte_range()
>                         pte_alloc_kernel()
>                         set_pte_at()
>
> In all of the above:
>
> - Page table page allocations and their [p4d|pud|pmd|pte]_alloc_[kernel]
>   settings are protected with init_mm.page_table_lock.
> - Should they not require init_mm.page_table_lock for all leaf-level
>   (PUD|PMD|PTE) modifications as well?
> - Should this not require init_mm.page_table_lock for the page table walk
>   itself?
>
> Not taking an overall lock for all these three operations will potentially
> race with an ongoing memory hot remove operation, which takes an overall lock
> as proposed. Wondering if this has been safe till now?
>

All memory add/remove operations are currently guarded by mem_hotplug_lock as
far as I know.

-- 

Thanks,

David / dhildenb
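(For reference, that guarding sits in the generic hotplug paths; schematically
it looks roughly like the call flow below -- a sketch with argument lists and
error handling elided, not verbatim mainline code:)

add_memory() / add_memory_resource()
    mem_hotplug_begin()              <- takes mem_hotplug_lock for write
    arch_add_memory()
    mem_hotplug_done()

remove_memory()
    mem_hotplug_begin()              <- same lock, so hot-add and hot-remove
    arch_remove_memory()                never modify the kernel page tables
    mem_hotplug_done()                  concurrently with each other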