On Wed, Apr 24, 2019 at 11:29:28AM +0530, Anshuman Khandual wrote:
> On 04/23/2019 09:35 PM, Mark Rutland wrote:
> > On Tue, Apr 23, 2019 at 01:01:58PM +0530, Anshuman Khandual wrote:
> >> Generic usage for init_mm.page_table_lock
> >>
> >> Unless I have missed something else, these are the generic init_mm
> >> kernel page table modifiers at runtime (at least those which use
> >> init_mm.page_table_lock):
> >>
> >> 1. ioremap_page_range()  /* Mapped I/O memory area */
> >> 2. apply_to_page_range() /* Change existing kernel linear map */
> >> 3. vmap_page_range()     /* Vmalloc area */
> >
> > Internally, those all use the __p??_alloc() functions to handle racy
> > additions by transiently taking the PTL when installing a new table,
> > but otherwise walk kernel tables _without_ the PTL held. Note that
> > none of these ever free an intermediate level of table.
>
> Right, they don't free intermediate level page tables, but I was
> curious only about the leaf level modifications.

Sure thing; I just wanted to point that out explicitly for everyone
else's benefit. :)

> > I believe that the idea is that operations on separate VMAs should
> > never
>
> I guess you meant a kernel virtual range with 'VMA', not the actual
> VMA, which is a vm_area_struct and applicable only to user space, not
> the kernel.

Sure. In the kernel we'd reserve a kernel VA range with a vm_struct via
get_vm_area() or similar.

The key point is that we reserve page-granular VA ranges which cannot
overlap at the leaf level (but may share intermediate levels of table).
Whoever owns that area is in charge of the necessary mutual exclusion
for the leaf entries.

> > conflict at the leaf level, and operations on the same VMA should be
> > serialised somehow w.r.t. that VMA.
>
> AFAICT there is nothing other than the hotplug lock, i.e.
> mem_hotplug_lock, which prevents concurrent init_mm modifications, and
> the current situation is only safe because somehow these VA areas
> don't overlap with respect to intermediate page table level spans.

Here I was ignoring hotplug to describe the general principle (which
I've expanded upon above).

> > AFAICT, these functions are _never_ called on the linear/direct map
> > or vmemmap VA ranges, and whether or not these can conflict with
> > hot-remove is entirely dependent on whether those ranges can share a
> > level of table with the vmalloc region.
>
> Right, but all these VA ranges (linear, vmemmap, vmalloc) are wired in
> on init_mm, hence I am wondering whether it is prudent to assume a
> layout scheme which varies a lot across architectures while deciding
> on possible race protections.

One thing to consider is that we could turn this implicit assumption
into a requirement, if this isn't too invasive.

> Wondering why these users should not call [get|put]_online_mems() to
> prevent a race with hotplug.

I suspect that this is because they were written before memory hotplug
was added, and they were never reconsidered in the presence of
hot-remove.

> Will try this out.
>
> Unless generic MM expects the layout of these VA ranges (linear,
> vmemmap, vmalloc) to be arranged in a certain manner by the platform,
> guaranteeing non-overlap at intermediate page table level spans. Only
> then would we not need a lock.

I think that we might be able to make this a requirement for
hot-remove. I suspect this happens to be the case in practice by
chance, even if it isn't strictly guaranteed.
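For example, the below (untested, and the helper name is made up) is
the sort of enforced requirement I'm thinking of. It only considers
the linear map against the vmalloc region, and only at PGD granularity
(the coarsest level that could be shared); a stricter version would
also consider the vmemmap region and the P4D/PUD/PMD spans:

#include <linux/mm.h>
#include <asm/pgtable.h>

static bool hotremove_spans_overlap_vmalloc(phys_addr_t start, u64 size)
{
	unsigned long va = (unsigned long)__va(start);
	unsigned long lo = ALIGN_DOWN(va, PGDIR_SIZE);
	unsigned long hi = ALIGN(va + size, PGDIR_SIZE);

	/* A shared PGD entry implies a shared intermediate table. */
	return lo < VMALLOC_END && hi > VMALLOC_START;
}

Hot-remove could then simply refuse any range for which this returns
true, turning today's happy accident into an enforced invariant.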
> > Do you know how likely that is to occur? e.g. what proportion of the
> > vmalloc region may share a level of table with the linear or vmemmap
> > regions in a typical arm64 or x86 configuration? Can we deliberately
> > provoke this failure case?
>
> TBH I don't know.
>
> I have not enumerated those yet, but there are multiple configs on
> arm64, and probably on x86, which decide the kernel VA space layout
> and create these potential races. But regardless, it's not right to
> rely on assumptions about the vmalloc range span rather than take a
> lock.
>
> Not sure how to provoke this failure case from user space with simple
> hotplug, because vmalloc physical allocation normally cannot be
> controlled without a hacked kernel change.

I believe that we can write a module (rough skeleton at the end of this
mail) which:

- Looks at the various memory ranges, and determines whether they may
  share an intermediate level of table.

- reserves a potentially-conflicting region with __get_vm_area_node()

- in a loop, maps/unmaps something in that range

... while in parallel, adding/removing a potentially-conflicting region
of memory.

So long as we have the same sort of serialization we'd have for a
regular vmalloc(), ioremap() or vmap(), that would be sufficient to
demonstrate that this is a real problem.

[...]

> > Is it possible to avoid these specific conflicts (ignoring ptdump)
> > by aligning VA regions such that they cannot share intermediate
> > levels of table?
>
> Kernel VA space layout is platform specific, and core MM does not
> mandate much about it. Hence generic modifiers should not make any
> assumptions regarding it, but should protect themselves with locks.
> Doing anything other than that is just pushing the problem into the
> future.

My point is that we can make this a _requirement_ of the core code,
which we could document and enforce.

Thanks,
Mark.
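The module skeleton I have in mind is roughly the below, using the
current map_vm_area()/unmap_kernel_range() helpers. It is untested,
and it cheats in two places: __get_vm_area() (a thin wrapper around
the __get_vm_area_node() mentioned above) isn't exported to modules,
so the experiment needs a small kernel hack or an open-coded
equivalent, and the conflict_start/conflict_end parameters stand in
for a range derived from the layout as per the first step. The
parallel hot-add/hot-remove would be driven separately, e.g. via the
memory sysfs interface.

#include <linux/module.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/kthread.h>
#include <linux/delay.h>

/*
 * Chosen via step one: a vmalloc-space range whose intermediate table
 * span can also cover a hot-(un)pluggable region.
 */
static unsigned long conflict_start;
module_param(conflict_start, ulong, 0400);
static unsigned long conflict_end;
module_param(conflict_end, ulong, 0400);

static struct vm_struct *area;
static struct page *page;
static struct task_struct *tsk;

static int vmap_race_thread(void *unused)
{
	struct page *pages[] = { page };

	while (!kthread_should_stop()) {
		/*
		 * Install a leaf entry (allocating intermediate tables
		 * as required), then tear it down again, racing with
		 * concurrent hot-add/hot-remove of conflicting memory.
		 */
		if (map_vm_area(area, PAGE_KERNEL, pages)) {
			msleep(20);
			continue;
		}
		unmap_kernel_range((unsigned long)area->addr, PAGE_SIZE);
		cond_resched();
	}

	return 0;
}

static int __init vmap_race_init(void)
{
	page = alloc_page(GFP_KERNEL);
	if (!page)
		return -ENOMEM;

	/* Not exported today; see the caveat above. */
	area = __get_vm_area(PAGE_SIZE, VM_MAP, conflict_start, conflict_end);
	if (!area)
		goto err_page;

	tsk = kthread_run(vmap_race_thread, NULL, "vmap_race");
	if (IS_ERR(tsk))
		goto err_area;

	return 0;

err_area:
	free_vm_area(area);
err_page:
	__free_page(page);
	return -ENOMEM;
}

static void __exit vmap_race_exit(void)
{
	kthread_stop(tsk);
	free_vm_area(area);
	__free_page(page);
}

module_init(vmap_race_init);
module_exit(vmap_race_exit);
MODULE_LICENSE("GPL");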