On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote: > On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote: > > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote: > > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote: > > > > > > > > On 2024/1/26 2:06, Catalin Marinas wrote: > > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote: > > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary > > > > > > BBM(break-before-make) logic when changing page tables. > > > > > > This set of patches fix this by adding necessary BBM sequence when > > > > > > changing page table, and supporting vmemmap page fault handling to > > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed. > > > > > I'm not keen on this approach. I'm not even sure it's safe. In the > > > > > second patch, you take the init_mm.page_table_lock on the fault path but > > > > > are we sure this is unlocked when the fault was taken? > > > > I think this situation is impossible. In the implementation of the second > > > > patch, when the page table is being corrupted > > > > (the time window when a page fault may occur), vmemmap_update_pte() already > > > > holds the init_mm.page_table_lock, > > > > and unlock it until page table update is done.Another thread could not hold > > > > the init_mm.page_table_lock and > > > > also trigger a page fault at the same time. > > > > If I have missed any points in my thinking, please correct me. Thank you. > > > > > > It still strikes me as incredibly fragile to handle the fault and trying > > > to reason about all the users of 'struct page' is impossible. For example, > > > can the fault happen from irq context? > > > > The pte lock cannot be taken in irq context (which I think is what > > you're asking?) While it is not possible to reason about all users of > > struct page, we are somewhat relieved of that work by noting that this is > > only for hugetlbfs, so we don't need to reason about slab, page tables, > > netmem or zsmalloc. > > My concern is that an interrupt handler tries to access a 'struct page' > which faults due to another core splitting a pmd mapping for the vmemmap. > In this case, I think we'll end up trying to resolve the fault from irq > context, which will try to take the spinlock. I think that (as per my comments on patch 2), a similar deadlock can happen on RT even if the vmemmap is only accessed in regular process context, and at minimum this needs better comentary and/or lockdep assertions. I'd also prefer that we dropped this for now. > Avoiding the fault would make this considerably more robust and the > architecture has introduced features to avoid break-before-make in some > circumstances (see FEAT_BBM and its levels), so having this optimisation > conditional on that would seem to be a better approach in my opinion. FWIW, that's my position too. Mark.