On 11/04/22 11:02, Peter Xu wrote:
> Hi, Mike,
> 
> On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote:
> > On 10/30/22 17:29, Peter Xu wrote:
> > > Resolution
> > > ==========
> > > 
> > > What this patch proposes is, besides using the vma lock, we can also
> > > use RCU to protect the pgtable page from being freed from under us
> > > when huge_pte_offset() is used.  The idea is similar to RCU fast-gup.
> > > Note that fast-gup is very safe regarding pmd unsharing even before
> > > the vma lock, because fast-gup relies on RCU to protect walking any
> > > pgtable page, including another mm's.
> > > 
> > > To apply the same idea to huge_pte_offset(), it means that with
> > > proper RCU protection the pte_t* pointer returned from
> > > huge_pte_offset() is also always safe to access and de-reference,
> > > along with the pgtable lock that is bound to the pgtable page.
> > > 
> > > Patch Layout
> > > ============
> > > 
> > > Patch 1 is a trivial cleanup that I noticed when working on this.
> > > Please shout if anyone thinks I should post it separately, or
> > > hopefully I can still just carry it over.
> > > 
> > > Patch 2 is the gut of the patchset, describing how the helper
> > > huge_pte_offset() should be used correctly.  It is only a comment
> > > patch but should be the most important one, as the follow-up patches
> > > are just trying to follow the rule it sets up.
> > > 
> > > The remaining patches resolve all the call sites of huge_pte_offset()
> > > to make sure it is either called with the vma lock (which is
> > > perfectly good enough for safety in this case; the last patch
> > > comments on all those callers to make sure we won't miss a single
> > > case, and why they're safe), or otherwise each of these patches adds
> > > RCU protection to one caller of huge_pte_offset().
> > > 
> > > Tests
> > > =====
> > > 
> > > Only lightly tested on hugetlb kselftests including uffd; no more
> > > errors triggered than current mm-unstable (hugetlb-madvise fails
> > > before/after here, with error "Unexpected number of free huge pages
> > > line 207"; haven't really had time to look into it).
> > 
> > Do not worry about the madvise test failure, that is caused by a recent
> > change.
> > 
> > Unless I am missing something, the basic strategy in this series is to
> > wrap calls to huge_pte_offset and subsequent ptep access with
> > rcu_read_lock/unlock calls.  I must embarrassingly admit that it has
> > been a loooong time since I had to look at rcu usage and may not know
> > what I am talking about.  However, I seem to recall that one needs to
> > somehow flag the data items being protected from update/freeing.  I do
> > not see anything like that in the huge_pmd_unshare routine where the
> > pmd page pointer is updated.  Or, is it where the pmd page pointer is
> > referenced in huge_pte_offset?
> 
> Right.  The RCU proposed here is trying to protect the pmd pgtable page
> that will normally be freed in the RCU pattern.  Please refer to
> tlb_remove_table_free() (which can be called from tlb_finish_mmu()),
> where it's released with the RCU API:
> 
>   call_rcu(&batch->rcu, tlb_remove_table_rcu);

Thanks!  That is the piece of the puzzle I was missing.

> I mentioned fast-gup just to reference the same usage, as fast-gup has
> the same risk without RCU or a similar IPI-based protection, but I can
> definitely be clearer, and I will enrich the cover letter in the next
> post.
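
So if I follow, the read side of a caller ends up looking roughly like
the sketch below.  This is just my sketch of the pattern to check my
understanding, not code lifted from the series; the exact helpers,
locking and error handling of course vary per call site:

/* Sketch only: illustrates the RCU-protected walk being described. */
static pte_t hugetlb_read_pte_sketch(struct vm_area_struct *vma,
				     unsigned long addr)
{
	struct hstate *h = hstate_vma(vma);
	struct mm_struct *mm = vma->vm_mm;
	pte_t entry = __pte(0);
	spinlock_t *ptl;
	pte_t *ptep;

	/* On archs without MMU_GATHER_RCU_TABLE_FREE this would have to
	 * be local_irq_disable() instead -- see further down this mail. */
	rcu_read_lock();
	ptep = huge_pte_offset(mm, addr, huge_page_size(h));
	if (ptep) {
		/*
		 * On MMU_GATHER_RCU_TABLE_FREE archs the pgtable page
		 * that ptep points into is only freed via call_rcu()
		 * (tlb_remove_table_rcu), so it cannot go away while we
		 * are inside the RCU read section, even if another
		 * process unshares the PMD concurrently.
		 */
		ptl = huge_pte_lock(h, mm, ptep);
		entry = huge_ptep_get(ptep);
		spin_unlock(ptl);
	}
	rcu_read_unlock();

	return entry;
}

In other words, the RCU read section takes on the role that disabling
interrupts (to block the TLB-flush IPI) plays for fast-gup on the
non-RCU archs.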
> In short, my understanding is pgtable pages (including the shared PUD
> page for hugetlb) need to be freed with caution because there can be
> software walking the pages with no locks.  In our case, even though
> huge_pte_offset() runs with the mmap lock, due to pmd sharing it is not
> always the same mmap lock as the one held when the pgtable needs to be
> freed, so it's similar to having no lock here, imo.  Then
> huge_pte_offset() needs to be protected just like what we do with
> fast-gup.
> 
> Please also feel free to refer to the comment chunk at the start of
> asm-generic/tlb.h for more information on the mmu gather API.
> 
> > Please ignore if you are certain of this rcu usage, otherwise I will
> > spend some time reeducating myself.

Sorry for any misunderstanding.  I am very happy with the RFC and the
work you have done.  I was just missing the piece about rcu
synchronization when the page table was removed.

> I'm not certain, and I'd like to get any form of comment. :)
> 
> Sorry if this RFC version is confusing, but if it can at least explain
> the problem we have, and if we can agree on the problem first, then
> that's already a step forward for me.  So far that's more important than
> how we resolve it, using RCU or the vma lock or anything else.
> 
> For a non-rfc series, I think I need to be more careful on some details,
> e.g., the RCU protection for the pgtable page is only used when the arch
> supports MMU_GATHER_RCU_TABLE_FREE.  I thought that's always supported,
> at least for the archs with pmd sharing enabled, but I'm actually wrong:
> 
>   arch/arm64/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>   arch/riscv/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
>   arch/x86/Kconfig:   select ARCH_WANT_HUGE_PMD_SHARE
> 
>   arch/arm/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
>   arch/arm64/Kconfig:   select MMU_GATHER_RCU_TABLE_FREE
>   arch/powerpc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
>   arch/s390/Kconfig:    select MMU_GATHER_RCU_TABLE_FREE
>   arch/sparc/Kconfig:   select MMU_GATHER_RCU_TABLE_FREE if SMP
>   arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
>   arch/x86/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT
> 
> I think it means that at least on RISCV, RCU_TABLE_FREE is not enabled
> and we'll need to rely on the IPIs (e.g. I think we need to replace
> rcu_read_lock() with local_irq_disable() on RISCV only, for what this
> patchset wants to do).  In the next version, I plan to add a helper,
> let's name it huge_pte_walker_lock() for now, and it should be one of
> three options:
> 
>   - if !ARCH_WANT_HUGE_PMD_SHARE: it's a no-op
>   - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
>   - else: it should be local_irq_disable()
> 
> With that, I think we'll strictly follow what we have with fast-gup, and
> at the same time it should add zero overhead on archs that do not have
> pmd sharing.
> 
> Hope the above helps a bit in extending the missing pieces of the cover
> letter.  Or again, if anything is missing I'd be more than glad to know..

For reference, I have appended a rough sketch of what such a three-way
helper might look like at the end of this mail.
-- 
Mike Kravetz
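
For reference, below is a rough sketch of what the three-way
huge_pte_walker_lock() helper described above might look like.  The name
is only the provisional one from the discussion, and the Kconfig wiring
and comments are just an illustration of the three cases, not code from
the series:

#include <linux/rcupdate.h>
#include <linux/irqflags.h>

#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE

static inline void huge_pte_walker_lock(void)
{
#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
	/* Pgtable pages are freed via call_rcu(); an RCU read section is
	 * enough to keep a shared PMD page from being released. */
	rcu_read_lock();
#else
	/*
	 * Without RCU table free, freeing a pgtable page is serialized
	 * against lockless walkers by a TLB-flush IPI to all CPUs;
	 * disabling interrupts blocks that IPI, the same trick that
	 * fast-gup relies on.
	 */
	local_irq_disable();
#endif
}

static inline void huge_pte_walker_unlock(void)
{
#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
	rcu_read_unlock();
#else
	local_irq_enable();
#endif
}

#else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */

/*
 * No PMD sharing on this arch, so the extra protection is not needed and
 * the helper adds zero overhead.
 */
static inline void huge_pte_walker_lock(void) { }
static inline void huge_pte_walker_unlock(void) { }

#endif

Callers would then bracket huge_pte_offset() and the subsequent ptep
access with huge_pte_walker_lock()/unlock() instead of calling
rcu_read_lock() directly.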