Hi, Mike, On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote: > On 10/30/22 17:29, Peter Xu wrote: > > Resolution > > ========== > > > > What this patch proposed is, besides using the vma lock, we can also use > > RCU to protect the pgtable page from being freed from under us when > > huge_pte_offset() is used. The idea is kind of similar to RCU fast-gup. > > Note that fast-gup is very safe regarding pmd unsharing even before vma > > lock, because fast-gup relies on RCU to protect walking any pgtable page, > > including another mm's. > > > > To apply the same idea to huge_pte_offset(), it means with proper RCU > > protection the pte_t* pointer returned from huge_pte_offset() can also be > > always safe to access and de-reference, along with the pgtable lock that > > was bound to the pgtable page. > > > > Patch Layout > > ============ > > > > Patch 1 is a trivial cleanup that I noticed when working on this. Please > > shoot if anyone think I should just post it separately, or hopefully I can > > still just carry it over. > > > > Patch 2 is the gut of the patchset, describing how we should use the helper > > huge_pte_offset() correctly. Only a comment patch but should be the most > > important one, as the follow up patches are just trying to follow the rule > > it setup here. > > > > The rest patches resolve all the call sites of huge_pte_offset() to make > > sure either it's with the vma lock (which is perfectly good enough for > > safety in this case; the last patch commented on all those callers to make > > sure we won't miss a single case, and why they're safe). Besides, each of > > the patch will add rcu protection to one caller of huge_pte_offset(). > > > > Tests > > ===== > > > > Only lightly tested on hugetlb kselftests including uffd, no more errors > > triggered than current mm-unstable (hugetlb-madvise fails before/after > > here, with error "Unexpected number of free huge pages line 207"; haven't > > really got time to look into it). > > Do not worry about the madvise test failure, that is caused by a recent > change. > > Unless I am missing something, the basic strategy in this series is to > wrap calls to huge_pte_offset and subsequent ptep access with > rcu_read_lock/unlock calls. I must embarrassingly admit that it has > been a loooong time since I had to look at rcu usage and may not know > what I am talking about. However, I seem to recall that one needs to > somehow flag the data items being protected from update/freeing. I > do not see anything like that in the huge_pmd_unshare routine where > pmd page pointer is updated. Or, is it where the pmd page pointer is > referenced in huge_pte_offset? Right. The RCU proposed here is trying to protect the pmd pgtable page that will normally be freed in rcu pattern. Please refer to tlb_remove_table_free() (which can be called from tlb_finish_mmu()) where it's released with RCU API: call_rcu(&batch->rcu, tlb_remove_table_rcu); I mentioned fast-gup just to refererence on the same usage as fast-gup has the same risk if without RCU or similar protections that is IPI-based, but I definitely can be even clearer, and I will enrich the cover letter in the next post. In short, my understanding is pgtable pages (including the shared PUD page for hugetlb) needs to be freed with caution because there can be softwares that are walking the pages with no locks. In our case, even though huge_pte_offset() is with the mmap lock, due to the pmd sharing it's not always having the same mmap lock as when the pgtable needs to be freed, so it's similar to having no lock here, imo. Then huge_pte_offset() needs to be protected just like what we do with fast-gup. Please also feel free to refer to the comment chunk at the start of asm-generic/tlb.h for more information on the mmu gather API. > > Please ignore if you are certain of this rcu usage, otherwise I will > spend some time reeducating myself. I'm not certain, and I'd like to get any form of comment. :) Sorry if this RFC version is confusing, but if it can try to at least explain what the problem we have and if we can agree on the problem first then that'll already be a step forward to me. So far that's more important than how we resolve it, using RCU or vma lock or anything else. For a non-rfc series, I think I need to be more careful on some details, e.g., the RCU protection for pgtable page is only used when the arch supports MMU_GATHER_RCU_TABLE_FREE. I thought that's always supported at least for pmd sharing enabled archs, but I'm actually wrong: arch/arm64/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) arch/riscv/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if 64BIT arch/x86/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE arch/arm/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE arch/arm64/Kconfig: select MMU_GATHER_RCU_TABLE_FREE arch/powerpc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE arch/s390/Kconfig: select MMU_GATHER_RCU_TABLE_FREE arch/sparc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if SMP arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE arch/x86/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT I think it means at least on RISCV RCU_TABLE_FREE is not enabled and we'll need to rely on the IPIs (e.g. I think we need to replace rcu_read_lock() with local_irq_disable() on RISCV only for what this patchset wanted to do). In the next version, I plan to add a helper, let's name it huge_pte_walker_lock() for now, and it should be one of the three options: - if !ARCH_WANT_HUGE_PMD_SHARE: it's no-op - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock() - else: it should be local_irq_disable() With that, I think we'll strictly follow what we have with fast-gup, at the meantime it should add zero overhead on archs that does not have pmd sharing. Hope above helps a bit on extending the missing pieces of the cover letter. Or again if anything missing I'd be more than glad to know.. Thanks, -- Peter Xu