On 08/27/22 19:29, Aneesh Kumar K.V wrote:
> Mike Kravetz <mike.kravetz@xxxxxxxxxx> writes:
> 
> >
> > On 08/25/22 09:25, David Hildenbrand wrote:
> >> > Is the primary concern the locking? If so, I am not sure we have an issue.
> >> > As mentioned in your commit message, current code will use
> >> > pte_offset_map_lock(). pte_offset_map_lock uses pte_lockptr, and pte_lockptr
> >> > will either be the mm wide lock or pmd_page lock. To me, it seems that
> >> > either would provide correct synchronization for CONT-PTE entries. Am I
> >> > missing something or misreading the code?
> >> > 
> >> > I started looking at code cleanup suggested by David. Here is a quick
> >> > patch (not tested and likely containing errors) to see if this is a step
> >> > in the right direction.
> >> > 
> >> > I like it because we get rid of/combine all those follow_huge_p*d
> >> > routines.
> >> > 
> >> 
> >> Yes, see comments below.
> >> 
> >> > From 35d117a707c1567ddf350554298697d40eace0d7 Mon Sep 17 00:00:00 2001
> >> > From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> >> > Date: Wed, 24 Aug 2022 15:59:15 -0700
> >> > Subject: [PATCH] hugetlb: call hugetlb_follow_page_mask for hugetlb pages in
> >> >  follow_page_mask
> >> > 
> >> > At the beginning of follow_page_mask, there currently is a call to
> >> > follow_huge_addr which 'may' handle hugetlb pages. ia64 is the only
> >> > architecture which (incorrectly) provides a follow_huge_addr routine
> >> > that does not return error. Instead, at each level of the page table a
> >> > check is made for a hugetlb entry. If a hugetlb entry is found, a call
> >> > to a routine associated with that page table level such as
> >> > follow_huge_pmd is made.
> >> > 
> >> > All the follow_huge_p*d routines are basically the same. In addition
> >> > huge page size can be derived from the vma, so we know where in the page
> >> > table a huge page would reside. So, replace follow_huge_addr with a
> >> > new architecture independent routine which will provide the same
> >> > functionality as the follow_huge_p*d routines. We can then eliminate
> >> > the p*d_huge checks in follow_page_mask page table walking as well as
> >> > the follow_huge_p*d routines themselves.
> >> > 
> >> > follow_page_mask still has is_hugepd hugetlb checks during page table
> >> > walking. This is due to these checks and follow_huge_pd being
> >> > architecture specific. These can be eliminated if
> >> > hugetlb_follow_page_mask can be overwritten by architectures (powerpc)
> >> > that need to do follow_huge_pd processing.
> >> 
> >> But won't the
> >> 
> >> > +	/* hugetlb is special */
> >> > +	if (is_vm_hugetlb_page(vma))
> >> > +		return hugetlb_follow_page_mask(vma, address, flags);
> >> 
> >> code route everything via hugetlb_follow_page_mask() and all these
> >> (beloved) hugepd checks would essentially be unreachable?
> >> 
> >> At least my understanding is that hugepd only applies to hugetlb.
> >> 
> >> Can't we move the hugepd handling code into hugetlb_follow_page_mask()
> >> as well?
> >> 
> >> I mean, doesn't follow_hugetlb_page() also have to handle that hugepd
> >> stuff already ... ?
> >> 
> >> [...]
> > 
> > I think so, but I got a little confused looking at the hugepd handling code.
> > Adding Aneesh who added support to follow_page_mask in the series at:
> > https://lore.kernel.org/linux-mm/1494926612-23928-1-git-send-email-aneesh.kumar@xxxxxxxxxxxxxxxxxx/
> > 
> > I believe you are correct in that follow_hugetlb_page must handle as well.
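(Side note, to make the routing David describes above concrete: the idea
is that the hugetlb check sits at the very top of follow_page_mask(),
before the generic page table walk.  A simplified sketch of that flow is
below.  It is illustrative only, not the actual diff, and the helper
names such as follow_p4d_mask(), no_page_table() and follow_page_context
are simply how I remember the current mm/gup.c code.)

/*
 * Illustrative sketch only -- not the actual diff.  With the hugetlb
 * check first, hugetlb VMAs never enter the generic walk, so the
 * p*d_huge/is_hugepd checks further down can only see non-hugetlb
 * mappings and can eventually be removed.
 */
static struct page *follow_page_mask(struct vm_area_struct *vma,
				     unsigned long address, unsigned int flags,
				     struct follow_page_context *ctx)
{
	pgd_t *pgd;

	ctx->page_mask = 0;

	/* hugetlb is special */
	if (is_vm_hugetlb_page(vma))
		return hugetlb_follow_page_mask(vma, address, flags);

	pgd = pgd_offset(vma->vm_mm, address);
	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
		return no_page_table(vma, flags);

	/* generic walk for everything that is not hugetlb */
	return follow_p4d_mask(vma, address, pgd, flags, ctx);
}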
> > One source of my confusion is the following in follow_huge_pd:
> > 
> > 	/*
> > 	 * hugepage directory entries are protected by mm->page_table_lock
> > 	 * Use this instead of huge_pte_lockptr
> > 	 */
> > 	ptl = &mm->page_table_lock;
> > 	spin_lock(ptl);
> > 
> > Yet, if follow_hugetlb_page handles hugepd then it is using huge_pte_lockptr
> > to get the lock pointer and is wrong?
> > 
> > Hoping Aneesh can help clear up the confusion.
> >
> I agree it is all confusing. At some point, the goal was to teach
> generic kernel page table walking code about hugepd entries. But looking
> at this again and considering we only have hugepd entries for hugetlb,
> may be the effort is not worth the complexity it adds.
> 
> ie, instead of teaching generic page table walk about different hugetlb
> page table layouts we special case using is_vm_hugetlb_page(vma)
> wherever we can.

Thanks for your comments Aneesh.  I give David credit for suggesting
that it would be simpler to just special case hugetlb mappings here.

Based on your comments, I believe an arch independent
'hugetlb_follow_page_mask' routine would handle all cases and we can
remove the 'if (p*d_huge)' blocks and 'if (is_hugepd)' blocks of code
from follow_page_mask.  Such a routine would call huge_pte_offset(),
which can be/is overridden by arch specific code.  In fact, the powerpc
version of this already handles hugepd entries.  A rough, untested
sketch of what I have in mind is appended at the end of this mail.

> With respect to huge_pte_lockptr, it is tricky (hugepd entries are not
> PMD_SIZE)
> 
> static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> 					   struct mm_struct *mm, pte_t *pte)
> {
> 	if (huge_page_size(h) == PMD_SIZE)
> 		return pmd_lockptr(mm, (pmd_t *) pte);
> 	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> 	return &mm->page_table_lock;
> }

Ok, so I think you confirmed that huge_pte_lockptr would work for hugepd
entries as they are never PMD_SIZE.

I will be sure to cc you on the proposed changes.

Thanks,
-- 
Mike Kravetz
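P.S. Here is the rough, untested sketch mentioned above.  It is only
meant to show the direction, not the final patch: the page table lookup
is delegated to huge_pte_offset(), which architectures (powerpc for
hugepd) already override, and the entry is examined under the lock
returned by huge_pte_lock()/huge_pte_lockptr().  The helpers used
(hstate_vma, huge_ptep_get, try_grab_page, ...) are just the existing
ones as I remember them.

/*
 * Rough, untested sketch -- not the actual patch.
 */
struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
				      unsigned long address, unsigned int flags)
{
	struct hstate *h = hstate_vma(vma);
	struct mm_struct *mm = vma->vm_mm;
	unsigned long haddr = address & huge_page_mask(h);
	struct page *page = NULL;
	spinlock_t *ptl;
	pte_t *pte, entry;

	/* arch code (e.g. powerpc hugepd) hides its layout behind this */
	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (!pte)
		return NULL;

	/* takes either the pmd split lock or mm->page_table_lock */
	ptl = huge_pte_lock(h, mm, pte);
	entry = huge_ptep_get(pte);
	if (pte_present(entry)) {
		/* return the subpage for the faulting address */
		page = pte_page(entry) +
		       ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
		if (!try_grab_page(page, flags))
			page = NULL;
	}
	spin_unlock(ptl);
	return page;
}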